Date of Award
5-2026
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Electrical and Computer Engineering
Committee Chair/Advisor
Dr. Tao Wei
Committee Member
Dr. Fatemeh Afghah
Committee Member
Dr. Rong Ge
Committee Member
Dr. Judson Ryckman
Abstract
High-performance computing (HPC) is changing rapidly as scientific simulations and large language model (LLM) workloads demand higher performance under tight power and memory constraints. Conventional platforms such as CPUs and GPUs accelerate computation through instruction-driven parallelism (multithreading, SIMD, and SIMT execution) but increasingly run into the scalability limits imposed by the power and memory walls. In contrast, Field-Programmable Gate Arrays (FPGAs) and Neural Processing Units (NPUs) offer a high-efficiency alternative through dataflow-oriented architectures that exploit deep pipelining and customized memory hierarchies to reduce data movement. However, the performance potential of these spatial accelerators remains largely unrealized when traditional, control-flow-centric algorithms are mapped onto them directly.
This dissertation addresses this gap by developing domain-specific algorithm and dataflow designs tailored for FPGAs and NPUs, demonstrating that hardware–software co-design is essential for achieving high performance and energy efficiency on modern accelerators. Two representative and challenging applications are studied: electromagnetic simulation using the Finite-Difference Time-Domain (FDTD) method and on-device inference for large language models. FDTD simulations are critical for the design of photonic integrated circuits but are computationally intensive, while on-device LLM inference requires low latency and low power consumption.
For FDTD, this work introduces a time-pipelined computation scheme that significantly reduces data movement and enables scalable execution across FPGA networks and reconfigurable accelerators, substantially shortening photonic design cycles. For large language models, it demonstrates how reorganizing computation and dataflow allows NPUs to process long sequences efficiently, achieving lower latency and energy consumption than existing approaches.
Recommended Citation
Yu, Miaoxiang, "Domain-Specific Design on FPGA and NPU for scientific computing and On-Device LLM Inference" (2026). All Dissertations. 4205.
https://open.clemson.edu/all_dissertations/4205
Author ORCID Identifier
0000-0002-4382-9009