Date of Award

5-2026

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering

Committee Chair/Advisor

Dr. Tao Wei

Committee Member

Dr. Fatemeh Afghah

Committee Member

Dr. Rong Ge

Committee Member

Dr. Judson Ryckman

Abstract

High-performance computing (HPC) is changing rapidly as scientific simulations and large language model (LLM) workloads drive demand for higher performance under tight power and memory constraints. Conventional platforms such as CPUs and GPUs accelerate computation through instruction-driven parallelism, relying on multithreading, SIMD, and SIMT execution, but increasingly encounter scalability limits imposed by the power and memory walls. In contrast, Field-Programmable Gate Arrays (FPGAs) and Neural Processing Units (NPUs) offer a high-efficiency alternative through dataflow-oriented architectures that exploit deep pipelining and customized memory hierarchies to reduce data movement. However, the performance potential of these spatial accelerators remains largely unrealized when traditional, control-flow-centric algorithms are directly mapped onto them.

This dissertation addresses this gap by developing domain-specific algorithm and dataflow designs tailored for FPGAs and NPUs, demonstrating that hardware–software co-design is essential for achieving high performance and energy efficiency on modern accelerators. Two representative and challenging applications are studied: electromagnetic simulation using the Finite-Difference Time-Domain (FDTD) method and on-device inference for large language models. FDTD simulations are critical for the design of photonic integrated circuits but are computationally intensive, while on-device LLM inference requires low latency and low power consumption.

For FDTD, this work introduces a time-pipelined computation that significantly reduces data movement and enables scalable execution across FPGA networks and reconfigurable accelerators, substantially shortening photonic design cycles. For large language models, it demonstrates how reorganizing computation and dataflow allows NPUs to process long sequences efficiently, achieving lower latency and energy consumption than existing approaches.

Author ORCID Identifier

0000-0002-4382-9009
