Date of Award

5-2026

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Electrical and Computer Engineering (Holcomb Dept. of)

Committee Chair/Advisor

Tao Wei

Committee Member

Xiaoyong Yuan

Committee Member

Fatemeh Afghah

Abstract

This thesis presents a system that allows large AI models to run directly on personal devices instead of relying on cloud servers. Recent advances in artificial intelligence, especially large language models (LLMs), have made it possible to build powerful applications such as chatbots, coding assistants, and intelligent agents. However, most of these systems run in the cloud, which raises concerns about privacy, latency, and cost.

To address these issues, this work develops a local AI serving system that runs efficiently on a specialized hardware component called a Neural Processing Unit (NPU). The system provides a unified interface that supports multiple types of tasks, including text generation, image understanding, speech recognition, and embedding-based search. It also supports advanced features such as streaming responses and tool calling, which are essential for building modern AI agents.

The system is designed to be compatible with widely used APIs, allowing existing applications to use it without modification. Experimental results show that the system works correctly across different tasks and improves efficiency through techniques such as prompt caching.

Overall, this work demonstrates that it is possible to run advanced AI systems locally in a practical and efficient way, enabling faster, more private, and more flexible AI applications.
