Date of Award

5-2026

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering (Holcomb Dept. of)

Committee Chair/Advisor

Fatemeh Afghah

Committee Member

Melissa Crawley Smith

Committee Member

Abolfazl Razi

Committee Member

Xiaolong Ma

Committee Member

Tao Wei

Abstract

Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have recently emerged as powerful frameworks for learning joint representations across visual and textual modalities. These models enable a wide range of applications, including visual recognition, multimodal reasoning, and visual question answering. However, adapting large pre-trained VLMs to downstream tasks while preserving their strong generalization ability remains a significant challenge, particularly under domain shifts or limited supervision. This dissertation focuses on developing methods for generalizable adaptation of VLMs, aiming to improve robustness, efficiency, and applicability across diverse tasks and environments.

First, this work introduces novel prompt learning strategies for adapting CLIP-style VLMs while maintaining strong out-of-distribution generalization. In particular, we propose two frameworks, Style-Pro and DiSa, which address domain bias and overfitting in prompt learning. Style-Pro incorporates a style-guided prompt learning mechanism that synthesizes diverse style representations to reduce discrepancies between training and unseen domains. DiSa introduces directional saliency-aware regularization that enhances cross-modal alignment and encourages the model to focus on semantically important visual regions, improving robustness in limited-data settings.

Second, this dissertation investigates efficient and generalizable multimodal in-context learning (ICL). To address the high inference cost and instability of demonstration-based prompting in MLLMs, we propose Hyper-ICL, a lightweight framework that reconstructs ICL behavior through attention-level adaptation. Hyper-ICL decomposes the effects of demonstrations within the attention mechanism and introduces query-adaptive modulation together with hyperbolic anchor distillation, enabling compact task representation while preserving multimodal reasoning capability.

Finally, the dissertation demonstrates the practical impact of VLMs in real-world applications. Two representative domains are explored. The first is wildfire monitoring, where a new benchmark dataset called WildFireVQA is introduced to evaluate multimodal reasoning using synchronized RGB and radiometric thermal aerial imagery. The benchmark enables systematic evaluation of multimodal models for tasks such as fire detection, hotspot localization, and environmental analysis. The second application focuses on industrial visual anomaly detection. To address limitations of existing approaches, the proposed Qwen-AD framework introduces a modular adaptation strategy using task-specialized LoRA experts and a dynamic gating mechanism for multi-task anomaly understanding in MLLMs.
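As a rough illustration of the general idea behind gated LoRA experts, and not the Qwen-AD implementation itself, the sketch below adds several low-rank updates to one frozen linear layer and mixes them with a softmax gate computed from the input. The class name `GatedLoRALinear`, the dense softmax gate, the dimensions, and the zero initialization of the `B` factors are all assumptions chosen for illustration; the abstract does not specify these details.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class GatedLoRALinear:
    """Hypothetical frozen linear layer with input-gated LoRA experts."""

    def __init__(self, d_in, d_out, rank, n_experts, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)  # frozen pre-trained weight
        # Expert k contributes the low-rank update B[k] @ A[k].
        self.A = [rand(rank, d_in) for _ in range(n_experts)]
        # B starts at zero (assumed, common LoRA practice), so the
        # adapted layer initially matches the frozen layer.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_experts)]
        self.G = rand(n_experts, d_in)  # gate projection (assumed form)

    def gate(self, x):
        """Per-input mixing weights over the experts."""
        return softmax(matvec(self.G, x))

    def forward(self, x):
        """y = W x + sum_k g_k(x) * B_k (A_k x)."""
        y = matvec(self.W, x)
        for k, weight in enumerate(self.gate(x)):
            delta = matvec(self.B[k], matvec(self.A[k], x))
            y = [yi + weight * di for yi, di in zip(y, delta)]
        return y


layer = GatedLoRALinear(d_in=4, d_out=3, rank=2, n_experts=2)
x = [1.0, -0.5, 0.25, 2.0]
print(layer.gate(x))     # mixing weights, one per expert
print(layer.forward(x))  # adapted layer output
```

In a setup like this, only the `A`, `B`, and gate parameters would be trained per task while `W` stays frozen, which is what keeps such an adaptation modular.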

Extensive experiments across multiple benchmarks demonstrate that the proposed methods significantly improve generalization, efficiency, and robustness in vision-language learning. Overall, this dissertation advances the development of adaptable multimodal systems capable of operating reliably across diverse domains and real-world scenarios.

Author ORCID Identifier

0009-0000-6881-3671
