Date of Award
5-2026
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Electrical and Computer Engineering (Holcomb Dept. of)
Committee Chair/Advisor
Fatemeh Afghah
Committee Member
Melissa Crawley Smith
Committee Member
Abolfazl Razi
Committee Member
Xiaolong Ma
Committee Member
Tao Wei
Abstract
Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have recently emerged as powerful frameworks for learning joint representations across visual and textual modalities. These models enable a wide range of applications, including visual recognition, multimodal reasoning, and visual question answering. However, adapting large pre-trained VLMs to downstream tasks while preserving their strong generalization ability remains a significant challenge, particularly under domain shifts or limited supervision. This dissertation focuses on developing methods for generalizable adaptation of VLMs, aiming to improve robustness, efficiency, and applicability across diverse tasks and environments.
First, this work introduces novel prompt learning strategies for adapting CLIP-style VLMs while maintaining strong out-of-distribution generalization. In particular, we propose two frameworks, Style-Pro and DiSa, which address domain bias and overfitting in prompt learning. Style-Pro incorporates a style-guided prompt learning mechanism that synthesizes diverse style representations to reduce discrepancies between training and unseen domains. DiSa introduces directional saliency-aware regularization that enhances cross-modal alignment and encourages the model to focus on semantically important visual regions, improving robustness in limited-data settings.
Second, this dissertation investigates efficient and generalizable multimodal in-context learning (ICL). To address the high inference cost and instability of demonstration-based prompting in MLLMs, we propose Hyper-ICL, a lightweight framework that reconstructs ICL behavior through attention-level adaptation. Hyper-ICL decomposes the effects of demonstrations within the attention mechanism and introduces query-adaptive modulation together with hyperbolic anchor distillation, enabling compact task representation while preserving multimodal reasoning capability.
Finally, the dissertation demonstrates the practical impact of VLMs in real-world applications. Two representative domains are explored. The first is wildfire monitoring, where a new benchmark dataset called WildFireVQA is introduced to evaluate multimodal reasoning using synchronized RGB and radiometric thermal aerial imagery. The benchmark enables systematic evaluation of multimodal models for tasks such as fire detection, hotspot localization, and environmental analysis. The second application focuses on industrial visual anomaly detection. To address limitations of existing approaches, the proposed Qwen-AD framework introduces a modular adaptation strategy using task-specialized LoRA experts and a dynamic gating mechanism for multi-task anomaly understanding in MLLMs.
Extensive experiments across multiple benchmarks demonstrate that the proposed methods significantly improve generalization, efficiency, and robustness in vision-language learning. Overall, this dissertation advances the development of adaptable multimodal systems capable of operating reliably across diverse domains and real-world scenarios.
Recommended Citation
Alipour Talemi, Niloufar, "Generalizable Adaptation for Vision-Language Models" (2026). All Dissertations. 4236.
https://open.clemson.edu/all_dissertations/4236
Author ORCID Identifier
0009-0000-6881-3671