Date of Award

12-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

School of Computing

Committee Chair/Advisor

Abolfazl Razi

Committee Member

Kai Liu

Committee Member

Nianyi Li

Committee Member

Fatemeh Afghah

Abstract

Machine learning has become central to modern computational intelligence, driving breakthroughs across science, engineering, and daily life. The volume and complexity of real-world data have grown exponentially, fueled by advances in sensing, simulation, and digital interaction. While this massive data accumulation offers unprecedented learning opportunities and enables powerful models, it also strains computational and storage resources and creates significant challenges in computational efficiency, data redundancy, and generalization. These challenges are especially evident in domains such as biomedical imaging, remote sensing, autonomous systems, and large-scale language and vision models, where learning effectively under limited resources has become a critical bottleneck. In such settings, understanding and leveraging diversity is essential for building learning systems that are both scalable and resource-efficient.

This dissertation develops a unified theoretical and algorithmic framework for diversity-aware machine learning under resource constraints. It considers two complementary perspectives: intra-set diversity, which captures variability within a selected subset, and set-to-distribution diversity, which measures how well a subset represents the overall data distribution. These perspectives are connected through three foundational theories: Rate–Distortion (RD) theory, Determinantal Point Processes (DPPs), and Optimal Transport (OT), which the dissertation shows to be deeply related. It first establishes a quantitative link between RD and DPPs, demonstrating that both describe the trade-off between information efficiency and representational diversity. It further shows that solving OT under Gaussian assumptions leads to a submodular optimization form equivalent to a DPP-like kernel objective, thereby unifying probabilistic diversity and geometric representation within a common mathematical framework.
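
The dissertation's exact formulations are not reproduced in this abstract. For background only, the textbook objects behind the three theories can be written compactly; the sketch below uses standard definitions (Gaussian source for RD, a PSD kernel L for the DPP, and the Gaussian closed form of OT), not the dissertation's specific constructions:

```latex
% Illustrative textbook forms, not the dissertation's formulations.
% Rate--distortion function of a Gaussian source with variance \sigma^2
% under squared-error distortion D:
\[
  R(D) = \max\!\left\{ \tfrac{1}{2}\log_2 \frac{\sigma^2}{D},\; 0 \right\}
\]
% DPP over a ground set with PSD kernel L: the probability of a subset S
% grows with the volume (diversity) spanned by its items:
\[
  \Pr(S) \;\propto\; \det(L_S), \qquad L_S = [L_{ij}]_{i,j \in S}
\]
% Squared 2-Wasserstein (OT) distance between Gaussians, the closed form
% available under Gaussian assumptions:
\[
  W_2^2\big(\mathcal{N}(\mu_1,\Sigma_1),\, \mathcal{N}(\mu_2,\Sigma_2)\big)
  = \|\mu_1 - \mu_2\|_2^2
  + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2
    - 2\big(\Sigma_1^{1/2}\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big)
\]
```

Plausibly, it is this Gaussian closed form that makes OT analytically tractable enough to be connected to log-determinant (DPP-style) objectives, though the precise reduction is developed in the dissertation itself.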

Building on this foundation, the dissertation proposes several algorithms that embody these principles across diverse domains, including weakly supervised histopathology, time series analysis, and reinforcement-learning post-training of large language models (LLMs). It also introduces two OT-based methods that extend diversity to the distributional level, for vision-language model (VLM) adaptation and for VLM pruning, respectively. Theoretical analyses accompany most of the proposed algorithms, establishing their fundamental limits and providing rigorous support. Together, these studies show that diversity, both within and across distributions, serves as a unifying principle for resource-efficient and generalizable machine learning.
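
The abstract does not specify the proposed algorithms, but the shared primitive behind intra-set diversity maximization is the greedy log-determinant (DPP MAP-style) subset selection common in this literature. The following is a minimal, self-contained sketch of that primitive under an assumed linear similarity kernel; it is illustrative only, not any of the dissertation's methods:

```python
import numpy as np


def greedy_diverse_subset(features: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` rows of `features` that approximately
    maximize log det(L_S), the standard log-determinant measure of
    intra-set diversity for a PSD similarity kernel L."""
    # Linear similarity kernel; any PSD kernel could be substituted.
    L = features @ features.T
    n = L.shape[0]
    # A small ridge keeps every principal submatrix positive definite.
    L = L + 1e-6 * np.eye(n)

    selected: list[int] = []
    remaining = set(range(n))
    for _ in range(min(budget, n)):
        best_j, best_val = -1, -np.inf
        for j in remaining:
            idx = selected + [j]
            # log det(L_S) is constant over candidates this round, so
            # maximizing log det(L_{S ∪ {j}}) maximizes the marginal gain.
            _, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if logdet > best_val:
                best_j, best_val = j, logdet
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


# Tiny usage example on synthetic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))  # 200 candidate items, 32-d features
print(greedy_diverse_subset(X, budget=10))
```

Because log det(L_S + I)-style objectives are monotone submodular, this greedy procedure carries the usual (1 − 1/e)-type approximation guarantees, which is likely why the dissertation emphasizes the submodular form of its OT-derived objective.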

Author ORCID Identifier

https://orcid.org/0000-0002-8006-4383

Available for download on Thursday, December 31, 2026
