Date of Award
12-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Automotive Engineering
Committee Member
Dr. Bing Li
Committee Member
Dr. Rahul Rai
Committee Member
Dr. Siyu Huang
Committee Member
Dr. Federico Iuricich
Abstract
The rapid progress of 3D computer vision has enabled a wide range of applications in autonomous driving, robotics, and augmented reality. Despite this growth, training robust 3D perception models remains challenging due to limited labeled data, the complexity of integrating multiple modalities, and the inherently imbalanced and long-tailed nature of 3D datasets. This dissertation addresses these challenges by proposing data-efficient, multi-modal learning frameworks that improve the accuracy, generalization, and scalability of 3D scene understanding.
In the semi-supervised setting, this work presents novel approaches that combine limited annotations with large amounts of unlabeled data to enhance 3D object classification and retrieval. A key contribution is the Multimodal Contrastive Prototype (M2CP) loss, which encourages discriminative and modality-consistent feature representations. Additionally, instance-level consistency constraints ensure robustness across varying input modalities. These methods outperform existing semi-supervised baselines on standard benchmarks, demonstrating improved label efficiency and generalization.
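The abstract does not give the exact form of the M2CP objective; the following is a minimal sketch, assuming a shared set of learnable class prototypes toward which L2-normalized features from each modality are pulled via a softmax over prototype similarities. The modality names (point cloud and image), the temperature, and the random-tensor usage example are illustrative assumptions, not the dissertation's implementation.

import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feat_pc, feat_img, labels, prototypes, temperature=0.1):
    # feat_pc, feat_img: (B, D) features from two modalities of the same objects.
    # labels:            (B,) class labels (ground truth or pseudo-labels).
    # prototypes:        (C, D) learnable class prototypes shared by all modalities.
    protos = F.normalize(prototypes, dim=-1)
    losses = []
    for feat in (feat_pc, feat_img):
        feat = F.normalize(feat, dim=-1)
        logits = feat @ protos.t() / temperature          # similarity to every prototype
        losses.append(F.cross_entropy(logits, labels))    # pull toward the class prototype
    return sum(losses) / len(losses)

# Illustrative usage with random tensors (shapes only).
B, D, C = 8, 256, 40
prototypes = torch.nn.Parameter(torch.randn(C, D))
loss = prototype_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                  torch.randint(0, C, (B,)), prototypes)
loss.backward()

An instance-level consistency term can be added in the same spirit by penalizing disagreement between the per-modality prototype distributions of the same object.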
To further reduce reliance on labeled data, this dissertation explores label-free 3D semantic segmentation by utilizing pseudo-labels from vision-language foundation models. To address label noise and class imbalance, a geometric-guided noise separation strategy and a prototype-based contrastive framework are introduced, resulting in notable improvements in segmentation accuracy across both indoor and outdoor scenes.
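The abstract does not specify the noise-separation criterion; the sketch below shows one plausible geometric-guided test, assuming a pseudo-label is treated as noisy when it disagrees with the majority of its k nearest spatial neighbors. The neighborhood size and agreement threshold are illustrative assumptions.

import torch

def geometric_noise_mask(points, pseudo_labels, k=8, agree_ratio=0.6):
    # points:        (N, 3) point coordinates of one scene.
    # pseudo_labels: (N,) labels projected from a vision-language foundation model.
    # Returns a boolean mask: True = kept as clean, False = treated as noisy.
    dists = torch.cdist(points, points)                     # (N, N); use a KD-tree at scale
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # k neighbors, excluding the point itself
    neighbor_labels = pseudo_labels[knn]                    # (N, k)
    agree = (neighbor_labels == pseudo_labels.unsqueeze(1)).float().mean(dim=1)
    return agree >= agree_ratio

# Illustrative usage on random data.
pts = torch.rand(1024, 3)
labels = torch.randint(0, 20, (1024,))
clean = geometric_noise_mask(pts, labels)

Clean points can then supervise a prototype-based contrastive objective of the kind sketched above, while noisy points are down-weighted or re-labeled.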
For self-supervised learning, this work proposes Bridge3D, a novel framework that leverages semantic masks, image captions, and pre-trained 2D features to guide 3D scene representation learning. A foreground-aware masking strategy enhances semantic focus, leading to stronger 3D representations. This approach achieves state-of-the-art results on multiple downstream tasks, including semantic segmentation and object detection.
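The abstract states that masking is foreground-aware but not how the bias is applied; the sketch below simply biases per-token masking probabilities by a foreground score derived from projected 2D semantic masks. The direction of the bias and the two ratios are illustrative assumptions.

import torch

def foreground_aware_mask(fg_score, mask_ratio_fg=0.8, mask_ratio_bg=0.4):
    # fg_score: (T,) in [0, 1], e.g. the fraction of a point token's members that
    #           fall inside projected 2D semantic (foreground) masks.
    # Returns a boolean mask: True = token is hidden and must be reconstructed.
    mask_prob = mask_ratio_bg + (mask_ratio_fg - mask_ratio_bg) * fg_score
    return torch.bernoulli(mask_prob).bool()

# Illustrative usage: 64 point tokens with random foreground scores.
scores = torch.rand(64)
masked = foreground_aware_mask(scores)
visible = (~masked).nonzero(as_tuple=True)[0]   # indices fed to the encoder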
A two-stage masked token prediction framework is further proposed to align features across modalities. It incorporates semantic masks from segmentation models and applies group-balanced reweighting to mitigate long-tailed class distributions.
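The abstract names group-balanced reweighting without detailing it; a minimal sketch follows, assuming classes are sorted by frequency, split into equal-sized groups (e.g. head / medium / tail), and weighted by the inverse of each group's mean count. The grouping rule and normalization are assumptions.

import torch
import torch.nn.functional as F

def group_balanced_weights(class_counts, num_groups=3):
    # class_counts: (C,) number of (pseudo-)labeled tokens observed per class.
    order = torch.argsort(class_counts, descending=True)
    weights = torch.empty(len(class_counts))
    for group in order.chunk(num_groups):                    # head / medium / tail groups
        group_mean = class_counts[group].float().mean().clamp(min=1.0)
        weights[group] = 1.0 / group_mean                    # one shared weight per group
    return weights / weights.mean()                          # normalize so the average weight is 1

# Illustrative usage inside a masked-token prediction loss.
C = 20
counts = torch.randint(1, 1000, (C,))
w = group_balanced_weights(counts)
logits, targets = torch.randn(128, C), torch.randint(0, C, (128,))
loss = F.cross_entropy(logits, targets, weight=w)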
Finally, the dissertation introduces a 3D geometric representation learning method based on multi-view masked autoencoding. By projecting 3D point clouds into feature-level 2D views, the method captures rich geometric and semantic context without requiring additional supervision. A multi-scale, multi-head attention mechanism further strengthens representation learning, enabling more robust geometric understanding.
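The abstract describes projecting point clouds into feature-level 2D views but not the projection itself; the sketch below orthographically splats per-point features into a small grid for each view direction as one way to realize such views. The grid size, view rotations, and average pooling are illustrative assumptions, and the multi-scale, multi-head attention decoder is omitted.

import torch

def project_to_view(points, feats, rotation, grid=32):
    # points:   (N, 3) coordinates, assumed roughly normalized to [-1, 1].
    # feats:    (N, D) per-point features.
    # rotation: (3, 3) rotation matrix defining the viewing direction.
    # Returns a (D, grid, grid) feature image with mean pooling per cell.
    xy = (points @ rotation.t())[:, :2]                      # drop view-space depth
    ij = ((xy.clamp(-1, 1) + 1) / 2 * (grid - 1)).long()     # cell indices
    flat = ij[:, 0] * grid + ij[:, 1]                        # flattened cell id per point
    D = feats.shape[1]
    view = torch.zeros(grid * grid, D).index_add_(0, flat, feats)
    count = torch.zeros(grid * grid).index_add_(0, flat, torch.ones(len(flat)))
    return (view / count.clamp(min=1).unsqueeze(1)).t().reshape(D, grid, grid)

# Illustrative usage: three axis-aligned views of one point cloud.
pts, f = torch.rand(2048, 3) * 2 - 1, torch.randn(2048, 64)
eye = torch.eye(3)
views = [project_to_view(pts, f, R) for R in (eye, eye[[1, 2, 0]], eye[[2, 0, 1]])]

Masked regions of these view images can then serve as reconstruction targets for the multi-view masked autoencoder.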
Together, these contributions advance the field of 3D computer vision by addressing key limitations in data efficiency and multi-modal learning. The proposed frameworks offer practical and scalable solutions for real-world deployment in autonomous systems and intelligent robotics.
Recommended Citation
Chen, Zhimin, "Multi-modal Data-efficient Learning for 3D Machine Vision" (2025). All Dissertations. 4117.
https://open.clemson.edu/all_dissertations/4117