Date of Award
12-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Automotive Engineering
Committee Member
Dr. Bing Li
Committee Member
Dr. Rahul Rai
Committee Member
Dr. Siyu Huang
Committee Member
Dr. Federico Iuricich
Abstract
The rapid progress of 3D computer vision has enabled a wide range of applications in autonomous driving, robotics, and augmented reality. Despite this growth, training robust 3D perception models remains challenging due to limited labeled data, the complexity of integrating multiple modalities, and the inherently imbalanced and long-tailed nature of 3D datasets. This dissertation addresses these challenges by proposing data-efficient, multi-modal learning frameworks that improve the accuracy, generalization, and scalability of 3D scene understanding.
In the semi-supervised setting, this work presents novel approaches that combine limited annotations with large amounts of unlabeled data to enhance 3D object classification and retrieval. A key contribution is the Multimodal Contrastive Prototype (M2CP) loss, which encourages discriminative and modality-consistent feature representations. Additionally, instance-level consistency constraints ensure robustness across varying input modalities. These methods outperform existing semi-supervised baselines on standard benchmarks, demonstrating improved label efficiency and generalization.
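The abstract does not give the exact form of the M2CP objective; the following is a minimal sketch, assuming a shared set of learnable class prototypes toward which L2-normalized features from each modality are pulled via a softmax over prototype similarities. The modality names (point cloud and image), the temperature, and the random-tensor usage example are illustrative assumptions, not the dissertation's implementation.

import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feat_pc, feat_img, labels, prototypes, temperature=0.1):
    # feat_pc, feat_img: (B, D) features from two modalities of the same objects.
    # labels:            (B,) class labels (ground truth or pseudo-labels).
    # prototypes:        (C, D) learnable class prototypes shared by all modalities.
    protos = F.normalize(prototypes, dim=-1)
    losses = []
    for feat in (feat_pc, feat_img):
        feat = F.normalize(feat, dim=-1)
        logits = feat @ protos.t() / temperature          # similarity to every prototype
        losses.append(F.cross_entropy(logits, labels))    # pull toward the class prototype
    return sum(losses) / len(losses)

# Illustrative usage with random tensors (shapes only).
B, D, C = 8, 256, 40
prototypes = torch.nn.Parameter(torch.randn(C, D))
loss = prototype_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                  torch.randint(0, C, (B,)), prototypes)
loss.backward()

An instance-level consistency term can be added in the same spirit by penalizing disagreement between the per-modality prototype distributions of the same object.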
To further reduce reliance on labeled data, this dissertation explores label-free 3D semantic segmentation by utilizing pseudo-labels from vision-language foundation models. To address label noise and class imbalance, a geometric-guided noise separation strategy and a prototype-based contrastive framework are introduced, resulting in notable improvements in segmentation accuracy across both indoor and outdoor scenes.
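The abstract does not specify the noise-separation criterion; the sketch below shows one plausible geometric-guided test, assuming a pseudo-label is treated as noisy when it disagrees with the majority of its k nearest spatial neighbors. The neighborhood size and agreement threshold are illustrative assumptions.

import torch

def geometric_noise_mask(points, pseudo_labels, k=8, agree_ratio=0.6):
    # points:        (N, 3) point coordinates of one scene.
    # pseudo_labels: (N,) labels projected from a vision-language foundation model.
    # Returns a boolean mask: True = kept as clean, False = treated as noisy.
    dists = torch.cdist(points, points)                     # (N, N); use a KD-tree at scale
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # k neighbors, excluding the point itself
    neighbor_labels = pseudo_labels[knn]                    # (N, k)
    agree = (neighbor_labels == pseudo_labels.unsqueeze(1)).float().mean(dim=1)
    return agree >= agree_ratio

# Illustrative usage on random data.
pts = torch.rand(1024, 3)
labels = torch.randint(0, 20, (1024,))
clean = geometric_noise_mask(pts, labels)

Clean points can then supervise a prototype-based contrastive objective of the kind sketched above, while noisy points are down-weighted or re-labeled.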
For self-supervised learning, this work proposes Bridge3D, a novel framework that leverages semantic masks, image captions, and pre-trained 2D features to guide 3D scene representation learning. A foreground-aware masking strategy enhances semantic focus, leading to stronger 3D representations. This approach achieves state-of-the-art results on multiple downstream tasks, including semantic segmentation and object detection.
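The abstract states that masking is foreground-aware but not how the bias is applied; the sketch below simply biases per-token masking probabilities by a foreground score derived from projected 2D semantic masks. The direction of the bias and the two ratios are illustrative assumptions.

import torch

def foreground_aware_mask(fg_score, mask_ratio_fg=0.8, mask_ratio_bg=0.4):
    # fg_score: (T,) in [0, 1], e.g. the fraction of a point token's members that
    #           fall inside projected 2D semantic (foreground) masks.
    # Returns a boolean mask: True = token is hidden and must be reconstructed.
    mask_prob = mask_ratio_bg + (mask_ratio_fg - mask_ratio_bg) * fg_score
    return torch.bernoulli(mask_prob).bool()

# Illustrative usage: 64 point tokens with random foreground scores.
scores = torch.rand(64)
masked = foreground_aware_mask(scores)
visible = (~masked).nonzero(as_tuple=True)[0]   # indices fed to the encoder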
A two-stage masked token prediction framework is further proposed to align features across modalities. It incorporates semantic masks from segmentation models and applies group-balanced reweighting to mitigate long-tailed class distributions.
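The abstract names group-balanced reweighting without detailing it; a minimal sketch follows, assuming classes are sorted by frequency, split into equal-sized groups (e.g. head / medium / tail), and weighted by the inverse of each group's mean count. The grouping rule and normalization are assumptions.

import torch
import torch.nn.functional as F

def group_balanced_weights(class_counts, num_groups=3):
    # class_counts: (C,) number of (pseudo-)labeled tokens observed per class.
    order = torch.argsort(class_counts, descending=True)
    weights = torch.empty(len(class_counts))
    for group in order.chunk(num_groups):                    # head / medium / tail groups
        group_mean = class_counts[group].float().mean().clamp(min=1.0)
        weights[group] = 1.0 / group_mean                    # one shared weight per group
    return weights / weights.mean()                          # normalize so the average weight is 1

# Illustrative usage inside a masked-token prediction loss.
C = 20
counts = torch.randint(1, 1000, (C,))
w = group_balanced_weights(counts)
logits, targets = torch.randn(128, C), torch.randint(0, C, (128,))
loss = F.cross_entropy(logits, targets, weight=w)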
Finally, the dissertation introduces a 3D geometric representation learning method based on multi-view masked autoencoding. By projecting 3D point clouds into feature-level 2D views, the method captures rich geometric and semantic context without requiring additional supervision. A multi-scale, multi-head attention mechanism further strengthens representation learning, enabling more robust geometric understanding.
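The abstract describes projecting point clouds into feature-level 2D views but not the projection itself; the sketch below orthographically splats per-point features into a small grid for each view direction as one way to realize such views. The grid size, view rotations, and average pooling are illustrative assumptions, and the multi-scale, multi-head attention decoder is omitted.

import torch

def project_to_view(points, feats, rotation, grid=32):
    # points:   (N, 3) coordinates, assumed roughly normalized to [-1, 1].
    # feats:    (N, D) per-point features.
    # rotation: (3, 3) rotation matrix defining the viewing direction.
    # Returns a (D, grid, grid) feature image with mean pooling per cell.
    xy = (points @ rotation.t())[:, :2]                      # drop view-space depth
    ij = ((xy.clamp(-1, 1) + 1) / 2 * (grid - 1)).long()     # cell indices
    flat = ij[:, 0] * grid + ij[:, 1]                        # flattened cell id per point
    D = feats.shape[1]
    view = torch.zeros(grid * grid, D).index_add_(0, flat, feats)
    count = torch.zeros(grid * grid).index_add_(0, flat, torch.ones(len(flat)))
    return (view / count.clamp(min=1).unsqueeze(1)).t().reshape(D, grid, grid)

# Illustrative usage: three axis-aligned views of one point cloud.
pts, f = torch.rand(2048, 3) * 2 - 1, torch.randn(2048, 64)
eye = torch.eye(3)
views = [project_to_view(pts, f, R) for R in (eye, eye[[1, 2, 0]], eye[[2, 0, 1]])]

Masked regions of these view images can then serve as reconstruction targets for the multi-view masked autoencoder.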
Together, these contributions advance the field of 3D computer vision by addressing key limitations in data efficiency and multi-modal learning. The proposed frameworks offer practical and scalable solutions for real-world deployment in autonomous systems and intelligent robotics.
Recommended Citation
Chen, Zhimin, "Multi-modal Data-efficient Learning for 3D Machine Vision" (2025). All Dissertations. 4117.
https://open.clemson.edu/all_dissertations/4117