Date of Award
5-2026
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Engineering
Committee Chair/Advisor
Adam Hoover
Committee Member
Jon Calhoun
Committee Member
Feng Luo
Committee Member
Lan Zhang
Abstract
This dissertation describes methods to analyze lengthy recordings of data in order to detect sparsely occurring activities. The narrative below describes the progression of research that led to the development of these methods and their generalization into a unified framework. My research started with designing models for dietary monitoring, including detecting meals from day-long recordings and detecting intake gestures from meal-length recordings. Both tasks share two characteristics: (a) the target event occupies only a small portion of each recording, and (b) the full-length recording contains global context that can help a model make better decisions. After completing these projects and publishing the results, we formulated a general problem definition for this group of tasks and a general methodological pipeline that leverages global context within full-length recordings to improve model performance. We define sparse events as events that are either temporally brief or infrequent in the data. Traditional methods for detecting sparse events are inherited from the general time-series analysis pipeline and rely on sliding-window classifiers. This pipeline slices recordings into short windows and classifies each window as an independent sample. Most existing work focuses on extracting richer representations from each window, leading to a variety of models based on CNNs, RNNs, and transformers. However, through our exploration of detecting meals from day-long recordings and detecting eating gestures from meal-length recordings, we found that window-based classifiers struggle with false positives and limited context, particularly when only short segments are analyzed in isolation. These limitations stem from the scarcity of positive samples and the temporal ambiguity between events and background activity.
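The sliding-window pipeline described above, and the scarcity of positive samples it produces, can be illustrated with a minimal sketch. The sampling rate, window length, stride, and meal times below are hypothetical values chosen for illustration; they are not taken from the dissertation's datasets, and the function names are invented for this example.

```python
def slice_windows(n_samples, window, stride):
    """Return (start, end) index pairs for fixed-length sliding windows."""
    return [(s, s + window) for s in range(0, n_samples - window + 1, stride)]


def label_windows(windows, events):
    """Label a window 1 if it overlaps any (start, end) event interval, else 0."""
    labels = []
    for w_start, w_end in windows:
        hit = any(w_start < e_end and e_start < w_end for e_start, e_end in events)
        labels.append(1 if hit else 0)
    return labels


# Assumption: a 24-hour recording sampled once per second.
n_samples = 24 * 60 * 60

# Assumption: two meals, 15 and 20 minutes long, given as sample-index intervals.
events = [(8 * 3600, 8 * 3600 + 900), (19 * 3600, 19 * 3600 + 1200)]

# Assumption: non-overlapping one-minute windows.
windows = slice_windows(n_samples, window=60, stride=60)
labels = label_windows(windows, events)

# Fraction of windows that contain any part of a target event.
positive_fraction = sum(labels) / len(labels)
print(len(windows), sum(labels), round(positive_fraction, 4))
```

With these assumed values, 35 of the 1,440 windows are positive, about 2.4% of the recording, which is the kind of class imbalance a per-window classifier has to cope with.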
These inherent weaknesses of window-based classifiers motivated us to explore the potential of including broader recording-level context in event detection. A few researchers have noted the advantages of jointly analyzing neighboring windows, but such approaches are constrained by increased computational demands and the need for larger datasets. When multiple windows are treated as a single sample, models must be scaled up to handle larger input sizes, and each recording yields fewer samples than in single-window approaches. The proposed unified framework for sparse-event detection achieves global-pattern modeling on full-length recordings while maintaining efficiency in both data usage and computation. The core idea is to combine a local feature encoder, which compresses window-based data into smaller vectors, with a global detector, which captures long-range dependencies and recording-level contextual patterns. To address the challenge of limited data, we propose a novel augmentation method that generates synthetic global patterns and improves the generalization capacity of global detectors. Chapters 1-3 incrementally introduce my work on developing a framework for two specific tasks: detecting meals and detecting intake gestures. Each case begins with data preparation and observation, followed by methodological refinements tailored to the dataset characteristics. Chapter 1 presents a global-local detection model for eating-episode detection from full-day wrist-motion data and shows that modeling daily patterns reduces false positives and improves generalizability. Chapter 2 introduces a video dataset tailored to sparse intake-gesture detection, which forms the foundation for intake-gesture recognition in free-living environments. Chapter 3 presents a global-local detection model applied to meal-length videos and demonstrates that leveraging full-meal context improves performance, particularly precision and training stability.
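The two-stage structure described above (a local encoder that compresses each window, followed by a global pass over the whole recording) can be sketched as follows. This is only an illustration of the data flow, not the dissertation's models: both stages here are hand-written stand-ins (summary statistics and neighbor averaging) rather than learned networks, and every name, signal value, and threshold is a hypothetical choice.

```python
from statistics import mean, pstdev


def encode_window(samples):
    """Stand-in local encoder: compress raw samples into a 2-value vector."""
    return (mean(samples), pstdev(samples))


def encode_recording(signal, window):
    """Apply the local encoder to consecutive non-overlapping windows."""
    return [encode_window(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]


def global_detect(features, context, threshold):
    """Stand-in global detector: average each window's activity feature
    with its neighbors on both sides, then threshold the result."""
    activity = [f[1] for f in features]
    decisions = []
    for i in range(len(activity)):
        lo = max(0, i - context)
        hi = min(len(activity), i + context + 1)
        decisions.append(1 if mean(activity[lo:hi]) > threshold else 0)
    return decisions


# Assumption: a toy recording of 20 windows with 4 samples each.
quiet = [0, 0, 0, 0]
active = [1, -1, 1, -1]
# Window 3 is an isolated burst; windows 10-14 form a sustained event.
signal = []
for w in range(20):
    signal += active if (w == 3 or 10 <= w <= 14) else quiet

features = encode_recording(signal, window=4)

# Per-window decision with no context, for comparison.
independent = [1 if f[1] > 0.5 else 0 for f in features]
# Decision that also looks at one neighbor on each side.
with_context = global_detect(features, context=1, threshold=0.5)

print(independent)
print(with_context)
```

In this toy case the per-window rule flags the isolated burst at window 3, while the context-aware pass keeps only the sustained event at windows 10-14. A learned global detector would replace the neighbor averaging with a model operating on the full sequence of encoded vectors.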
In Chapter 4, we evaluate our global pattern analysis framework under a more systematic and scalable structure. We apply the same global-context idea to two additional case studies: sleep-stage detection from EEG signals and speech/music detection from TV-show audio. These studies extend the framework across modalities (vision, motion, biology, and audio), recording durations (from 30 minutes to 24 hours), and dataset sizes (from 300 to 2000 recordings). We summarize a standard framework, examine performance gains across different conditions, and explore both the potential and the limitations of the framework.
Recommended Citation
Tang, Zeyu, "Learning Global Context for Sparse Activity Recognition in Lengthy Recordings with Limited Dataset Size" (2026). All Dissertations. 4232.
https://open.clemson.edu/all_dissertations/4232
Author ORCID Identifier
0000-0002-2747-4615
Included in
Artificial Intelligence and Robotics Commons, Computer Engineering Commons, Signal Processing Commons, Theory and Algorithms Commons