Date of Award

5-2026

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Engineering

Committee Chair/Advisor

Adam Hoover

Committee Member

Jon Calhoun

Committee Member

Feng Luo

Committee Member

Lan Zhang

Abstract

This dissertation describes methods for analyzing lengthy data recordings in order to detect sparsely occurring activities. The narrative below describes the progression of research that led to these methods and their generalization into a unified framework. My research started with designing models for dietary monitoring, including detecting meals from day-long recordings and detecting intake gestures from meal-length recordings. Both tasks share two characteristics: (a) the target event occupies only a small portion of each recording, and (b) the full-length recording contains global context that can help a model make better decisions. After completing these projects and publishing the results, we formulated a general problem definition for this group of tasks and a general methodological pipeline that leverages global context within full-length recordings to improve model performance. We define sparse events as events that are either temporally brief or infrequent in the data. Traditional methods for detecting sparse events are inherited from the general time-sequence analysis pipeline and rely on sliding-window classifiers. This pipeline slices recordings into short windows and classifies each window as an independent sample. Most existing work focuses on extracting richer representations from each window, leading to a variety of models based on CNNs, RNNs, and transformers. However, in detecting meals from day-long recordings and eating gestures from meal-length recordings, we found that window-based classifiers struggle with false positives and limited context, particularly when short segments are analyzed in isolation. These limitations stem from the scarcity of positive samples and the temporal ambiguity between events and background activity.
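As a rough illustration of the sliding-window pipeline described above, the following minimal sketch slices a recording into fixed-length windows and classifies each one independently. The function names, the window and stride values, and the mean-magnitude threshold rule are all assumptions made for illustration; the threshold rule stands in for a learned classifier and is not the method used in the dissertation.

```python
from typing import List, Sequence, Tuple


def slice_windows(recording: Sequence[float], window: int,
                  stride: int) -> List[Tuple[int, Sequence[float]]]:
    """Cut a recording into fixed-length windows as (start_index, samples) pairs."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [(start, recording[start:start + window])
            for start in range(0, len(recording) - window + 1, stride)]


def classify_window(samples: Sequence[float], threshold: float = 0.5) -> bool:
    """Hypothetical placeholder classifier: flag a window whose mean
    absolute value exceeds the threshold. A real system would use a
    trained model here."""
    return sum(abs(x) for x in samples) / len(samples) > threshold


def detect(recording: Sequence[float], window: int, stride: int,
           threshold: float = 0.5) -> List[Tuple[int, bool]]:
    """Classify every window on its own, with no access to its neighbors."""
    return [(start, classify_window(samples, threshold))
            for start, samples in slice_windows(recording, window, stride)]


# A 20-sample recording with one short burst of activity in the middle.
recording = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
decisions = detect(recording, window=4, stride=4)
```

Because each window is judged in isolation, any short burst that resembles the target event is flagged regardless of what the rest of the recording looks like.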
These inherent weaknesses of window-based classifiers motivated us to explore the use of broader recording-level context for event detection. A few researchers have noted the advantages of jointly analyzing neighboring windows, but such approaches are constrained by increased computational demands and the need for larger datasets. When multiple windows are treated as a single sample, models must be scaled up to handle larger inputs, and each recording yields fewer samples than in single-window approaches. The proposed unified framework for sparse-event detection achieves global-pattern modeling on full-length recordings while remaining efficient in both data usage and computation. The core idea is to combine a local feature encoder, which compresses window-based data into smaller vectors, with a global detector, which captures long-range dependencies and recording-level contextual patterns. To address the challenge of limited data, we propose a novel augmentation method that generates synthetic global patterns and improves the generalization capacity of global detectors. Chapters 1-3 incrementally introduce my work on developing a framework for two specific tasks: detecting meals and detecting intake gestures. Each case begins with data preparation and observation, followed by methodological refinements tailored to the dataset characteristics. Chapter 1 presents a global-local detection model for eating-episode detection from full-day wrist-motion data and shows that modeling daily patterns reduces false positives and improves generalizability. Chapter 2 introduces a video dataset tailored to sparse intake-gesture detection, which forms the foundation for intake-gesture recognition in free-living environments. Chapter 3 presents a global-local detection model applied to meal-length videos and demonstrates that leveraging full-meal context improves performance, particularly precision and training stability.
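The combination of a local feature encoder and a global detector described above can be sketched in a few lines. This is a toy sketch under stated assumptions, not the dissertation's model: `encode_window`, `global_detect`, the two-value feature vector, the neighbor smoothing, and the `margin` value are all hypothetical stand-ins for learned networks. The example only shows how a detector that sees the whole sequence of window features can suppress an isolated burst that a per-window threshold would flag.

```python
from statistics import mean, median
from typing import List, Sequence


def encode_window(samples: Sequence[float]) -> List[float]:
    """Local encoder stand-in: compress one window into [mean, energy]."""
    return [mean(samples), mean(x * x for x in samples)]


def global_detect(features: List[List[float]], context: int = 1,
                  margin: float = 0.4) -> List[bool]:
    """Global detector stand-in: score each window against the recording.

    The energy of each window is averaged with its neighbors and compared
    with the recording-wide median energy (assumed to represent background).
    """
    energies = [f[1] for f in features]
    baseline = median(energies)
    decisions = []
    for i in range(len(energies)):
        lo = max(0, i - context)
        hi = min(len(energies), i + context + 1)
        decisions.append(mean(energies[lo:hi]) - baseline > margin)
    return decisions


def detect_events(recording: Sequence[float], window: int) -> List[bool]:
    """Encode non-overlapping windows, then decide using the full sequence."""
    windows = [recording[s:s + window]
               for s in range(0, len(recording) - window + 1, window)]
    return global_detect([encode_window(w) for w in windows])


# One sustained event (8 samples) and one isolated 2-sample burst.
recording = ([0.0] * 12 + [1.0] * 8 + [0.0] * 12
             + [1.0, 1.0, 0.0, 0.0] + [0.0] * 4)
decisions = detect_events(recording, window=4)
```

In this example the window containing the isolated burst has an energy of 0.5, which exceeds the 0.4 margin on its own, yet it is not flagged once its neighbors and the recording-wide baseline are taken into account. Only the two windows of the sustained event are detected.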
In Chapter 4, we evaluate our global-pattern analysis framework under a more systematic and scalable structure. We apply the same global-context idea to two additional case studies: sleep-stage detection from EEG signals and speech/music detection from TV-show audio. These studies extend the framework across modalities (vision, motion, biology, and audio), recording durations (from 30 minutes to 24 hours), and dataset sizes (from 300 to 2000 recordings). We summarize a standard framework, examine performance gains across different conditions, and explore both the potential and the limitations of the framework.

Author ORCID Identifier

0000-0002-2747-4615
