Date of Award
8-2024
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Healthcare Genetics
Committee Chair/Advisor
Dr. Sara Sarasua
Committee Member
Dr. Liangjiang Wang (Co-Chair)
Committee Member
Dr. Brian Dean
Committee Member
Dr. Kathleen Valentine
Abstract
The intricate interplay of genetic predisposition, environmental influences, and lifestyle shapes the multifactorial landscape of disease. Understanding this complexity presents a significant challenge. Molecular insights into disease mechanisms, particularly the interactions of DNA, RNA, and proteins with environmental and lifestyle factors, have revolutionized disease diagnosis, prognosis, and treatment. High-throughput technologies, such as next-generation sequencing, generate large amounts of molecular data that hold a wealth of knowledge. When analyzed, these datasets reveal the roles of genes and their interactions with various factors, shedding light on previously unknown molecular mechanisms underlying disease pathogenesis. They also facilitate the discovery of biomarkers crucial for diagnosis, treatment, and targeted therapy. Next-generation sequencing has limitations, however: potential biases and technical errors during experiments and data collection can introduce inaccuracies and noise. To address these limitations, robust genomic data science approaches are leveraged to unravel complex patterns hidden within omics data. This study focuses on elucidating the functions of human genes responsible for co-morbid phenotypes and on how non-coding genes, such as long non-coding RNAs, help maintain the genomic integrity of cells through DNA damage response mechanisms.
Phelan-McDermid syndrome (PMS) is a rare neurodevelopmental disorder caused by a deletion at the terminal end of chromosome 22 (the 22q13 region). Diagnosis of PMS presents significant challenges due to phenotypic heterogeneity and the limited experience of medical practitioners with such rare disorders. To address this challenge and better understand the complex phenotypes associated with PMS, we present a genomic data mining approach using weighted gene co-expression network analysis (WGCNA). This approach uses gene expression data to identify candidate genes in the 22q13 region and functionally annotate them with respect to the five neurological phenotypes observed in PMS patients.
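As an illustration of the network-construction step at the core of WGCNA, the minimal sketch below builds an unsigned co-expression adjacency by raising absolute pairwise Pearson correlations to a soft-thresholding power. It is not the dissertation's pipeline: the function names, the toy expression matrix, and the power beta = 6 are assumptions chosen only for illustration, and the later WGCNA steps (topological overlap, module detection, trait association) are omitted.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned co-expression adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    expr is a (samples x genes) matrix; beta is an assumed soft-threshold power.
    """
    corr = np.corrcoef(expr, rowvar=False)  # gene-by-gene Pearson correlation
    adj = np.abs(corr) ** beta              # soft thresholding suppresses weak correlations
    np.fill_diagonal(adj, 0.0)              # drop self-connections (a simplification)
    return adj

def connectivity(adj):
    """Sum of connection strengths for each gene."""
    return adj.sum(axis=1)

# Toy data (hypothetical): genes 0 and 1 share a signal, gene 2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
expr = np.column_stack([
    base + 0.1 * rng.normal(size=50),
    base + 0.1 * rng.normal(size=50),
    rng.normal(size=50),
])
adj = soft_threshold_adjacency(expr)
k = connectivity(adj)
```

In this toy example the two co-expressed genes end up strongly connected while the independent gene is nearly disconnected, which is the property that lets co-expression modules be linked to phenotypes.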
Besides functionally annotating genes to understand their role in disease, it is also imperative to identify the mechanisms that lead to genomic instability and, in turn, to disease-causing gene mutations. The DNA damage response (DDR) is crucial for maintaining the genomic stability of the cell against external and internal threats. The DNA inside the cell is constantly exposed to damage, which the DDR senses and repairs through complex and intricate pathways. Identifying genes involved in DDR is therefore pivotal for understanding immunodeficiency caused by genomic instability, and it aids disease diagnosis and therapeutics. DNA damage is also an important player in the etiology of cancer. We have developed a machine learning approach, PredDDR, to predict DDR genes in cancer by applying feature selection to gene expression data from The Cancer Genome Atlas (TCGA). We believe the PredDDR model could also be applied to identify non-coding genes associated with DDR.
The role of protein-coding genes in DDR has been well elucidated. Recent studies have highlighted the role of non-coding RNAs, especially long non-coding RNAs (lncRNAs), in regulating the DDR pathways. Traditional experimental approaches face limitations in studying lncRNAs because of their biogenesis and functional characteristics. Building on the success of PredDDR in predicting DDR genes, we developed another machine learning approach, lncDDR, to predict lncRNAs associated with DDR genes. We used an unsupervised representation learning technique, the autoencoder, to extract relevant and meaningful biological features from TCGA transcriptome data to train lncDDR. We believe this genomic data science approach will help bridge the gap in understanding the role of lncRNAs in DDR for the diagnosis and therapeutics of cancer. However, the predicted candidate DDR genes and DDR-associated lncRNAs need to be confirmed through laboratory experiments. Future studies should use patient data and molecular biology techniques to validate the functional role of these predicted DDR genes and DDR-associated lncRNAs in cancer.
Recommended Citation
Shah, Snehal, "Genomic Data Science Approaches for Understanding Human Diseases" (2024). All Dissertations. 3740.
https://open.clemson.edu/all_dissertations/3740
Author ORCID Identifier
https://orcid.org/0000-0002-8003-1573
Included in
Artificial Intelligence and Robotics Commons, Behavior and Behavior Mechanisms Commons, Cognitive Neuroscience Commons, Communication Sciences and Disorders Commons, Computational Biology Commons, Computational Neuroscience Commons, Congenital, Hereditary, and Neonatal Diseases and Abnormalities Commons, Databases and Information Systems Commons, Data Science Commons, Developmental Neuroscience Commons, Genetics Commons, Genomics Commons, Immune System Diseases Commons, Mental Disorders Commons, Molecular Genetics Commons, Nervous System Diseases Commons, Statistics and Probability Commons