Date of Award

8-2019

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Mathematical Sciences

Committee Member

William Bridges, Committee Chair

Committee Member

Alexander Herzog

Committee Member

Jun Luo

Committee Member

Christopher McMahan

Committee Member

Ilya Safro

Abstract

Topic modeling is widely used to extract the structures (topics) in a collection (corpus) of documents. One popular method is Latent Dirichlet Allocation (LDA). LDA assumes a Bayesian generative model with multinomial distributions over topics and over the vocabulary within each topic. The LDA model result (i.e., the number and types of topics in the corpus) depends on tuning parameters. Several methods, ad hoc or heuristic, have been proposed and analyzed for selecting these parameters. However, all of these methods were developed using one or more real corpora. With real corpora, the true number and types of topics are unknown, and it is difficult to assess how well the data follow the assumptions of LDA. To address this issue, we developed a factorial simulation design to create corpora with known structure that varied on the following four factors: 1) number of topics, 2) proportions of topics in documents, 3) size of the vocabulary in topics, and 4) proportion of vocabulary that is contained in documents. Results suggest that the quality of LDA fitting depends on the document-topic distribution and that fitting performs best when the document lengths are at least four times the vocabulary size. We also propose a pre-processing method that may be used to improve the quality of the LDA result in some of the worst-case scenarios from the factorial simulation study.
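As a rough illustration of what a factorial simulation of corpora with known structure might look like, the sketch below draws documents from the standard LDA generative process over a small full-factorial grid. It is not the dissertation's actual design: the factor levels, the function names (`sample_dirichlet`, `simulate_corpus`), the symmetric Dirichlet priors, and the use of a document-length-to-vocabulary-size ratio as a stand-in for the fourth factor are all assumptions made for illustration.

```python
import itertools
import random


def sample_dirichlet(rng, alpha, size):
    """Draw one symmetric Dirichlet(alpha) vector by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_corpus(n_topics, doc_topic_alpha, vocab_size, doc_length,
                    n_docs=10, topic_word_beta=0.5, seed=0):
    """Generate a corpus whose true topics are known (hypothetical helper).

    Returns (topics, corpus): topics[k] is the word distribution of topic k,
    and corpus[d] is a list of word ids for document d.
    """
    rng = random.Random(seed)
    # True topic-word distributions, one per topic.
    topics = [sample_dirichlet(rng, topic_word_beta, vocab_size)
              for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # True topic proportions for this document.
        theta = sample_dirichlet(rng, doc_topic_alpha, n_topics)
        doc = []
        for _ in range(doc_length):
            z = rng.choices(range(n_topics), weights=theta)[0]      # topic of this token
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word from that topic
            doc.append(w)
        corpus.append(doc)
    return topics, corpus


# Assumed factor levels, chosen only to keep the example small.
factor_levels = {
    "n_topics": [2, 5],
    "doc_topic_alpha": [0.1, 1.0],      # small alpha: few topics per document
    "vocab_size": [20, 50],
    "length_to_vocab_ratio": [1, 4],    # document length as a multiple of vocabulary size
}

# Full factorial design: every combination of factor levels is one cell.
design = list(itertools.product(*factor_levels.values()))

results = {}
for n_topics, alpha, vocab_size, ratio in design:
    topics, corpus = simulate_corpus(
        n_topics, alpha, vocab_size, doc_length=ratio * vocab_size)
    results[(n_topics, alpha, vocab_size, ratio)] = (topics, corpus)
```

Because each cell stores the true topic-word distributions alongside the generated documents, a fitted LDA model could in principle be compared against the known structure for that cell, which is not possible with a real corpus.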
