Date of Award
8-2016
Document Type
Thesis
Degree Name
Master of Science (MS)
Legacy Department
Computer Science
Committee Member
Dr. Amy Apon, Committee Chair
Committee Member
Dr. Brian Malloy
Committee Member
Dr. Paul Wilson
Abstract
Dynamic Topic Models (DTM) are a way to extract time-variant information from a collection of documents. The only available implementation of DTM is slow, taking days to process a corpus of 533,588 documents. To see how topics (both their key words and their proportional size across all documents) change over time, we analyze Clustered Latent Dirichlet Allocation (CLDA) as an alternative to DTM. This algorithm is built from existing parallel components: it uses Latent Dirichlet Allocation (LDA) to extract topics within local time periods, and k-means clustering to combine topics from different time periods. This method is two orders of magnitude faster than DTM and allows more freedom in experiment design. Results show that most topics generated by this algorithm are similar to those generated by DTM at both the local and global level, as measured by the Jaccard index and the Sørensen-Dice coefficient, and that this method's perplexity compares favorably to DTM's. We also explore tradeoffs in CLDA method parameters.
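As a rough illustration of the two similarity measures named in the abstract, the sketch below computes the Jaccard index and the Sørensen-Dice coefficient between the top-word sets of two topics. The function names and the word lists are hypothetical and are not taken from the thesis; how the thesis selects and pairs topics for comparison is not specified here.

```python
def jaccard_index(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| over two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def sorensen_dice(a, b):
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical top words for one CLDA topic and one DTM topic (illustrative only).
clda_topic = ["network", "wireless", "sensor", "protocol", "routing"]
dtm_topic = ["network", "wireless", "mobile", "protocol", "node"]

print(jaccard_index(clda_topic, dtm_topic))  # 3 shared words out of 7 distinct
print(sorensen_dice(clda_topic, dtm_topic))  # 2 * 3 shared / (5 + 5) words
```

Both measures range from 0 (no shared words) to 1 (identical word sets). For the same pair of sets, the Sørensen-Dice coefficient is never smaller than the Jaccard index.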
Recommended Citation
Gropp, Christopher, "Analyzing Clustered Latent Dirichlet Allocation" (2016). All Theses. 2471.
https://open.clemson.edu/all_theses/2471