Date of Award
12-2012
Document Type
Thesis
Degree Name
Master of Science (MS)
Legacy Department
Computer Engineering
Committee Chair/Advisor
Brooks, Richard R
Committee Member
Wang, Kuang-Ching
Committee Member
Hoover, Adam
Abstract
Data leak prevention (DLP) solutions monitor and control data flow. Current techniques find data that matches user-defined syntactic patterns. Unfortunately, large classes of DLP-relevant data are defined by information semantics rather than data syntax. Syntax refers to data format, whereas semantics refers to data meaning. For example, the class of social security numbers can be adequately expressed using data syntax, whereas a new industrial process can only be adequately described using information semantics. In this paper, we propose methods for extracting and identifying document semantics using training sets of limited size (tens of documents). The first method is based on singular value decomposition (SVD), which uses linear algebra to automatically extract semantic features from documents in the training set. The second method infers a hidden Markov model (HMM) expressing relations between the features extracted by the SVD method. This HMM can detect documents containing the semantic information of the intellectual property. A third method views the English language as a probabilistic context-free grammar (PCFG) and extracts semantic information from individual sentences in order to detect documents containing intellectual property. Test results on 5 document sets show that the proposed methods achieve true positive rates of at least 84% and false positive rates below 22%. Our methods are trained with only positive examples and have lower false positive rates than Latent Dirichlet Allocation (LDA) and Support Vector Machines (SVM).
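As a rough illustration of the first method described in the abstract, the sketch below applies a truncated singular value decomposition to a small term-document matrix built from positive training examples only, then scores a new document by its cosine similarity to the training documents in the reduced space. It is not the thesis implementation: the whitespace tokenizer, the raw term counts, the choice of k, the scoring rule, and all function names (fit_lsa, project, max_similarity) are assumptions made for this example.

```python
import numpy as np


def build_vocab(docs):
    """Map each distinct lowercase token in the training set to a row index."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(vocab)}


def term_vector(doc, vocab):
    """Raw term-count vector; tokens outside the vocabulary are ignored."""
    v = np.zeros(len(vocab))
    for w in doc.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v


def fit_lsa(docs, k):
    """Return the vocabulary and the top-k left singular vectors (terms x k)."""
    vocab = build_vocab(docs)
    # Columns are documents, rows are terms.
    A = np.column_stack([term_vector(d, vocab) for d in docs])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :k]


def project(doc, vocab, Uk):
    """Project a document into the k-dimensional latent feature space."""
    return Uk.T @ term_vector(doc, vocab)


def max_similarity(doc, docs, vocab, Uk):
    """Highest cosine similarity between doc and any training document."""
    q = project(doc, vocab, Uk)
    best = 0.0
    for d in docs:
        p = project(d, vocab, Uk)
        denom = np.linalg.norm(q) * np.linalg.norm(p)
        if denom > 0:
            best = max(best, float(q @ p / denom))
    return best


if __name__ == "__main__":
    # Hypothetical training set standing in for protected documents.
    training = [
        "reactor catalyst temperature pressure",
        "catalyst pressure valve reactor",
        "temperature valve catalyst process",
    ]
    vocab, Uk = fit_lsa(training, k=2)
    for text in ["reactor catalyst pressure", "weather forecast sunny"]:
        print(text, "->", round(max_similarity(text, training, vocab, Uk), 3))
```

In a setting like the one the abstract describes, a document would presumably be flagged when its score exceeds a threshold tuned on held-out data; the threshold value and the use of cosine similarity here are illustrative choices, not results from the thesis.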
Recommended Citation
Zhao, Lianyu, "Semantic Similarity Detection in Natural Language Documents" (2012). All Theses. 1526.
https://open.clemson.edu/all_theses/1526