Abstract:
Text categorization plays a crucial role in both academic and commercial platforms due
to the growing demand for automatic organization of documents. Kernel-based
classification algorithms such as Support Vector Machines (SVM) have become highly
popular in the task of text mining. This is mainly due to their relatively high
classification accuracy on several application domains as well as their ability to handle
high-dimensional and sparse data, the prohibitive characteristics of textual data
representations. Recently, there has been increasing interest in exploiting
background knowledge, such as ontologies and corpus-based statistical knowledge, in
text categorization. It has been shown that replacing standard kernel functions,
such as the linear kernel, with customized kernel functions that take advantage of this
background knowledge can increase the performance of SVM in the text
classification domain. Based on this, we developed a variety of semantic kernel
methods to explore the capabilities of higher-order paths, class-based meaning
values, and class-based weighting of terms in both supervised and semi-supervised
learning (SSL) settings for SVM.
We propose several corpus-based semantic kernels for SVM that implicitly extract and
make use of semantic relations: the Higher-Order Semantic Kernel (HOSK), the Iterative
Higher-Order Semantic Kernel (IHOSK), and the Higher-Order Term Kernel (HOTK).
HOSK makes use of higher-order co-occurrence paths of terms between
documents. In HOSK, the simple dot product between the documents' feature vectors
of term frequencies yields a first-order document relation matrix (F). A second-order
document matrix (S) is formed by multiplying F with itself. S is used as the kernel
matrix in HOSK's transformation from the input space into the feature space. The
experimental results show that HOSK improves accuracy over the linear kernel. A more
advanced model is IHOSK, which uses higher-order paths between documents and terms
together in an iterative form. The document similarity matrix is produced iteratively
using SR (a similarity matrix between documents) and SC (a similarity matrix between
terms). Experimental results show that classification performance increases relative to
the linear kernel. In a subsequent study, we consider a less complex higher-order
kernel, HOTK, which is based on higher-order paths between terms only. HOTK is much
simpler than IHOSK and requires fewer computational resources.
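As a minimal sketch of the higher-order idea (toy term-frequency matrix, NumPy; the actual thesis formulations may include normalization and weighting not shown here), the HOSK and HOTK constructions can be illustrated as:

```python
import numpy as np

# Hypothetical toy corpus: 4 documents x 5 terms (term frequencies).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

# First-order document relation matrix F: plain dot products of the
# term-frequency vectors (the linear kernel's Gram matrix).
F = X @ X.T

# Second-order matrix S = F @ F, used as the HOSK kernel matrix:
# two documents that share no terms (F entry is zero) can still become
# similar through a higher-order path via a third document.
S = F @ F

# HOTK-style variant: higher-order paths between terms only.
# C = X.T @ X is the term co-occurrence matrix; squaring it links terms
# that share a co-occurring neighbour term, and the document kernel
# becomes K = X @ (C @ C) @ X.T.
C = X.T @ X
K_hotk = X @ (C @ C) @ X.T
```

In this toy example documents 1 and 4 share no terms, so their linear-kernel similarity F is zero, yet their second-order similarity in S is positive through intermediate documents.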
We also propose a novel approach for building a semantic kernel for SVM, which we
name the Class Meaning Kernel (CMK). We apply CMK in a semi-supervised learning (SSL)
setting together with a new approach to the initial labeling of unlabeled data, called
ILBOM. The
suggested approaches smooth the term weights of a document in the bag-of-words (BOW)
representation with class-based meaning values of terms. They reduce the disadvantages
of BOW by increasing the importance of class-specific concepts, which can be
synonymous or closely related within a class. The meaning values of terms are
calculated according to the Helmholtz principle from Gestalt theory in the context of
classes. Our experimental results show that both CMK and ILBOM substantially
outperform the linear kernel in classification accuracy.
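The smoothing idea can be sketched as follows. The term-by-class matrix `M` here holds made-up numbers standing in for the Helmholtz-based meaning values, and `cmk_kernel` is a hypothetical simplification of the kernel, not the thesis' exact formulation:

```python
import numpy as np

# Hypothetical term-by-class meaning matrix M (5 terms, 2 classes).
# The values are invented; in the thesis they are derived per class
# from the Helmholtz principle of Gestalt theory.
M = np.array([
    [0.9, 0.0],   # term 0: strongly tied to class A
    [0.8, 0.1],   # term 1: strongly tied to class A
    [0.3, 0.3],   # term 2: neutral
    [0.1, 0.8],   # term 3: strongly tied to class B
    [0.0, 0.9],   # term 4: strongly tied to class B
])

def cmk_kernel(d1, d2, M=M):
    # Smooth each BOW vector by projecting it through the class meaning
    # values, then take the dot product: k(d1, d2) = d1 @ (M @ M.T) @ d2.
    return (d1 @ M) @ (d2 @ M)

# Two documents that share no terms but use same-class terms
# (e.g. synonyms specific to class A):
d1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

print(d1 @ d2)             # linear kernel: 0.0, no similarity at all
print(cmk_kernel(d1, d2))  # positive: shared class meaning links them
```

This illustrates how class-based smoothing relaxes the orthogonality of BOW: terms that never co-occur can still contribute similarity through their shared class meaning.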
Additionally, we propose another approach, called the Class Weighting Kernel
(CWK). This approach is similar to CMK; however, it improves on CMK mainly in terms of
calculation time. The class-based weighting groups terms according to their importance
for each class. It therefore smooths the document representation, changing the
orthogonality of the vector space model by introducing class-based dependencies
between terms.
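A rough sketch of a CWK-style weighting follows; the relative-frequency weighting used here is a deliberately cheap stand-in for the thesis' actual scheme, chosen only to illustrate why such class-based weights are faster to compute than Helmholtz-based meaning values:

```python
import numpy as np

# Toy labeled training counts: rows = 5 terms, cols = 2 classes;
# counts[t, c] = occurrences of term t in training documents of class c.
counts = np.array([
    [10, 0],
    [8, 1],
    [3, 3],
    [1, 9],
    [0, 12],
], dtype=float)

# Hypothetical class-based weighting: each term's weight for a class is
# its relative frequency within that class. A single pass over the
# counts suffices, which is the kind of saving that makes a CWK-style
# kernel cheaper to build than CMK's meaning calculation.
W = counts / counts.sum(axis=0, keepdims=True)

def cwk_kernel(d1, d2, W=W):
    # Smoothed kernel k(d1, d2) = (d1 @ W) . (d2 @ W): terms that are
    # important for the same class become correlated, breaking the
    # orthogonality of the plain vector space model.
    return (d1 @ W) @ (d2 @ W)
```

As with CMK, two documents built from different but same-class terms receive a positive similarity under `cwk_kernel` even when their plain dot product is zero.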
The main contribution of this dissertation is the construction of novel semantic
kernels that are applied to supervised and semi-supervised text classification. We
show that these kernels perform much better than standard kernels in terms of
classification accuracy. The proposed approaches are independent of external semantic
sources such as WordNet, so they can be applied to any language or domain. They also
form a foundation that can easily be combined with other term-based semantic
similarity methods, such as unsupervised semantic similarity measures. To the best of
our knowledge, this is the first time in the literature that higher-order paths and
class-based values of terms are used in the transformation phase of SVM, and they
provide significant benefits for the semantic smoothing of terms in a kernel for text
classification.