Abstract:
Text categorization plays a crucial role in both academic and commercial platforms due
to the growing demand for automatic organization of documents. Kernel-based
classification algorithms such as Support Vector Machines (SVM) have become highly
popular in the task of text mining. This is mainly due to their relatively high
classification accuracy on several application domains as well as their ability to handle
high-dimensional and sparse data, the prohibitive characteristics of textual data
representations. Recently, there has been increasing interest in exploiting
background knowledge, such as ontologies and corpus-based statistical knowledge, in
text categorization. It has been shown that replacing standard kernel functions,
such as the linear kernel, with customized kernel functions that take advantage of this
background knowledge can increase the performance of SVM in the text
classification domain. Based on this, we developed a variety of semantic kernel
methods to explore the capabilities of higher-order paths, class-based meaning
values, and class-based weighting of terms in both supervised and semi-supervised
learning (SSL) settings for SVM.
We propose several corpus-based semantic kernels for SVM that implicitly extract and
make use of semantic relations: the Higher-Order Semantic Kernel (HOSK), the Iterative
Higher-Order Semantic Kernel (IHOSK), and the Higher-Order Term Kernel (HOTK).
HOSK makes use of higher-order co-occurrence paths of terms between
documents. In HOSK, the simple dot product between the documents' feature vectors
of term frequencies yields a first-order document relation matrix (F). A second-order
document matrix (S) is formed by multiplying F with itself. S is used as the kernel
matrix in HOSK's transformation from the input space into the feature space. The
experimental results show that HOSK improves accuracy over the linear kernel. A more
advanced model is IHOSK, which uses higher-order paths between documents and terms
together in an iterative form. The document similarity matrix is produced iteratively
using SR (a similarity matrix between documents) and SC (a similarity matrix between
terms). Experimental results show that classification performance increases relative to
the linear kernel. In a subsequent study, we consider a less complex higher-order
kernel, HOTK, which is based on higher-order paths between terms only. HOTK is much
simpler than IHOSK and requires fewer computational resources.
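As a minimal sketch of the higher-order idea (toy term-frequency matrix, NumPy; the actual thesis formulations may include normalization and weighting not shown here), the HOSK and HOTK constructions can be illustrated as:

```python
import numpy as np

# Hypothetical toy corpus: 4 documents x 5 terms (term frequencies).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

# First-order document relation matrix F: plain dot products of the
# term-frequency vectors (the linear kernel's Gram matrix).
F = X @ X.T

# Second-order matrix S = F @ F, used as the HOSK kernel matrix:
# two documents that share no terms (F entry is zero) can still become
# similar through a higher-order path via a third document.
S = F @ F

# HOTK-style variant: higher-order paths between terms only.
# C = X.T @ X is the term co-occurrence matrix; squaring it links terms
# that share a co-occurring neighbour term, and the document kernel
# becomes K = X @ (C @ C) @ X.T.
C = X.T @ X
K_hotk = X @ (C @ C) @ X.T
```

In this toy example documents 1 and 4 share no terms, so their linear-kernel similarity F is zero, yet their second-order similarity in S is positive through intermediate documents.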
We also propose a novel approach for building a semantic kernel for SVM, which we
name the Class Meaning Kernel (CMK). We apply CMK in a semi-supervised learning (SSL)
setting together with a new approach to the initial labeling of unlabeled data, called
ILBOM. The
suggested approaches smooth the term weights of a document in the bag-of-words (BOW)
representation with class-based meaning values of terms. They reduce the disadvantages
of BOW by increasing the importance of class-specific concepts, which can be
synonymous or closely related within a class. The meaning values of terms are
calculated according to the Helmholtz principle from Gestalt theory in the context of
classes. Our experimental results show that both CMK and ILBOM substantially
outperform the linear kernel in classification accuracy.
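The smoothing idea can be sketched as follows. The term-by-class matrix `M` here holds made-up numbers standing in for the Helmholtz-based meaning values, and `cmk_kernel` is a hypothetical simplification of the kernel, not the thesis' exact formulation:

```python
import numpy as np

# Hypothetical term-by-class meaning matrix M (5 terms, 2 classes).
# The values are invented; in the thesis they are derived per class
# from the Helmholtz principle of Gestalt theory.
M = np.array([
    [0.9, 0.0],   # term 0: strongly tied to class A
    [0.8, 0.1],   # term 1: strongly tied to class A
    [0.3, 0.3],   # term 2: neutral
    [0.1, 0.8],   # term 3: strongly tied to class B
    [0.0, 0.9],   # term 4: strongly tied to class B
])

def cmk_kernel(d1, d2, M=M):
    # Smooth each BOW vector by projecting it through the class meaning
    # values, then take the dot product: k(d1, d2) = d1 @ (M @ M.T) @ d2.
    return (d1 @ M) @ (d2 @ M)

# Two documents that share no terms but use same-class terms
# (e.g. synonyms specific to class A):
d1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

print(d1 @ d2)             # linear kernel: 0.0, no similarity at all
print(cmk_kernel(d1, d2))  # positive: shared class meaning links them
```

This illustrates how class-based smoothing relaxes the orthogonality of BOW: terms that never co-occur can still contribute similarity through their shared class meaning.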
Additionally, we propose another approach, called the Class Weighting Kernel
(CWK). This approach is similar to CMK; however, it improves on CMK mainly in terms of
calculation time. The class-based weighting groups terms according to their importance
for each class. It therefore smooths the document representation, changing the
orthogonality of the vector space model by introducing class-based dependencies
between terms.
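A rough sketch of a CWK-style weighting follows; the relative-frequency weighting used here is a deliberately cheap stand-in for the thesis' actual scheme, chosen only to illustrate why such class-based weights are faster to compute than Helmholtz-based meaning values:

```python
import numpy as np

# Toy labeled training counts: rows = 5 terms, cols = 2 classes;
# counts[t, c] = occurrences of term t in training documents of class c.
counts = np.array([
    [10, 0],
    [8, 1],
    [3, 3],
    [1, 9],
    [0, 12],
], dtype=float)

# Hypothetical class-based weighting: each term's weight for a class is
# its relative frequency within that class. A single pass over the
# counts suffices, which is the kind of saving that makes a CWK-style
# kernel cheaper to build than CMK's meaning calculation.
W = counts / counts.sum(axis=0, keepdims=True)

def cwk_kernel(d1, d2, W=W):
    # Smoothed kernel k(d1, d2) = (d1 @ W) . (d2 @ W): terms that are
    # important for the same class become correlated, breaking the
    # orthogonality of the plain vector space model.
    return (d1 @ W) @ (d2 @ W)
```

As with CMK, two documents built from different but same-class terms receive a positive similarity under `cwk_kernel` even when their plain dot product is zero.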
The main contribution of this dissertation is the construction of novel semantic
kernels that are applied to supervised and semi-supervised text classification. We
show that these kernels perform much better than standard kernels in terms of
classification accuracy. The proposed approaches are independent of external semantic
sources such as WordNet, so they can be applied to any language or domain. They also
form a foundation that can easily be combined with other term-based semantic
similarity methods, such as unsupervised semantic similarity measures. To the best of
our knowledge, this is the first time in the literature that higher-order paths and
class-based values of terms are used in the transformation phase of SVM, and they
provide significant benefits for the semantic smoothing of terms in a kernel for text
classification.