Document Classification Using Machine Learning and Ontologies

Title: Document Classification Using Machine Learning and Ontologies
Publication Type: Thesis
Year of Publication: 2011
Authors: Nyberg K
University: Aalto University
Thesis Type: Master's
Keywords: bag of words, document classification, logistic discriminant, machine learning, ontologies, syntactical analysis, YSO

This master’s thesis explores how documents can be automatically classified based on their contents. Automatic classification of data is one of the main applications of machine learning: from data that has already been classified, a model that predicts the most likely class can be learned. The thesis also explores whether background knowledge from ontologies can be added to the model to improve classification accuracy. A new machine learning model that incorporates ontology information is introduced.
The proposed method for learning a classification model and enhancing it with ontology information is applied in a case study for the Finnish National Archives, using a set of digital documents that have been manually classified. An RDF schema for representing documents, sentences and words is created to prepare the data for the machine learning analysis. The words are reduced to their base forms and matched semi-automatically with concepts of the General Finnish Ontology YSO. The ontology-enhanced model is then applied to the data, and the most likely classes for the documents are learned.
The master’s thesis shows that the classification accuracy of the model increases when ontology information is added to it.
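
The idea of enriching a bag-of-words representation with ontology concepts can be illustrated with a minimal sketch. It is not the model from the thesis: the word-to-concept mapping, the concept labels and the function names below are all hypothetical, and the sketch only shows how concept counts might be added alongside word counts before a classifier is trained.

```python
from collections import Counter

# Hypothetical toy mapping from base-form words to ontology concept labels.
# The thesis matches words to YSO concepts; these entries are invented
# purely for illustration.
WORD_TO_CONCEPT = {
    "invoice": "concept:finance",
    "payment": "concept:finance",
    "contract": "concept:law",
}


def bag_of_words(tokens):
    """Count how often each base-form word occurs in a document."""
    return Counter(tokens)


def add_concept_features(word_counts, word_to_concept):
    """Return the word counts extended with counts of matched concepts."""
    features = Counter(word_counts)
    for word, count in word_counts.items():
        concept = word_to_concept.get(word)
        if concept is not None:
            # Every occurrence of a matched word also counts for its concept.
            features[concept] += count
    return features


# Tokens are assumed to be in base form already.
tokens = ["invoice", "payment", "invoice", "archive"]
features = add_concept_features(bag_of_words(tokens), WORD_TO_CONCEPT)
```

In this sketch, two different words that map to the same concept contribute to one shared feature, which is one way background knowledge could help a classifier generalize across related vocabulary.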