|Title||Ontology Learning from Text – IS-A Relation Identification|
|Publication Type||Conference Paper|
|Year of Publication||2008|
|Authors||Ryu, P-M, Choi, K-S|
The web is evolving from a huge information and communication space into a massive knowledge and service repository. The Semantic Web is a vision of this evolution, in which machines can perform more of the tedious work involved in finding, sharing and combining information on the web [Wikipedia_Semantic_Web]. Ontologies provide a sound semantic foundation for machine-understandable descriptions of digital content, and the Semantic Web relies heavily on formal ontologies to structure data for comprehensive and transportable machine understanding [Zhou, 2007]. The proliferation of ontologies is therefore a major factor in the Semantic Web’s success. Domain ontology development, however, is still in its infancy in terms of both quantity and quality. One of the main issues is the high cost of manual knowledge acquisition for ontology construction: acquiring domain knowledge requires many resources and is time consuming, and for most existing ontologies the process is conducted mainly by hand. This has become one of the bottlenecks of ontology development. For this reason, how to acquire ontological knowledge effectively from available resources, so as to reduce this effort, has become a hot topic in the ontology research community [Maedche & Staab, 2001].
The ontology learning problem involves three major issues: the ontology components to be learned, the data resources, and the learning approaches. Ontology components range from classes and semantic relations to logical restrictions. Resources for ontology learning range from free text to structured knowledge such as thesauri or lexical ontologies. Once the target ontology component and data resource are identified, the learning strategy can be chosen. Data resources should cover a wide area, so that unbiased knowledge of the target domain can be extracted, and be well formed enough for accurate knowledge to be extracted. Recently, there has been increasing interest in the automatic extraction of structured information from large corpora and, in particular, from the Web. Besides the free text of plain web data, new types of web content such as blogs and wikis are also a source of textual information with an underlying structure from which specialist systems can benefit. Various relation extraction methods have been developed to suit the different types of resources. General pattern matching is efficient for extracting relations from free text. Strict pattern matching can be applied to semi-structured text such as definitions or glossaries. Classification methods based on machine learning or rules are applied to classify relation types in existing structured resources such as the Wikipedia category system or thesauri.
The IS-A relation, or hyponymy, is a partial ordering that organizes terms or concepts from the general to the specific. The specific inherits the features of the general. The IS-A relation is a backbone of knowledge organization and sharing. In particular, as a major component of an ontology, it can provide an organizational model for a domain (a domain ontology). A taxonomy is a classification structure for a given set of objects together with the IS-A relations among those objects. In a simple taxonomy of the network domain, the term ‘computer network’ is a hyponym of ‘electronic network’ and also a hypernym of ‘WAN’ and ‘LAN’ (Figure 1). We denote an IS-A relation as in Eq. (1), where the hyponym is an instance or subclass of the hypernym. It can be thought of as shorthand for ‘is a type of’ or ‘is a kind of’. From the viewpoint of ontology relations, we alternatively call the hyponym and the hypernym the domain and the range, respectively.
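The network example can be written down as a small data structure. The following Python sketch is only an illustration, not the representation used in the paper: the class name `Taxonomy` and its methods are hypothetical. It stores each IS-A edge as a (hyponym, hypernym) pair and follows edges transitively, so that a specific term inherits every more general term above it.

```python
from collections import defaultdict


class Taxonomy:
    """Minimal IS-A store: each edge links a hyponym (domain) to a hypernym (range)."""

    def __init__(self):
        self._hypernyms = defaultdict(set)  # hyponym -> direct hypernyms
        self._hyponyms = defaultdict(set)   # hypernym -> direct hyponyms

    def add_is_a(self, hyponym, hypernym):
        """Record that `hyponym` IS-A `hypernym`."""
        self._hypernyms[hyponym].add(hypernym)
        self._hyponyms[hypernym].add(hyponym)

    def hyponyms(self, term):
        """Direct hyponyms of `term`."""
        return set(self._hyponyms.get(term, ()))

    def ancestors(self, term):
        """All hypernyms reachable from `term` by following IS-A edges."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in self._hypernyms.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_a(self, hyponym, hypernym):
        """True if `hypernym` lies above `hyponym` in the taxonomy."""
        return hypernym in self.ancestors(hyponym)


# The network-domain example from the text.
taxonomy = Taxonomy()
taxonomy.add_is_a("computer network", "electronic network")
taxonomy.add_is_a("WAN", "computer network")
taxonomy.add_is_a("LAN", "computer network")
```

With these three edges, `taxonomy.is_a("LAN", "electronic network")` holds by transitivity, while the reverse query does not. In this sketch `is_a` is strict: a term is not reported as its own hypernym.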
We propose a new method for extracting IS-A relations from text that combines pattern matching with machine learning. In the first step, lexico-syntactic patterns are applied to extract IS-A relation candidates. Because semantic ambiguity is inherent in the patterns, some of the candidates are not real IS-A relations. In the next step, therefore, a machine learning classifier identifies the true IS-A relations among the candidates. Features for the general relation classification problem and features specific to the IS-A relation are used together. The Wikipedia category system and the Wiktionary definition structure are also exploited as features for the classifier.
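The first step can be illustrated with a minimal Python sketch. It makes several assumptions: the single regular expression `IS_A_PATTERN` and the function `extract_candidates` are hypothetical, since the text does not list the actual lexico-syntactic patterns, and a real system would likely match over parsed or tagged text instead of raw strings. The classification step is not sketched here.

```python
import re

# Illustrative pattern only: "<X> is a <Y>", "<X> is a kind of <Y>", and so on.
# The patterns actually used by the method are not given in the text.
IS_A_PATTERN = re.compile(
    r"^(?:an? |the )?(?P<hypo>.+?) is an? (?:(?:kind|type) of )?(?P<hyper>.+?)\.?$",
    re.IGNORECASE,
)


def extract_candidates(sentences):
    """Return (hyponym, hypernym) candidate pairs for sentences matching the pattern."""
    candidates = []
    for sentence in sentences:
        match = IS_A_PATTERN.match(sentence.strip())
        if match:
            candidates.append((match.group("hypo"), match.group("hyper")))
    return candidates


sentences = [
    "A LAN is a computer network.",
    "A WAN is a kind of computer network.",
    "The upgrade is a priority.",     # matches the pattern, but is not a taxonomic IS-A
    "Networks connect computers.",    # no match
]
candidates = extract_candidates(sentences)
```

Here `candidates` contains `("LAN", "computer network")` and `("WAN", "computer network")`, but also `("upgrade", "priority")`. The last pair shows the kind of ambiguity that makes a second, filtering step necessary.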