Issues in topic tracking in Wikipedia articles

Title: Issues in topic tracking in Wikipedia articles
Publication Type: Journal Article
Year of Publication: 2011
Authors: Konstantinova, N, Orăsan, C
Journal: Studia Univ. Babes–Bolyai, Informatica

In the last few years, Wikipedia has become a very useful resource for NLP, offering access to both structured and unstructured information that can be used for further language processing. One particularity of Wikipedia articles is that each focuses on a single topic (e.g. a product, person, location or event), which is detailed throughout the article. To extract comprehensive information from these articles, it is necessary to track the different expressions that refer to the topic. This paper discusses the issues to be tackled when a topic tracking algorithm is implemented. To address this problem, a shallow rule-based coreference resolution method for topic tracking was implemented.
The results of this research are intended to be used for the development of an interactive question answering (IQA) system that guides users in their search process. The answers to be provided by the IQA system will be acquired using information extraction from Wikipedia pages. To make this process more precise, it is necessary to track all the mentions of the topic throughout the article regardless of how the topic is expressed.
Attempts to use state-of-the-art coreference resolution systems showed that they achieve very low precision on this task and link NPs that are not coreferential at all. In most cases this happens because the algorithms rely heavily on substring matching and distinguish poorly between entities with similar names. This is clearly visible in the chain generated by RECONCILE [6] for the article describing the mobile phone “HTC Magic”: 'The HTC Magic' - 'HTC' - 'The HTC Dream' - 'Vodafone' - 'it' - 'the Vodafone Magic'. The low performance of the state-of-the-art systems motivated us to develop our own system that works with high accuracy for our domain.
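The substring-matching failure mode can be illustrated with a small toy matcher. The sketch below is not RECONCILE's algorithm and not the method developed in this paper; `tokens` and `naive_chain` are hypothetical helpers written only to show how linking mentions by shared tokens pulls a different product into the chain for the topic.

```python
import re

# Determiners are ignored so that "The HTC Magic" and "HTC Magic" compare equal.
DETERMINERS = {"the", "a", "an"}


def tokens(mention):
    """Return the lowercase word tokens of a mention, without determiners."""
    words = re.findall(r"[a-z0-9]+", mention.lower())
    return [w for w in words if w not in DETERMINERS]


def naive_chain(topic, mentions):
    """Build a chain by linking any mention that shares a token with the chain.

    This is a deliberately naive stand-in for substring matching: one shared
    token is enough to link a mention, and its tokens then widen the chain.
    """
    chain = [topic]
    seen = set(tokens(topic))
    for mention in mentions:
        mention_tokens = set(tokens(mention))
        if mention_tokens & seen:
            chain.append(mention)
            seen |= mention_tokens
    return chain


# Mentions taken from the example chain quoted in the text.
mentions = ["HTC", "The HTC Dream", "Vodafone", "it", "the Vodafone Magic"]
chain = naive_chain("The HTC Magic", mentions)
print(chain)
# ['The HTC Magic', 'HTC', 'The HTC Dream', 'the Vodafone Magic']
```

Because 'The HTC Dream' shares the token "htc" with the topic, the toy matcher links it even though it names a different phone. Pronouns such as 'it' are outside the scope of this sketch.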
This paper presents the first step of the research: an analysis of how the topic is referred to in Wikipedia articles and of the issues that need to be addressed when developing a topic tracking method. Linguistic investigation of the referential expressions denoting the topic revealed that the notion of coreference is not broad enough. This issue is discussed in Section 2, with emphasis on the particularities of Wikipedia pages. The experiment and the design of the evaluation are described in Section 3. The paper finishes by discussing the results of the research and the conclusions.