Automatic indexing: an approach using an index term corpus and combining linguistic and statistical methods

Title	Automatic indexing: an approach using an index term corpus and combining linguistic and statistical methods
Publication Type	Thesis
Year of Publication	2000
Authors	Lahtinen T
Academic Department	Faculty of Arts, Department of General Linguistics
Degree	Doctor of Philosophy
Date Published	12/2000
University	University of Helsinki
City	Helsinki
ISBN Number	951-45-9639
Keywords	indexing, term recognition
Abstract	This thesis discusses the problems and the methods of finding relevant information in large collections of documents. The contribution of this thesis to this problem is to develop better content analysis methods which can be used to describe document content with index terms. Index terms can be used as meta-information that describes documents, and that is used for seeking information. The main point of this thesis is to illustrate the process of developing an automatic indexer which analyses the content of documents by combining evidence from word frequencies and evidence from linguistic analysis provided by a syntactic parser. The indexer weights the expressions of a text according to their estimated importance for describing the content of a given document on the basis of the content analysis. The typical linguistic features of index terms were explored using a linguistically analysed text collection where the index terms are manually marked up. This text collection is referred to as an index term corpus. Specific features of the index terms provided the basis for a linguistic term-weighting scheme, which was then combined with a frequency-based term-weighting scheme. The use of an index term corpus like this as training material is a new method of developing an automatic indexer. The results of the experiments were promising.
URL	http://urn.fi/URN:ISBN:951-45-9640-4
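
The abstract describes combining a frequency-based term-weighting scheme with a linguistic one, but this record gives no formulas. The following is only a minimal sketch of that general idea, not the thesis's method. The role labels, their weights, the smoothed TF-IDF formula, and the linear mixing parameter `alpha` are all assumptions made for illustration.

```python
import math

# Hypothetical weights for syntactic roles that a parser might assign.
# The thesis derives its linguistic scheme from an index term corpus;
# these labels and values are invented for this sketch.
LINGUISTIC_WEIGHTS = {"head_noun": 1.0, "premodifier": 0.6, "other": 0.2}


def tf_idf(term, doc_tokens, corpus):
    """Frequency evidence: relative term frequency times a smoothed IDF."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf


def combined_weight(term, role, doc_tokens, corpus, alpha=0.5):
    """Mix frequency evidence and linguistic evidence linearly.

    alpha = 1.0 uses only the frequency score; alpha = 0.0 uses only
    the (assumed) linguistic weight of the term's syntactic role.
    """
    freq_score = tf_idf(term, doc_tokens, corpus)
    ling_score = LINGUISTIC_WEIGHTS.get(role, LINGUISTIC_WEIGHTS["other"])
    return alpha * freq_score + (1 - alpha) * ling_score


if __name__ == "__main__":
    doc = ["automatic", "indexing", "of", "documents"]
    corpus = [doc, ["syntactic", "parser"], ["indexing", "terms"]]
    for term, role in [("indexing", "head_noun"), ("of", "other")]:
        print(term, combined_weight(term, role, doc, corpus))
```

In this sketch a term that the parser marks as a head noun is ranked above the same term in a less informative role, which mirrors the abstract's point that linguistic evidence can adjust a purely frequency-based ranking.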
