Automatic indexing: an approach using an index term corpus and combining linguistic and statistical methods

Title: Automatic indexing: an approach using an index term corpus and combining linguistic and statistical methods
Publication Type: Thesis
Year of Publication: 2000
Authors: Lahtinen, T
Academic Department: Faculty of Arts, Department of General Linguistics
Degree: Doctor of Philosophy
Date Published: 12/2000
University: University of Helsinki
City: Helsinki
ISBN Number: 951-45-9639
Keywords: indexing, term recognition
Abstract

This thesis addresses the problem of finding relevant information in large document collections. Its contribution is the development of improved content analysis methods for describing document content with index terms, which serve as meta-information about documents and support information seeking. The thesis illustrates the process of developing an automatic indexer that analyses document content by combining evidence from word frequencies with evidence from linguistic analysis provided by a syntactic parser. On the basis of this content analysis, the indexer weights the expressions of a text according to their estimated importance for describing the content of the document. The typical linguistic features of index terms were explored using an index term corpus: a linguistically analysed text collection in which the index terms are manually marked up. Specific features of the index terms provided the basis for a linguistic term-weighting scheme, which was then combined with a frequency-based term-weighting scheme. Using an index term corpus of this kind as training material is a new method of developing an automatic indexer. The results of the experiments were promising.
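
The abstract does not give the actual weighting formulas, so the following is only a minimal sketch of the general idea of combining a frequency-based weight with a linguistic weight. Everything specific in it is an assumption made for illustration: TF-IDF stands in for the frequency-based scheme, the syntactic tags and their values in `LINGUISTIC_WEIGHTS` are hypothetical, and the linear interpolation with `alpha` is one simple way to combine two scores, not necessarily the one used in the thesis.

```python
import math
from collections import Counter

# Hypothetical weights per syntactic tag. Illustrative values only,
# not taken from the thesis.
LINGUISTIC_WEIGHTS = {"subject": 1.0, "object": 0.8, "modifier": 0.4}


def tf_idf(term, doc_terms, collection):
    """Frequency-based score (assumed here to be plain TF-IDF).

    doc_terms: list of terms in the document.
    collection: list of sets, one set of terms per document.
    """
    tf = Counter(doc_terms)[term] / len(doc_terms)
    df = sum(1 for other in collection if term in other)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf


def linguistic_score(tags):
    """Mean hypothetical weight over the tags of a term's occurrences."""
    if not tags:
        return 0.0
    return sum(LINGUISTIC_WEIGHTS.get(tag, 0.0) for tag in tags) / len(tags)


def combined_weight(freq_score, ling_score, alpha=0.5):
    """Linear interpolation of the two scores (an assumed combination rule)."""
    return alpha * freq_score + (1 - alpha) * ling_score


def rank_terms(doc, collection, alpha=0.5):
    """Rank candidate index terms of one document, best first.

    doc: list of (term, syntactic_tag) pairs, as a parser might produce.
    """
    terms = [term for term, _ in doc]
    tags_by_term = {}
    for term, tag in doc:
        tags_by_term.setdefault(term, []).append(tag)
    scores = {
        term: combined_weight(
            tf_idf(term, terms, collection), linguistic_score(tags), alpha
        )
        for term, tags in tags_by_term.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


doc = [
    ("indexing", "subject"),
    ("indexing", "object"),
    ("method", "modifier"),
    ("the", "modifier"),
]
collection = [{"indexing", "method", "the"}, {"the", "method"}, {"the"}]
ranking = rank_terms(doc, collection)
```

In this toy input, a term that is both rare in the collection and found in syntactically prominent positions ranks above a term that occurs in every document. In the approach the abstract describes, the linguistic weights would be derived from the features observed in the index term corpus rather than set by hand.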

URL: http://urn.fi/URN:ISBN:951-45-9640-4