Multivariate Methods in Corpus-Based Lexicography: A Study of Synonymy in Finnish

Publication Type: Conference Paper
Year of Publication: 2007
Authors: Arppe, A
Conference Name: The fourth Corpus Linguistics conference
Publisher: University of Birmingham
Conference Location: Birmingham, UK

The purpose of this paper is to present a case study of how multivariate statistical methods such as polytomous logistic regression can be adapted to discover and analyze the wide and complex range of linguistic factors which both influence and interact in the selection and usage of sets of more than two near-synonyms. The results reported in this paper are a follow-up of Arppe (2006), and a preliminary version of those to be presented in full in Arppe (forthcoming).
In the modeling of lexical choice among semantically similar words, specifically near-synonyms, it has been suggested in computational theory that (at least) three levels of representation would be necessary to account for fine-grained meaning differences and the associated usage preferences, namely 1) a conceptual-semantic level, 2) a subconceptual/stylistic-semantic level, and 3) a syntactic-semantic level (Edmonds and Hirst, 2002). With regard to the syntactic-semantic level, it has been shown in (mainly) lexicographically motivated corpus-based studies of actual lexical usage that semantically similar words differ significantly as to 1) the lexical context (e.g. English powerful vs. strong in Church et al., 1991), 2) the syntactic structures which they form part of (e.g. English begin vs. start in Biber et al., 1998), and 3) the semantic classification of some particular argument (e.g. English shake verbs in Atkins and Levin, 1996), as well as 4) the rather style-associated text type in which they are used (e.g. Biber et al., 1998).
In addition to these studies that have focused on English, with its minimal morphology, it has also been shown for languages with an extensive morphological system, such as Finnish, that similar differentiation is evident as to 5) the inflectional forms and the associated morphosyntactic features in which synonyms are used (e.g., the Finnish adjectives tärkeä vs. keskeinen ‘important, central’ in Jantunen, 2001, and the Finnish verbs miettiä and pohtia ‘think, ponder, reflect, consider’ in Arppe, 2002, Arppe and Järvikivi, forthcoming). Recently, in their studies of Russian near-synonymous verbs denoting ‘try’ and ‘intend’, Divjak (2006) and Divjak and Gries (2006) have shown that often more than one type of these factors is in play at the same time, and that it is therefore worthwhile to observe all categories together rather than separately one by one.
All of these studies of synonymy have focused on which contextual factors differentiate words denoting a similar semantic content, in other words, on which directly observable factors determine which word in a group of synonyms is selected in a particular context. This general development represents a shift away from more traditional armchair introspections about the connotations and range of use of synonyms, and it has been made possible by the accelerating development in the last decade or so of corpus linguistic resources, i.e. corpora, and tools, e.g. parsers and statistical programs.
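
To make the idea of polytomous logistic regression over more than two near-synonyms concrete, the following is a minimal sketch of one common formulation, the multinomial logit (softmax), in which each candidate word receives a score from the contextual features present and the scores are normalized into selection probabilities. It is not the procedure used in the paper, which this abstract does not specify. The lexeme labels (`verb_A`, `verb_B`, `verb_C`), the feature names, and all weights are invented for illustration and are not corpus results.

```python
import math

def choice_probabilities(weights, active_features):
    """Return multinomial-logit (softmax) probabilities over candidate lexemes.

    weights: {lexeme: {feature_name: weight}}; a missing feature counts as 0.
    active_features: set of contextual features observed in one context.
    """
    # Linear score per lexeme: sum of the weights of the features present.
    scores = {
        lexeme: sum(w.get(feature, 0.0) for feature in active_features)
        for lexeme, w in weights.items()
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {lexeme: math.exp(score - top) for lexeme, score in scores.items()}
    total = sum(exps.values())
    return {lexeme: value / total for lexeme, value in exps.items()}

# Hypothetical weights, invented for illustration only (not estimated from data).
WEIGHTS = {
    "verb_A": {},  # reference category: all weights zero
    "verb_B": {"first_person_subject": 1.0},
    "verb_C": {"first_person_subject": -1.0, "newspaper_text": 1.0},
}

# One context in which a morphosyntactic and a text-type feature co-occur.
probs = choice_probabilities(WEIGHTS, {"first_person_subject", "newspaper_text"})
for lexeme, p in sorted(probs.items()):
    print(f"{lexeme}: {p:.3f}")
```

Because all feature types enter the same score, a model of this kind can weigh lexical, syntactic, semantic, morphological, and text-type factors jointly rather than one category at a time.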