Weighted Finite-State Morphological Analysis of Finnish Compounding with HFST-LEXC

Publication Type: Conference Paper
Year of Publication: 2009
Authors: Lindén, K, Pirinen, T
Editor: Jokinen, K, Bick, E
Conference Name: Nordic Conference of Computational Linguistics NODALIDA 2009
Publisher: Northern European Association for Language Technology (NEALT)
Conference Location: Odense, Denmark

Finnish has highly productive compounding and a rich inflectional system, which causes ambiguity when compounds are segmented morphologically with finite-state transducer methods. To disambiguate the compound segmentations, we compare three strategies, all cast in the same probabilistic framework and compared here for the first time. We present a method for implementing the probabilistic framework as part of the process of building LexC-style morpheme sub-lexicons, which yields weighted lexical transducers. To implement the structurally disambiguating morphological analyzer, we use the HFST-LEXC tool, which is part of the open-source Helsinki Finite-State Technology. Using our Finnish test corpus of 53 270 compounds, we demonstrate that non-compound token probabilities can be used to disambiguate the compounding structure. Non-compound token probabilities are easier to obtain from raw data than the probabilities of prefixes of segmented and disambiguated compounds.
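
The idea of using non-compound token probabilities to choose between competing segmentations can be illustrated with a small sketch. It is not the paper's HFST-LEXC implementation: the token counts are hypothetical, the function names are invented for illustration, and the segmentation works on plain surface strings rather than on a weighted lexical transducer. Each token receives the weight -log(p), as is customary for weighted transducers in the tropical semiring, and the segmentation with the smallest total weight is preferred.

```python
import math

# Hypothetical counts of non-compound tokens in raw text (illustration only,
# not taken from the paper's corpus).
TOKEN_COUNTS = {"kaivos": 30, "aukko": 40, "kaivo": 20, "saukko": 5, "talo": 5}


def token_weights(counts):
    """Turn raw token counts into weights -log(p); a lower weight means more probable."""
    total = sum(counts.values())
    return {token: -math.log(count / total) for token, count in counts.items()}


def segmentations(word, lexicon):
    """Enumerate every way to split `word` into a sequence of tokens found in `lexicon`."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([head] + rest)
    return results


def best_segmentation(word, weights):
    """Return the segmentation with the smallest summed weight, or None if there is none."""
    candidates = segmentations(word, weights)
    if not candidates:
        return None
    return min(candidates, key=lambda parts: sum(weights[p] for p in parts))


WEIGHTS = token_weights(TOKEN_COUNTS)

# "kaivosaukko" can be read as kaivos + aukko or as kaivo + saukko.
for parts in segmentations("kaivosaukko", WEIGHTS):
    print(parts, round(sum(WEIGHTS[p] for p in parts), 3))
print("preferred:", best_segmentation("kaivosaukko", WEIGHTS))
```

With these toy counts the reading kaivos + aukko receives the lower total weight, because both of its parts are more frequent as stand-alone tokens than the parts of kaivo + saukko. Summing the per-token weights corresponds to multiplying the token probabilities, which is a simplifying assumption of this sketch.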