The need for annotated data has been widely recognised by the CL community, especially for the purpose of training and testing statistical methods.
The approach described in this paper differs from the mainstream in that it focuses not on developing a corpus but on developing efficient tools that produce high-quality annotated data. However, because we use linguistic, rule-based methods, we stress the importance of an adequate linguistic description and a well-defined annotation scheme.
Our approach is based on long-term research conducted by a team of researchers in Helsinki since the late 1980s. The formalisms and parsing systems developed at the University of Helsinki are:
• Constraint Grammar (CG)
• Finite-State Intersection Grammar
• Functional Dependency Grammar (FDG).
From the very beginning, the idea was to develop robust tools for annotating unrestricted text. It is characteristic of all the approaches mentioned above that comprehensive hand-crafted grammars were built and a separate, language-independent parsing engine was developed to optimise parsing efficiency. When the good results of the English Constraint Grammar (ENGCG) parser became known in the research community, this system was chosen to annotate the Bank of English corpus in the early 1990s.
This paper describes a new annotation scheme that is suitable for creating multi-layered linguistic descriptions such as linguistic treebanks. The new scheme, based on the earlier research and practical implementation known as Functional Dependency Grammar, is centred on a dependency-based syntactic description that is used to structure linguistic information. The scheme is applied to a comprehensive description of English in an existing tool, a syntactic parser called Machinese Semantics. The tool and the annotation scheme were initially developed and tested in the machine translation project Lingmachine (MLIS-5008).
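To make the idea of a dependency-centred, multi-layered description more concrete, the following minimal sketch shows one way such annotations could be represented. It is an illustration only and does not reproduce the scheme or the output format of Machinese Semantics: the class `Token`, the helper functions, the layer names (`lemma`, `pos`) and the relation labels (`det`, `subj`, `obj`, `main`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    """One word carrying several annotation layers (hypothetical layout)."""
    index: int                  # 1-based position in the sentence
    form: str                   # surface form as it appears in the text
    head: Optional[int]         # index of the governing token, None for the root
    relation: str               # label of the dependency link to the head
    layers: Dict[str, str] = field(default_factory=dict)  # further layers, e.g. lemma


def dependents(sentence: List[Token], head_index: int) -> List[Token]:
    """Return the tokens directly governed by the token at head_index."""
    return [t for t in sentence if t.head == head_index]


def root(sentence: List[Token]) -> Token:
    """Return the single token that has no head."""
    roots = [t for t in sentence if t.head is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


# Invented analysis of "The parser annotates text"; all labels are assumptions.
sentence = [
    Token(1, "The", 2, "det", {"lemma": "the", "pos": "DET"}),
    Token(2, "parser", 3, "subj", {"lemma": "parser", "pos": "N"}),
    Token(3, "annotates", None, "main", {"lemma": "annotate", "pos": "V"}),
    Token(4, "text", 3, "obj", {"lemma": "text", "pos": "N"}),
]
```

In this sketch the dependency links (`head`, `relation`) form the backbone of the description, and any further layer is attached to the token through the `layers` dictionary, so new layers can be added without changing the tree structure.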