Machinese Phrase Tagger is a set of program components that performs basic linguistic analysis tasks at very high speed and provides relevant information about words and concepts to volume-intensive applications.
Machinese Phrase Tagger splits raw text into word units and provides the possible base forms and word classes for each word. It also disambiguates, i.e. selects the correct base form and class for each word that can have more than one interpretation. For example, the word "thought" can be either a form of the noun "thought" or a form of the verb "to think". In addition, it identifies the head words of a sentence.
Here is an example of Machinese Phrase Tagger output for the sentence "I thought Vlad was happy with this new agreement.":
0   1  I          I          @NH       PRON
2   7  thought    think      @MAIN     V IND PAST
10  4  Vlad       Vlad       @NH       N Prop       NP-Single
15  3  was        be         @MAIN     V IND PAST
19  5  happy      happy      @NH       A
25  4  with       with       @POSTMOD  PREP
30  4  this       this       @PREMOD   PRON
35  3  new        new        @PREMOD   A            NP-First
39  9  agreement  agreement  @NH       N            NP-Last
48  1  .          .          SB
The first column shows the token's position in the text (counted in characters), the second gives the length of the token, the third contains the text form, and the fourth shows the base form from which the text form is derived. The fifth column denotes the syntactic role of the token (@NH = nominal head, @MAIN = main verb, @POSTMOD = postmodifier, @PREMOD = premodifier), while the following columns show the morphological information in the form of a part-of-speech tag (PRON = pronoun, V = verb, DET = determiner, N = noun, A = adjective, PREP = preposition) and inflectional information (in the example, present only for verbs: IND = indicative, PAST = past tense). For tokens that belong to a noun phrase, this is indicated in a separate column: NP-Single = single-word NP, NP-First = first word of an NP, NP-Last = last word of an NP. Finally, there is a column that marks the sentence break (tag SB).
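The column layout described above can be read back into structured records with a few lines of code. The following is only a minimal sketch, not part of the product: it assumes the output is delivered as one token per line with whitespace-separated fields, and the names `Token`, `parse_line` and `NP_TAGS` are hypothetical helpers introduced for illustration. Tokens that themselves contain whitespace (for example multi-word terms) would need a different splitting strategy.

```python
from dataclasses import dataclass
from typing import List, Optional

# Noun phrase markers as they appear in the example output.
NP_TAGS = {"NP-Single", "NP-First", "NP-Last"}


@dataclass
class Token:
    offset: int                  # character position in the input text
    length: int                  # token length in characters
    text: str                    # text form
    baseform: str                # base form
    role: Optional[str]          # syntactic role such as "@NH", or None
    tags: List[str]              # part of speech and inflection tags
    np_position: Optional[str]   # NP-Single / NP-First / NP-Last, or None
    sentence_break: bool         # True if the line carries the SB tag


def parse_line(line: str) -> Token:
    """Parse one output line (assumed format: whitespace-separated fields)."""
    fields = line.split()
    rest = fields[4:]  # everything after offset, length, text, base form
    role = next((f for f in rest if f.startswith("@")), None)
    np_position = next((f for f in rest if f in NP_TAGS), None)
    tags = [f for f in rest
            if f != role and f not in NP_TAGS and f != "SB"]
    return Token(
        offset=int(fields[0]),
        length=int(fields[1]),
        text=fields[2],
        baseform=fields[3],
        role=role,
        tags=tags,
        np_position=np_position,
        sentence_break="SB" in rest,
    )


SENTENCE = "I thought Vlad was happy with this new agreement."

SAMPLE = """\
0 1 I I @NH PRON
2 7 thought think @MAIN V IND PAST
10 4 Vlad Vlad @NH N Prop NP-Single
15 3 was be @MAIN V IND PAST
19 5 happy happy @NH A
25 4 with with @POSTMOD PREP
30 4 this this @PREMOD PRON
35 3 new new @PREMOD A NP-First
39 9 agreement agreement @NH N NP-Last
48 1 . . SB
"""

tokens = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# The offset and length columns can be used to map tokens back to the text.
for tok in tokens:
    assert SENTENCE[tok.offset:tok.offset + tok.length] == tok.text
```

Because the offset and length refer to the original text, an application can highlight or replace a token in place without re-tokenizing the input.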
Machinese Phrase Tagger contains a custom lexicon mechanism, which enables developers to add their own words to the parser. These words can be, for example, domain-specific vocabulary, multi-word terms, personal names, or place names. In this way developers can influence how the parser analyses texts.