Machinese Phrase Tagger

Machinese Phrase Tagger is a set of program components that performs basic linguistic analysis tasks at very high speed and provides relevant information about words and concepts to volume-intensive applications.

Machinese Phrase Tagger splits raw text into word units and provides the possible base forms and word classes for each word. It also disambiguates, that is, it selects the correct base form and class for each word that can have more than one interpretation. For example, the word "thought" can be either a form of the noun "thought" or a form of the verb "to think". In addition, it identifies the head words of a sentence.

Here is an example of Machinese Phrase Tagger output for the sentence "I thought Vlad was happy with this new agreement.":

0       1       I           I           @NH         PRON
2       7       thought     think       @MAIN       V       IND     PAST
10      4       Vlad        Vlad        @NH         N       Prop            NP-Single
15      3       was         be          @MAIN       V       IND     PAST
19      5       happy       happy       @NH         A
25      4       with        with        @POSTMOD    PREP
30      4       this        this        @PREMOD     PRON
35      3       new         new         @PREMOD     A                       NP-First
39      9       agreement   agreement   @NH         N                       NP-Last
48      1       .           .                                                           SB

The first column shows the token's position in the text (counted in characters, starting from zero), the second gives the length of the token, the third contains the text form, and the fourth shows the base form from which the text form is derived. The fifth column denotes the syntactic role of the token (@NH = nominal head, @MAIN = main verb, @POSTMOD = postmodifier, @PREMOD = premodifier), while the following columns show the morphological information: a part-of-speech tag (PRON = pronoun, V = verb, DET = determiner, N = noun, A = adjective, PREP = preposition) and inflectional information (in the example present only for verbs: IND = indicative, PAST = past tense). For tokens that belong to a noun phrase, a separate column records this: NP-Single = single-word NP, NP-First = first word of an NP, NP-Last = last word of an NP. Finally, there is a column that marks the sentence break (tag SB).
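The column layout described above can be read back into structured records with a few lines of code. The following Python sketch is only an illustration and not part of Machinese Phrase Tagger: the names Token, parse_line, parse_output and noun_phrases are hypothetical. It assumes that columns are separated by whitespace, that tokens themselves contain no whitespace, and that tags can be told apart by their shape ("@" prefix for the syntactic role, "NP-" prefix for noun phrase tags, "SB" for the sentence break). It also assumes that any untagged words between NP-First and NP-Last belong to the same noun phrase, which the example above does not show.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The example output from above, copied as plain whitespace-separated text.
SAMPLE_OUTPUT = """\
0   1  I          I          @NH      PRON
2   7  thought    think      @MAIN    V  IND  PAST
10  4  Vlad       Vlad       @NH      N  Prop  NP-Single
15  3  was        be         @MAIN    V  IND  PAST
19  5  happy      happy      @NH      A
25  4  with       with       @POSTMOD PREP
30  4  this       this       @PREMOD  PRON
35  3  new        new        @PREMOD  A  NP-First
39  9  agreement  agreement  @NH      N  NP-Last
48  1  .          .          SB
"""


@dataclass
class Token:  # hypothetical record type, not a vendor API
    offset: int
    length: int
    text: str
    baseform: str
    role: Optional[str] = None      # e.g. @NH, @MAIN
    pos: Optional[str] = None       # e.g. PRON, V, N
    features: List[str] = field(default_factory=list)  # e.g. IND, PAST
    np_tag: Optional[str] = None    # NP-Single, NP-First, NP-Last
    sentence_break: bool = False


def parse_line(line: str) -> Token:
    """Parse one output line; tags are classified by their shape (assumption)."""
    fields = line.split()
    token = Token(int(fields[0]), int(fields[1]), fields[2], fields[3])
    for tag in fields[4:]:
        if tag.startswith("@"):
            token.role = tag
        elif tag.startswith("NP-"):
            token.np_tag = tag
        elif tag == "SB":
            token.sentence_break = True
        elif token.pos is None:
            token.pos = tag          # first plain tag is the part of speech
        else:
            token.features.append(tag)
    return token


def parse_output(output: str) -> List[Token]:
    return [parse_line(line) for line in output.splitlines() if line.strip()]


def noun_phrases(tokens: List[Token]) -> List[str]:
    """Collect noun phrases from the NP-Single / NP-First / NP-Last tags."""
    phrases: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok.np_tag == "NP-Single":
            phrases.append(tok.text)
        elif tok.np_tag == "NP-First":
            current = [tok.text]
        elif tok.np_tag == "NP-Last":
            current.append(tok.text)
            phrases.append(" ".join(current))
            current = []
        elif current:
            current.append(tok.text)  # assumed: untagged word inside an NP
    return phrases


tokens = parse_output(SAMPLE_OUTPUT)
sentence = "I thought Vlad was happy with this new agreement."
for tok in tokens:
    # Offset and length point back into the original text.
    assert sentence[tok.offset:tok.offset + tok.length] == tok.text
print(noun_phrases(tokens))  # prints ['Vlad', 'new agreement']
```

Because every token carries its character offset and length, an application can map any analysis result back to the exact span in the original text, for instance to highlight the noun phrases found.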

Machinese Phrase Tagger contains a custom lexicon mechanism that enables developers to add their own words to the parser. These can be, for example, domain-specific vocabularies, multi-word terms, names, and places. In this way, developers can influence how the parser analyses texts.