The following sections of this manual explain how to read the output of Machinese analysers. The examples use the command line programs, but for the most part you can achieve the same results using the Machinese COM or library application programming interfaces.
Tokenisation
The most basic information Machinese analysers return is the boundaries of words and sentences. Tokenisation rules in Machinese analysers may divide some of the words in the input text into two tokens - for example, "isn't" is analysed to contain two tokens "is" and "n't". Sentence breaks are marked with special tags.
In Machinese Phrase Tagger, sentence boundaries are marked with the tag SB, or with the self-explanatory phrase "sentence boundary" if you are using the prose output format.
Machinese Syntax marks sentence boundaries with the tag <s>. It also detects paragraph ends and marks them with the tag <p>.
For example, Machinese Phrase Tagger gives the following analysis for the sentence "Isn't Dr. Spock here?":
0 | 2 | Is | be | @MAIN | V IND PRES |
2 | 3 | n't | not | @ADVL | ADV |
6 | 3 | Dr. | dr. | @NH | N Abbr NP-Single |
10 | 5 | Spock | Spock | @NH | N Prop NP-Single |
16 | 4 | here | here | @ADVL | ADV |
20 | 1 | ? | ? | SB |
The first two columns in this default output format show the start position of the token and its length in characters. From this information, you can see that the first two tokens are written together as a single word in the input text. You can also see that the dot in "Dr." does not mark the end of the sentence, whereas the question mark does, as it is marked with the tag SB.
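The start and length columns lend themselves to simple post-processing. The following is a minimal Python sketch, assuming the default output is available as text lines whose fields are separated by "|" exactly as displayed above (the real separator may differ). The helper names parse_line, glued_pairs and sentences are hypothetical and not part of any Machinese interface.

```python
from typing import NamedTuple


class Token(NamedTuple):
    start: int    # character offset of the token in the input text
    length: int   # token length in characters
    text: str     # surface form
    lemma: str    # base form
    syntax: str   # syntactic tag, or SB for a sentence boundary
    tags: list    # part-of-speech and morphological tags


def parse_line(line):
    """Parse one row of the (assumed) pipe-separated default output.

    A token that is itself a "|" character is not handled by this sketch.
    """
    fields = [f.strip() for f in line.strip().split("|")]
    fields += [""] * (6 - len(fields))  # pad rows that have fewer columns
    start, length, text, lemma, syntax, tags = fields[:6]
    return Token(int(start), int(length), text, lemma, syntax, tags.split())


def glued_pairs(tokens):
    """Return (previous, current) surface forms with no gap between them."""
    return [(a.text, b.text) for a, b in zip(tokens, tokens[1:])
            if a.start + a.length == b.start]


def sentences(tokens):
    """Group surface forms into sentences, closing one at each SB tag."""
    result, current = [], []
    for tok in tokens:
        current.append(tok.text)
        if tok.syntax == "SB":
            result.append(current)
            current = []
    if current:  # trailing tokens without a final boundary
        result.append(current)
    return result


# Rows copied from the "Isn't Dr. Spock here?" example above.
SAMPLE = """\
0 | 2 | Is | be | @MAIN | V IND PRES |
2 | 3 | n't | not | @ADVL | ADV |
6 | 3 | Dr. | dr. | @NH | N Abbr NP-Single |
10 | 5 | Spock | Spock | @NH | N Prop NP-Single |
16 | 4 | here | here | @ADVL | ADV |
20 | 1 | ? | ? | SB |
"""

tokens = [parse_line(line) for line in SAMPLE.splitlines()]
print(glued_pairs(tokens))
print(sentences(tokens))
```

For the sample rows, glued_pairs reports both Is/n't and here/?, that is, the split contraction as well as the sentence-final punctuation attached to the preceding word.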
Base Forms
In most languages, words are inflected in various case forms to form sensible phrases. However, in many applications it is desirable to find and match all occurrences of the same word despite the fact that the inflected word forms in the text are quite different - take for example irregular verbs like "be", which can also appear in forms like "is", "was" and "were". Luckily, Machinese analysers return the base form (also known as lemma) for any word.
All Machinese output formats will show the base form information, although there are small differences in the display properties. For example, in the analysis in the default output format of Machinese Phrase Tagger, the base form is shown in the fourth column.
0 | 5 | Uuden | uusi | @PREMOD | A NP-First |
6 | 11 | bussiyhtiön | bussi yhtiö | @NH | N NP-Last |
18 | 2 | on | olla | @MAIN | V IND PRES |
21 | 5 | määrä | määrä | @NH | N NP-Single |
27 | 8 | aloittaa | alkaa | @MAIN | V INF |
36 | 11 | toimintansa | toiminta | @NH | N PL NP-Single |
48 | 4 | heti | heti | @ADVL | ADV |
53 | 6 | vuoden | vuosi | @PREMOD | N NP-First |
60 | 8 | vaihteen | vaihde | @NH | N NP-Last |
69 | 7 | jälkeen | jälkeen | @PREMARK | PREP |
76 | 1 | . | . | SB |
Note also that Machinese analysers mark the word boundaries within the base forms of compound words (like the word bussiyhtiö in the above Finnish example) with the non-breaking space character (U+00A0).
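Matching on base forms, including the parts of compounds, can be sketched as follows. This is an illustration only: the (surface form, base form) pairs are copied by hand from the Finnish example above, the compound boundary is written explicitly as U+00A0, and compound_parts and lemma_index are hypothetical helper names.

```python
NBSP = "\u00a0"  # compound-boundary marker described in the text


def compound_parts(lemma):
    """Split a base form into its compound parts at non-breaking spaces."""
    return lemma.split(NBSP)


def lemma_index(pairs):
    """Map each base form, and each compound part, to the surface forms seen."""
    index = {}
    for surface, lemma in pairs:
        for key in {lemma, *compound_parts(lemma)}:
            index.setdefault(key, []).append(surface)
    return index


# (surface form, base form) pairs taken from the Finnish example above.
PAIRS = [
    ("Uuden", "uusi"),
    ("bussiyhtiön", "bussi\u00a0yhtiö"),
    ("on", "olla"),
    ("vuoden", "vuosi"),
]

index = lemma_index(PAIRS)
print(compound_parts("bussi\u00a0yhtiö"))
print(index["yhtiö"])
```

Splitting on the explicit character matters here: in Python, calling split() without an argument treats U+00A0 as ordinary whitespace, so a generic whitespace split would silently break compound base forms apart.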
Part of Speech
Part of speech (PoS) or word class information describes what role the word has in the phrase or sentence. As there are several words which can occur in text in multiple PoS categories, Machinese analysers will use the context information to determine which role the word has in the sentence.
All Machinese output formats will show the part-of-speech information, although there are small differences in the display properties.
For example, the following Machinese Phrase Tagger analysis of the sentence 'I saw him walking with a saw.' demonstrates that the analyser distinguishes the two different senses of the word 'saw':
0 | 1 | I | I | @NH | PRON |
2 | 3 | saw | see | @MAIN | V IND PAST |
6 | 3 | him | he | @NH | PRON |
10 | 7 | walking | walk | @MAIN | V PCP PROG |
18 | 4 | with | with | @PREMARK | PREP |
23 | 1 | a | a | @PREMOD | DET |
25 | 3 | saw | saw | @NH | N NP-Single |
28 | 1 | . | . | SB |
For a more detailed description of PoS analysis in Machinese analysers, please see the Machinese Language Model manual.
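To collect the readings of an ambiguous word programmatically, something like the following sketch could be used. It assumes the pipe-separated layout displayed above and, judging only from the examples in this manual, that the first tag in the sixth column is the part of speech. The function name readings is hypothetical.

```python
from collections import defaultdict


def readings(lines):
    """Map each lower-cased surface form to its (base form, PoS) readings."""
    result = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.strip().split("|")]
        if len(fields) < 6 or not fields[5]:
            continue  # e.g. the sentence-boundary row has no tag column
        text, lemma, tags = fields[2], fields[3], fields[5].split()
        # Assumption: the first tag of the sixth column is the part of speech.
        result[text.lower()].append((lemma, tags[0]))
    return dict(result)


# Rows copied from the 'I saw him walking with a saw.' example above.
SAMPLE = """\
0 | 1 | I | I | @NH | PRON |
2 | 3 | saw | see | @MAIN | V IND PAST |
6 | 3 | him | he | @NH | PRON |
10 | 7 | walking | walk | @MAIN | V PCP PROG |
18 | 4 | with | with | @PREMARK | PREP |
23 | 1 | a | a | @PREMOD | DET |
25 | 3 | saw | saw | @NH | N NP-Single |
28 | 1 | . | . | SB |
"""

by_word = readings(SAMPLE.splitlines())
print(by_word["saw"])
```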
Morphology
Morphological information tells the details of the word forms used in the text.
Please note that Machinese Phrase Tagger shows only a limited number of morphological features, while Machinese Syntax offers a much more detailed analysis. For example, in the prose output mode Machinese Phrase Tagger gives the following analysis for the French sentence 'C'est la vie.':
0 | 2 | C' | ce | @NH | PRON |
2 | 3 | est | être | @MAIN | V IND PRES |
6 | 2 | la | la | @PREMOD | DET |
9 | 3 | vie | vie | @NH | N NP-Single |
12 | 1 | . | . | SB |
Machinese Syntax, in contrast, provides the following analysis, which includes many morphological features that Machinese Phrase Tagger omits:
1 | C' | ce | subj>2 | @NH PRON Dem MSC SG |
2 | est | être | main>0 | @MAIN V IND PRES SG P3 |
3 | la | la | det>4 | @PREMOD DET Art Def FEM SG |
4 | vie | vie | comp>2 | @NH N FEM SG |
5 | . | . | ||
6 | <s> | <s> |
A detailed description of these analyses can be found in the Machinese Language Model manual.
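The difference between the two analyses can be made explicit by comparing tag sets. The sketch below is an illustration under assumptions: the tag strings for "est" are copied by hand from the two outputs above (for Machinese Phrase Tagger, the fifth and sixth columns are joined), and split_tags is a hypothetical helper that treats the tag starting with "@" as the syntactic function and the first remaining tag as the part of speech.

```python
def split_tags(field):
    """Split a tag field into (syntactic function, part of speech, features)."""
    tags = field.split()
    function = next((t for t in tags if t.startswith("@")), None)
    rest = [t for t in tags if not t.startswith("@")]
    pos = rest[0] if rest else None
    return function, pos, rest[1:]


# Tag fields for "est", copied from the two analyses above.
phrase_tagger = split_tags("@MAIN V IND PRES")
syntax = split_tags("@MAIN V IND PRES SG P3")

# Morphological features present only in the Machinese Syntax analysis.
extra = [f for f in syntax[2] if f not in phrase_tagger[2]]
print(extra)
```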
Syntax
Whereas part of speech and morphology give information on individual words, syntax describes the relations between words within phrases or sentences. Machinese Phrase Tagger tells what syntactic function each word has. Machinese Syntax gives a more detailed description, which specifies the role of each syntactic relation and the words that the relation connects.
If you compare the analyses shown in the morphology section above, you can see that the fourth field in the Machinese Syntax output lists the abbreviated name of the syntactic relation and the number of the word the relation points to.
A detailed description of the syntactic analyses that Machinese analysers produce can be found in the Machinese Language Model manual.
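The relation field of the Machinese Syntax output can be turned into dependency edges with a few lines of code. This is a minimal sketch, assuming the pipe-separated layout shown in the morphology section and the relation>number notation seen there. That a target of 0 (as in main>0) means "no head word" is an assumption read off the example, and parse_rows and edges are hypothetical names.

```python
import re

DEP = re.compile(r"^([^>\s]+)>(\d+)$")  # e.g. "subj>2"


def parse_rows(lines):
    """Map word number -> (surface form, relation name, head word number)."""
    rows = {}
    for line in lines:
        fields = [f.strip() for f in line.strip().split("|")]
        fields += [""] * (4 - len(fields))  # pad short rows such as <s>
        index, text, _lemma, dep = fields[:4]
        match = DEP.match(dep)
        if match:
            rows[int(index)] = (text, match.group(1), int(match.group(2)))
        else:
            rows[int(index)] = (text, None, None)  # punctuation, <s>, <p>
    return rows


def edges(rows):
    """Return (dependent, relation, head) triples; head is None for target 0."""
    result = []
    for text, relation, head in rows.values():
        if relation is None:
            continue
        head_text = rows[head][0] if head in rows else None
        result.append((text, relation, head_text))
    return result


# Rows copied from the Machinese Syntax analysis of 'C'est la vie.' above.
SAMPLE = """\
1 | C' | ce | subj>2 | @NH PRON Dem MSC SG |
2 | est | être | main>0 | @MAIN V IND PRES SG P3 |
3 | la | la | det>4 | @PREMOD DET Art Def FEM SG |
4 | vie | vie | comp>2 | @NH N FEM SG |
5 | . | . | ||
6 | <s> | <s> |
"""

for edge in edges(parse_rows(SAMPLE.splitlines())):
    print(edge)
```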
Noun Phrases
In information retrieval applications, the interesting part of the text is usually nouns or noun phrases, as these describe which subjects and objects the text is about. Single nouns are obviously easy to pick from text (see information on part of speech), but often noun phrases tell more as they combine the information of adjacent words. Machinese Phrase Tagger makes finding noun phrases easy as it includes noun phrase detection, which marks where a noun phrase starts, ends and which words in between belong to that noun phrase.
Take, for example, the sentence 'The bloody failure of the police state run by Ceausescu could have become a drawn-out civil war.' When it is analysed with Machinese Phrase Tagger, the default output format looks like the following:
0 | 3 | The | the | @PREMOD | DET |
4 | 6 | bloody | bloody | @PREMOD | A NP-First |
11 | 7 | failure | failure | @NH | N NP-Internal |
19 | 2 | of | of | @POSTMOD | PREP NP-Internal |
22 | 3 | the | the | @PREMOD | DET NP-Internal |
26 | 6 | police | police | @PREMOD | N PL NP-Internal |
33 | 5 | state | state | @NH | N NP-Last |
39 | 3 | run | run | @MAIN | V PCP PERF |
43 | 2 | by | by | @PREMARK | PREP |
46 | 9 | Ceausescu | Ceausescu | @NH | N Prop NP-Single |
56 | 5 | could | could | @AUX | V IND PRES |
62 | 4 | have | have | @AUX | V INF |
67 | 6 | become | become | @MAIN | V PCP PERF |
74 | 1 | a | a | @PREMOD | DET |
77 | 9 | drawn-out | drawn out | @PREMOD | A NP-First |
87 | 5 | civil | civil | @PREMOD | A NP-Internal |
93 | 3 | war | war | @NH | N NP-Last |
96 | 1 | . | . | SB |
In the output, you can see the following tags, which denote parts of noun phrase constructs:
- NP-Single: a single noun.
- NP-First: the first word of a multi-word noun phrase.
- NP-Last: the last word of a multi-word noun phrase.
- NP-Internal: if a noun phrase is longer than two words, the words between the first and the last word of the noun phrase are marked with this tag.
Following these tags, you can extract the three noun phrases of the above sentence: 'bloody failure of the police state', 'Ceausescu', and 'drawn-out civil war'.
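The extraction described above can be sketched as a small state machine over the NP tags. The code assumes the pipe-separated layout displayed in this manual and joins surface forms with single spaces. The function name noun_phrases is hypothetical.

```python
def noun_phrases(lines):
    """Collect noun phrases by following the NP-* tags in the sixth column."""
    phrases, current = [], []
    for line in lines:
        fields = [f.strip() for f in line.strip().split("|")]
        if len(fields) < 6:
            continue
        text, tags = fields[2], set(fields[5].split())
        if "NP-Single" in tags:
            phrases.append(text)
        elif "NP-First" in tags:
            current = [text]           # open a new multi-word phrase
        elif "NP-Internal" in tags and current:
            current.append(text)
        elif "NP-Last" in tags and current:
            current.append(text)       # close the phrase and store it
            phrases.append(" ".join(current))
            current = []
    return phrases


# Rows copied from the example analysis above.
SAMPLE = """\
0 | 3 | The | the | @PREMOD | DET |
4 | 6 | bloody | bloody | @PREMOD | A NP-First |
11 | 7 | failure | failure | @NH | N NP-Internal |
19 | 2 | of | of | @POSTMOD | PREP NP-Internal |
22 | 3 | the | the | @PREMOD | DET NP-Internal |
26 | 6 | police | police | @PREMOD | N PL NP-Internal |
33 | 5 | state | state | @NH | N NP-Last |
39 | 3 | run | run | @MAIN | V PCP PERF |
43 | 2 | by | by | @PREMARK | PREP |
46 | 9 | Ceausescu | Ceausescu | @NH | N Prop NP-Single |
56 | 5 | could | could | @AUX | V IND PRES |
62 | 4 | have | have | @AUX | V INF |
67 | 6 | become | become | @MAIN | V PCP PERF |
74 | 1 | a | a | @PREMOD | DET |
77 | 9 | drawn-out | drawn out | @PREMOD | A NP-First |
87 | 5 | civil | civil | @PREMOD | A NP-Internal |
93 | 3 | war | war | @NH | N NP-Last |
96 | 1 | . | . | SB |
"""

print(noun_phrases(SAMPLE.splitlines()))
# ['bloody failure of the police state', 'Ceausescu', 'drawn-out civil war']
```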
Proper Noun Detection
Often it is desirable to find names in a text. The solution is to use the information from Machinese Phrase Tagger's proper noun detection. The output formats include the proper noun information alongside the morphological information. The tag used to mark proper nouns is Prop, and in the prose output the same is described with the term "proper noun". When combined with the noun phrase information, it is possible to pick up multiword names, for example pairs of first and last names.
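One possible way to combine the two kinds of information is sketched below. The input is a list of (surface form, tags) pairs, with the tags copied from the "Isn't Dr. Spock here?" example earlier in this manual. The function proper_names is a hypothetical helper, and its handling of multiword names (consecutive Prop words inside one NP-First ... NP-Last span) is an assumption, since this manual shows no analysed multiword name to check it against.

```python
def proper_names(tokens):
    """Join consecutive Prop words that belong to the same noun phrase."""
    names, run = [], []

    def flush():
        if run:
            names.append(" ".join(run))
            run.clear()

    for text, tags in tokens:
        if "Prop" not in tags:
            flush()                     # a non-name word ends the current name
            continue
        if "NP-First" in tags or "NP-Single" in tags:
            flush()                     # a new noun phrase starts here
        run.append(text)
        if "NP-Last" in tags or "NP-Single" in tags:
            flush()                     # the noun phrase ends here
    flush()
    return names


# (surface form, tags) pairs taken from the "Isn't Dr. Spock here?" example.
TOKENS = [
    ("Is", ["V", "IND", "PRES"]),
    ("n't", ["ADV"]),
    ("Dr.", ["N", "Abbr", "NP-Single"]),
    ("Spock", ["N", "Prop", "NP-Single"]),
    ("here", ["ADV"]),
    ("?", []),
]

print(proper_names(TOKENS))
```

In this example the abbreviation "Dr." is not tagged Prop and forms its own single-word noun phrase, so only "Spock" is reported.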