How to Read Machinese Analysers' Output

The following sections of this manual explain how to read the output of Machinese analysers. The examples use the command line programs, but for the most part you can achieve the same results using the Machinese COM or library application programming interfaces.

Tokenisation

The most basic information Machinese analysers return is the boundaries of words and sentences. Tokenisation rules in Machinese analysers may divide some of the words in the input text into two tokens - for example, "isn't" is analysed to contain two tokens "is" and "n't". Sentence breaks are marked with special tags.

In Machinese Phrase Tagger, sentence boundaries are marked with the tag SB, or with the self-explanatory phrase "sentence boundary" in the prose output format.

Machinese Syntax marks sentence boundaries with the tag <s>. It also detects paragraph ends and marks them with the tag <p>.

For example, Machinese Phrase Tagger gives the following analysis for the sentence "Isn't Dr. Spock here?":

0   2  Is     be     @MAIN V IND PRES
2   3  n't    not    @ADVL ADV
6   3  Dr.    dr.    @NH N Abbr NP-Single
10  5  Spock  Spock  @NH N Prop NP-Single
16  4  here   here   @ADVL ADV
20  1  SB

The first two columns in this default output format show the start position of the token and its length in characters. From this information, you can see that the first two tokens are written together as a single word in the input text. Also, you can see that the dot in "Dr." does not mark the end of the sentence, but the question mark does as it is marked with the tag SB.
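The offset columns are easy to use programmatically. The following Python sketch is not part of the Machinese distribution. It assumes that the columns of the default output are separated by tab characters, which you should verify against your own output, and the names Row, parse_rows and joined_pairs are invented for this illustration. It finds tokens that were written together in the input by checking whether one token ends exactly where the next one starts.

```python
from typing import NamedTuple


class Row(NamedTuple):
    start: int    # start position of the token in the input text
    length: int   # length of the token in characters
    fields: list  # remaining columns: token, base form, tags (or just "SB")


# Analysis of "Isn't Dr. Spock here?" as shown above.
# ASSUMPTION: columns are tab-separated; change the delimiter if yours differ.
SAMPLE = "\n".join("\t".join(cols) for cols in [
    ("0", "2", "Is", "be", "@MAIN V IND PRES"),
    ("2", "3", "n't", "not", "@ADVL ADV"),
    ("6", "3", "Dr.", "dr.", "@NH N Abbr NP-Single"),
    ("10", "5", "Spock", "Spock", "@NH N Prop NP-Single"),
    ("16", "4", "here", "here", "@ADVL ADV"),
    ("20", "1", "SB"),
])


def parse_rows(output):
    """Turn the analyser output into a list of Row tuples."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        start, length, *fields = line.split("\t")
        rows.append(Row(int(start), int(length), fields))
    return rows


def is_boundary(row):
    """A sentence boundary row carries only the tag SB."""
    return row.fields == ["SB"]


def joined_pairs(rows):
    """Pairs of word tokens with no gap between them in the input text."""
    words = [r for r in rows if not is_boundary(r)]
    return [(a.fields[0], b.fields[0])
            for a, b in zip(words, words[1:])
            if a.start + a.length == b.start]


rows = parse_rows(SAMPLE)
print(joined_pairs(rows))                          # tokens written together
print([r.start for r in rows if is_boundary(r)])   # sentence boundary offsets
```

Working from offsets rather than from the token text means the original spelling and spacing can always be recovered from the input.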

Base Forms

In most languages, words are inflected into various forms to build well-formed phrases. In many applications, however, it is desirable to find and match all occurrences of the same word even when the inflected forms in the text look quite different. Take, for example, an irregular verb like "be", which also appears in forms like "is", "was" and "were". For this purpose, Machinese analysers return the base form (also known as the lemma) of every word.
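As a minimal sketch of why this is useful, the Python fragment below groups surface forms under their base form. The (token, base form) pairs are written by hand following the "be" example above rather than read from real analyser output, and the function name index_by_base_form is invented for this illustration.

```python
from collections import defaultdict


def index_by_base_form(pairs):
    """Map each base form to the surface forms it was seen as, in order."""
    index = defaultdict(list)
    for token, base in pairs:
        index[base].append(token)
    return dict(index)


# Hand-written (token, base form) pairs for illustration.
pairs = [("Is", "be"), ("was", "be"), ("were", "be"), ("here", "here")]
index = index_by_base_form(pairs)
print(index["be"])
```

A search for the base form "be" then matches every inflected occurrence without listing the forms by hand.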

All Machinese output formats show the base form, although the way it is displayed differs slightly between formats. For example, in the default output format of Machinese Phrase Tagger, the base form is shown in the fourth column:

0   5   Uuden        uusi         @PREMOD A NP-First
6   11  bussiyhtiön  bussi yhtiö  @NH N NP-Last
18  2   on           olla         @MAIN V IND PRES
21  5   määrä        määrä        @NH N NP-Single
27  8   aloittaa     alkaa        @MAIN V INF
36  11  toimintansa  toiminta     @NH N PL NP-Single
48  4   heti         heti         @ADVL ADV
53  6   vuoden       vuosi        @PREMOD N NP-First
60  8   vaihteen     vaihde       @NH N NP-Last
69  7   jälkeen      jälkeen      @PREMARK PREP
76  1   SB

Note also that Machinese analysers mark the word boundaries inside the base form of a compound word (like the word bussiyhtiö in the Finnish example above) with the non-breaking space character (U+00A0).
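The Python sketch below shows one way to handle such base forms; the function name compound_parts is invented for this illustration. Be aware that Python's argument-less str.split() treats U+00A0 as whitespace, so splitting a whole output line on whitespace would also cut a compound base form apart. Splitting explicitly on the non-breaking space avoids that.

```python
NBSP = "\u00a0"  # non-breaking space, the compound boundary marker


def compound_parts(base_form):
    """Split a base form into its compound parts (one part if not a compound)."""
    return base_form.split(NBSP)


base = "bussi" + NBSP + "yhtiö"
print(compound_parts(base))      # the two parts of the compound
print(compound_parts("määrä"))   # a simple word stays in one piece
print(base.replace(NBSP, ""))    # the base form written as a single word
```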

Part of Speech

Part of speech (PoS), or word class, describes the role a word has in the phrase or sentence. As many words can occur in more than one PoS category, Machinese analysers use the context to determine which role the word has in the sentence.

All Machinese output formats will show the part-of-speech information, although there are small differences in the display properties.

For example, the following Machinese Phrase Tagger analysis of the sentence 'I saw him walking with a saw.' shows that the analyser distinguishes the two different senses of the word 'saw':

0   1  I        i     @NH PRON
2   3  saw      see   @MAIN V IND PAST
6   3  him      he    @NH PRON
10  7  walking  walk  @MAIN V PCP PROG
18  4  with     with  @PREMARK PREP
23  1  a        a     @PREMOD DET
25  3  saw      saw   @NH N NP-Single
28  1  SB
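The disambiguated part of speech can be read from the tag list. The Python sketch below assumes, based only on the examples in this manual, that the PoS tag is the first tag that does not start with "@"; this is an observation, not a documented rule, and the function name part_of_speech is invented for this illustration. The rows are copied from the analysis above as (token, base form, tags) triples.

```python
def part_of_speech(tags):
    """Return the first tag that is not a syntactic function tag (@...)."""
    for tag in tags.split():
        if not tag.startswith("@"):
            return tag
    return None


# (token, base form, tags) rows taken from the analysis above.
rows = [
    ("saw", "see", "@MAIN V IND PAST"),
    ("him", "he", "@NH PRON"),
    ("walking", "walk", "@MAIN V PCP PROG"),
    ("with", "with", "@PREMARK PREP"),
    ("saw", "saw", "@NH N NP-Single"),
]

# The two readings of the surface form "saw": a verb and a noun.
readings = [(base, part_of_speech(tags))
            for token, base, tags in rows if token == "saw"]
print(readings)
```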

For a more detailed description of PoS analysis in Machinese analysers, please see the Machinese Language Model manual.

Morphology

Morphological information describes the details of the word forms used in the text.

Please note that Machinese Phrase Tagger shows only a limited number of morphological features, while Machinese Syntax offers a much more detailed analysis. For example, Machinese Phrase Tagger gives the following analysis in the prose output mode for the French sentence 'C'est la vie.':

0   2  C'   ce    @NH PRON
2   3  est  être  @MAIN V IND PRES
6   2  la   la    @PREMOD DET
9   3  vie  vie   @NH N NP-Single
12  1  SB

Machinese Syntax, by contrast, provides the following analysis, which includes many morphological features that Machinese Phrase Tagger omits:

1  C'   ce    subj>2  @NH PRON Dem MSC SG
2  est  être  main>0  @MAIN V IND PRES SG P3
3  la   la    det>4   @PREMOD DET Art Def FEM SG
4  vie  vie   comp>2  @NH N FEM SG
<s> <s>

A detailed description of these analyses can be found in the Machinese Language Model manual.

Syntax

Whereas part of speech and morphology give information on individual words, syntax describes the relations between words within phrases or sentences. Machinese Phrase Tagger tells which syntactic function each word has. Machinese Syntax gives a more detailed description: it names each syntactic relation and shows which words the relation connects.

If you compare the two analyses shown in the Morphology section above, you can see that the fourth field of the Machinese Syntax output gives the abbreviated name of the syntactic relation and the number of the word that the relation points to.
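The relation field can be taken apart with a few lines of code. The Python sketch below parses fields such as "subj>2" into a relation name and a head word number. It assumes that words are numbered from 1 in sentence order and that the number 0 means the word has no head; both are inferred from the example above rather than from a format specification, and the function name parse_relation is invented for this illustration.

```python
import re

# Relation name, then ">", then the number of the word the relation points to.
RELATION = re.compile(r"^(?P<name>[^>]+)>(?P<head>\d+)$")


def parse_relation(field):
    """Split a field such as 'subj>2' into ('subj', 2)."""
    match = RELATION.match(field)
    if match is None:
        raise ValueError(f"not a relation field: {field!r}")
    return match.group("name"), int(match.group("head"))


# Words and relation fields from the analysis of "C'est la vie."
words = ["C'", "est", "la", "vie"]
relations = ["subj>2", "main>0", "det>4", "comp>2"]

for word, field in zip(words, relations):
    name, head = parse_relation(field)
    # ASSUMPTION: word numbers start at 1, and 0 means "no head".
    target = words[head - 1] if head > 0 else "(no head)"
    print(f"{word} --{name}--> {target}")
```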

A detailed description of the syntactic analyses that Machinese analysers produce can be found in the Machinese Language Model manual.

Noun Phrases

In information retrieval applications, the interesting part of the text is usually nouns or noun phrases, as these describe which subjects and objects the text is about. Single nouns are obviously easy to pick from text (see information on part of speech), but often noun phrases tell more as they combine the information of adjacent words. Machinese Phrase Tagger makes finding noun phrases easy as it includes noun phrase detection, which marks where a noun phrase starts, ends and which words in between belong to that noun phrase.

Take, for example, the sentence 'The bloody failure of the police state run by Ceausescu could have become a drawn-out civil war.' When it is analysed with Machinese Phrase Tagger, the default output looks like the following:

0   3  The        the        @PREMOD DET
4   6  bloody     bloody     @PREMOD A NP-First
11  7  failure    failure    @NH N NP-Internal
19  2  of         of         @POSTMOD PREP NP-Internal
22  3  the        the        @PREMOD DET NP-Internal
26  6  police     police     @PREMOD N PL NP-Internal
33  5  state      state      @NH N NP-Last
39  3  run        run        @MAIN V PCP PERF
43  2  by         by         @PREMARK PREP
46  9  Ceausescu  Ceausescu  @NH N Prop NP-Single
56  5  could      could      @AUX V IND PRES
62  4  have       have       @AUX V INF
67  6  become     become     @MAIN V PCP PERF
74  1  a          a          @PREMOD DET
77  9  drawn-out  drawn out  @PREMOD A NP-First
87  5  civil      civil      @PREMOD A NP-Internal
93  3  war        war        @NH N NP-Last
96  1  SB

In the output, you can see the following tags, which denote parts of noun phrase constructs:

NP-Single
a single noun.
NP-First
first word of multi-word noun phrase
NP-Last
last word of multi-word noun phrase
NP-Internal
if a noun phrase is longer than two words, the words between the first and the last word of the noun phrase are marked with this tag.

Following these tags, you can extract the three noun phrases of the above sentence: 'bloody failure of the police state', 'Ceausescu', and 'drawn-out civil war'.
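The extraction just described can be sketched in Python as follows. The rows are the (token, tags) pairs of the example sentence; the function name noun_phrases is invented for this illustration, and the phrase words are simply joined with single spaces rather than cut from the input text by offset.

```python
def noun_phrases(rows):
    """Collect noun phrases by following the NP-* tags."""
    phrases, current = [], None
    for token, tags in rows:
        tagset = tags.split()
        if "NP-Single" in tagset:
            phrases.append(token)
            current = None
        elif "NP-First" in tagset:
            current = [token]               # open a multi-word noun phrase
        elif current is not None and "NP-Internal" in tagset:
            current.append(token)
        elif current is not None and "NP-Last" in tagset:
            current.append(token)
            phrases.append(" ".join(current))
            current = None                  # the noun phrase is complete
        else:
            current = None                  # token outside any noun phrase
    return phrases


rows = [
    ("The", "@PREMOD DET"),
    ("bloody", "@PREMOD A NP-First"),
    ("failure", "@NH N NP-Internal"),
    ("of", "@POSTMOD PREP NP-Internal"),
    ("the", "@PREMOD DET NP-Internal"),
    ("police", "@PREMOD N PL NP-Internal"),
    ("state", "@NH N NP-Last"),
    ("run", "@MAIN V PCP PERF"),
    ("by", "@PREMARK PREP"),
    ("Ceausescu", "@NH N Prop NP-Single"),
    ("could", "@AUX V IND PRES"),
    ("have", "@AUX V INF"),
    ("become", "@MAIN V PCP PERF"),
    ("a", "@PREMOD DET"),
    ("drawn-out", "@PREMOD A NP-First"),
    ("civil", "@PREMOD A NP-Internal"),
    ("war", "@NH N NP-Last"),
]

for phrase in noun_phrases(rows):
    print(phrase)
```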

Proper Noun Detection

Often it is desirable to find names in a text. The solution is to use the information from Machinese Phrase Tagger's proper noun detection. The output formats include the proper noun information alongside the morphological information. The tag used to mark proper nouns is Prop; in the prose output, the same information is given with the phrase "proper noun". When combined with the noun phrase information, it is possible to pick up multiword names, for example pairs of first and last names.
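One possible way to combine the two kinds of information is sketched below in Python. It joins consecutive tokens tagged Prop into one name and closes the name at the end of a noun phrase, so that names in separate noun phrases are not merged. This grouping rule and the function name proper_names are assumptions made for this illustration, not a documented Machinese feature. The rows are taken from the "Isn't Dr. Spock here?" analysis earlier in this manual.

```python
def proper_names(rows):
    """Join runs of Prop-tagged tokens, closing each run at the end of its NP."""
    names, run = [], []
    for token, tags in rows:
        tagset = tags.split()
        if "Prop" in tagset:
            run.append(token)
        # Close the run at a non-name token or at the end of a noun phrase.
        if run and ("Prop" not in tagset
                    or "NP-Last" in tagset
                    or "NP-Single" in tagset):
            names.append(" ".join(run))
            run = []
    if run:
        names.append(" ".join(run))
    return names


# (token, tags) rows from the analysis of "Isn't Dr. Spock here?"
rows = [
    ("Is", "@MAIN V IND PRES"),
    ("n't", "@ADVL ADV"),
    ("Dr.", "@NH N Abbr NP-Single"),
    ("Spock", "@NH N Prop NP-Single"),
    ("here", "@ADVL ADV"),
]
print(proper_names(rows))
```

Note that the abbreviation "Dr." is not tagged Prop in this analysis, so it is not included in the name.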