How to Read Machinese Analysers' Output

The following sections of this manual explain how to read the output of Machinese analysers. The examples use the command line programs, but for the most part you can achieve the same results using the Machinese COM or library application programming interfaces.

Tokenisation

The most basic information Machinese analysers return is the boundaries of words and sentences. The tokenisation rules may divide some words of the input text into two tokens: for example, "isn't" is analysed as two tokens, "is" and "n't". Sentence breaks are marked with special tags.

In Machinese Phrase Tagger, sentence boundaries are marked with the tag SB, or with the self-explanatory phrase "sentence boundary" in the prose output format.

Machinese Syntax marks sentence boundaries with the tag <s>. It also detects paragraph ends and marks them with the tag <p>.

For example, Machinese Phrase Tagger gives the following analysis for the sentence "Isn't Dr. Spock here?":

0	2	Is	be	@MAIN	V IND PRES
2	3	n't	not	@ADVL	ADV
6	3	Dr.	dr.	@NH	N Abbr NP-Single
10	5	Spock	Spock	@NH	N Prop NP-Single
16	4	here	here	@ADVL	ADV
20	1	?	?	SB

The first two columns in this default output format show the start position of the token and its length in characters. From this information, you can see that the first two tokens are written together as a single word in the input text: the first token starts at position 0 and is 2 characters long, so it ends exactly where the second token starts, at position 2. You can also see that the dot in "Dr." does not mark the end of the sentence, whereas the question mark does, as it is marked with the tag SB.
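The reading described above can be sketched in a few lines of Python. This is a minimal sketch, not part of the Machinese distribution: it assumes the default output has been captured as tab-separated lines, and the names Token, parse_line and split_sentences are made up for illustration.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    start: int        # offset of the token in the input text
    length: int       # token length in characters
    text: str         # token as written in the input
    base: str         # base form
    tags: List[str]   # remaining columns, split into individual tags


def parse_line(line: str) -> Token:
    """Parse one tab-separated line of the default output format."""
    cols = line.rstrip("\n").split("\t")
    tags = " ".join(cols[4:]).split()
    return Token(int(cols[0]), int(cols[1]), cols[2], cols[3], tags)


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Close a sentence after every token that carries the SB tag."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if "SB" in tok.tags:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# The analysis of "Isn't Dr. Spock here?" shown above.
SAMPLE = [
    "0\t2\tIs\tbe\t@MAIN\tV IND PRES",
    "2\t3\tn't\tnot\t@ADVL\tADV",
    "6\t3\tDr.\tdr.\t@NH\tN Abbr NP-Single",
    "10\t5\tSpock\tSpock\t@NH\tN Prop NP-Single",
    "16\t4\there\there\t@ADVL\tADV",
    "20\t1\t?\t?\tSB",
]

tokens = [parse_line(line) for line in SAMPLE]

# Two tokens are adjacent in the input when one ends where the next starts.
adjacent = [
    (a.text, b.text)
    for a, b in zip(tokens, tokens[1:])
    if a.start + a.length == b.start
]
print(adjacent)                      # [('Is', "n't"), ('here', '?')]
print(len(split_sentences(tokens)))  # 1
```

Note that the adjacency test also reports the pair "here" and "?", because punctuation is written directly after the preceding word.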

Base Forms

In most languages, words are inflected into various forms to build sensible phrases. Many applications, however, need to find and match all occurrences of the same word even when the inflected forms in the text look quite different. Take, for example, irregular verbs like "be", which can also appear as "is", "was" and "were". Machinese analysers return the base form (also known as the lemma) of every word.

All Machinese output formats show the base form, although the way it is displayed differs slightly between formats. In the default output format of Machinese Phrase Tagger, the base form is shown in the fourth column, as in the following analysis of a Finnish sentence:

0	5	Uuden	uusi	@PREMOD	A NP-First
6	11	bussiyhtiön	bussi yhtiö	@NH	N NP-Last
18	2	on	olla	@MAIN	V IND PRES
21	5	määrä	määrä	@NH	N NP-Single
27	8	aloittaa	alkaa	@MAIN	V INF
36	11	toimintansa	toiminta	@NH	N PL NP-Single
48	4	heti	heti	@ADVL	ADV
53	6	vuoden	vuosi	@PREMOD	N NP-First
60	8	vaihteen	vaihde	@NH	N NP-Last
69	7	jälkeen	jälkeen	@PREMARK	PREP
76	1	.	.	SB

Note also that Machinese analysers mark the boundaries between the parts of a compound word (such as bussiyhtiö in the Finnish example above) in the base form with the non-breaking space character (U+00A0).
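As an illustration, the following Python sketch matches tokens by base form and splits a compound base form at the non-breaking space. It assumes tab-separated default output, and the helper names read_rows, find_by_base and compound_parts are hypothetical. The separator is written as the escape \u00a0 in the sample data, because a listing may render it like an ordinary space.

```python
NBSP = "\u00a0"  # separator between the parts of a compound base form


def read_rows(lines):
    """Return (start, length, token, base form) for each output line."""
    rows = []
    for line in lines:
        cols = line.split("\t")
        rows.append((int(cols[0]), int(cols[1]), cols[2], cols[3]))
    return rows


def find_by_base(rows, lemma):
    """Return the surface forms whose base form equals the given lemma."""
    return [token for _, _, token, base in rows if base == lemma]


def compound_parts(base):
    """Split a compound base form into its parts at U+00A0."""
    return base.split(NBSP)


# A few lines of the Finnish analysis shown above.
SAMPLE = [
    "0\t5\tUuden\tuusi\t@PREMOD\tA NP-First",
    "6\t11\tbussiyhtiön\tbussi\u00a0yhtiö\t@NH\tN NP-Last",
    "18\t2\ton\tolla\t@MAIN\tV IND PRES",
    "53\t6\tvuoden\tvuosi\t@PREMOD\tN NP-First",
]

rows = read_rows(SAMPLE)
print(find_by_base(rows, "olla"))   # ['on']
print(compound_parts(rows[1][3]))   # ['bussi', 'yhtiö']
```

Splitting the columns on the tab character only, rather than on any whitespace, keeps the non-breaking space inside the base form intact.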

Part of Speech

Part of speech (PoS), or word class, information describes what role a word has in a phrase or sentence. Because many words can occur in more than one PoS category, Machinese analysers use the context to determine which role a word has in the sentence.

All Machinese output formats show the part-of-speech information, although the way it is displayed differs slightly between formats.

For example, the following Machinese Phrase Tagger analysis of the sentence 'I saw him walking with a saw.' demonstrates that the analyser detects the two different senses of the word 'saw':

0	1	I	I	@NH	PRON
2	3	saw	see	@MAIN	V IND PAST
6	3	him	he	@NH	PRON
10	7	walking	walk	@MAIN	V PCP PROG
18	4	with	with	@PREMARK	PREP
23	1	a	a	@PREMOD	DET
25	3	saw	saw	@NH	N NP-Single
28	1	.	.	SB

For a more detailed description of PoS analysis in Machinese analysers, please see the Machinese Language Model manual.
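The two readings of 'saw' can be picked out programmatically. The Python sketch below assumes tab-separated default output and, judging only from the examples in this manual, that the first tag in the last column is the PoS tag. The function name pos_readings is hypothetical.

```python
def pos_readings(lines, surface):
    """Return (base form, PoS tag) for every occurrence of a surface form.

    Assumption: word lines have six columns and the first tag of the
    last column is the PoS tag, as in the examples of this manual.
    """
    readings = []
    for line in lines:
        cols = line.split("\t")
        if len(cols) < 6:
            continue  # e.g. the sentence boundary line, which has no PoS
        token, base, tags = cols[2], cols[3], cols[5].split()
        if token.lower() == surface.lower():
            readings.append((base, tags[0]))
    return readings


# The analysis of 'I saw him walking with a saw.' shown above.
SAMPLE = [
    "0\t1\tI\tI\t@NH\tPRON",
    "2\t3\tsaw\tsee\t@MAIN\tV IND PAST",
    "6\t3\thim\the\t@NH\tPRON",
    "10\t7\twalking\twalk\t@MAIN\tV PCP PROG",
    "18\t4\twith\twith\t@PREMARK\tPREP",
    "23\t1\ta\ta\t@PREMOD\tDET",
    "25\t3\tsaw\tsaw\t@NH\tN NP-Single",
    "28\t1\t.\t.\tSB",
]

print(pos_readings(SAMPLE, "saw"))  # [('see', 'V'), ('saw', 'N')]
```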

Morphology

Morphological information describes the details of the word forms used in the text.

Please note that Machinese Phrase Tagger shows only a limited number of morphological features, while Machinese Syntax offers a much more detailed analysis. For example, Machinese Phrase Tagger gives the following analysis in the default output format for the French sentence 'C'est la vie.':

0	2	C'	ce	@NH	PRON
2	3	est	être	@MAIN	V IND PRES
6	2	la	la	@PREMOD	DET
9	3	vie	vie	@NH	N NP-Single
12	1	.	.	SB

Machinese Syntax, by contrast, provides the following analysis, which includes many morphological features that Machinese Phrase Tagger omits:

1	C'	ce	subj>2	@NH PRON Dem MSC SG
2	est	être	main>0	@MAIN V IND PRES SG P3
3	la	la	det>4	@PREMOD DET Art Def FEM SG
4	vie	vie	comp>2	@NH N FEM SG
5	.	.
6	<s>	<s>

A detailed description of these analyses can be found in the Machinese Language Model manual.

Syntax

Whereas part of speech and morphology give information on individual words, syntax describes the relations between words within phrases or sentences. Machinese Phrase Tagger tells what syntactic function each word has. Machinese Syntax gives a more detailed description: it names each syntactic relation and shows which words the relation connects.

If you compare the analyses shown in the Morphology section above, you can see that the fourth field in the Machinese Syntax output lists the abbreviated name of the syntactic relation and the number of the word the relation points to. For example, the field det>4 on the word "la" says that the relation det points to word 4, "vie".
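The fourth field can be unpacked as in the following Python sketch. It assumes tab-separated Machinese Syntax output as shown in the Morphology section, and it treats a head number with no matching word (0 in the sample) as "no governing word", which is an assumption based on that sample only. The names parse_syntax and relations are hypothetical.

```python
def parse_syntax(lines):
    """Map each word number to its text, base form, relation, head and tags."""
    words = {}
    for line in lines:
        cols = line.split("\t")
        entry = {"text": cols[1], "base": cols[2],
                 "relation": None, "head": None, "tags": []}
        # Fourth field: abbreviated relation name, '>', number of the head word.
        if len(cols) > 3 and ">" in cols[3]:
            name, _, target = cols[3].partition(">")
            entry["relation"], entry["head"] = name, int(target)
        if len(cols) > 4:
            entry["tags"] = cols[4].split()
        words[int(cols[0])] = entry
    return words


def relations(words):
    """Return (word, relation, head word) triples; head is None if absent."""
    triples = []
    for number in sorted(words):
        word = words[number]
        if word["relation"] is None:
            continue
        head = words.get(word["head"])
        triples.append((word["text"], word["relation"],
                        head["text"] if head else None))
    return triples


# The Machinese Syntax analysis of 'C'est la vie.' shown above.
SAMPLE = [
    "1\tC'\tce\tsubj>2\t@NH PRON Dem MSC SG",
    "2\test\têtre\tmain>0\t@MAIN V IND PRES SG P3",
    "3\tla\tla\tdet>4\t@PREMOD DET Art Def FEM SG",
    "4\tvie\tvie\tcomp>2\t@NH N FEM SG",
    "5\t.\t.",
    "6\t<s>\t<s>",
]

for triple in relations(parse_syntax(SAMPLE)):
    print(triple)
# ("C'", 'subj', 'est')
# ('est', 'main', None)
# ('la', 'det', 'vie')
# ('vie', 'comp', 'est')
```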

A detailed description of the syntactic analyses that Machinese analysers produce can be found in the Machinese Language Model manual.

Noun Phrases

In information retrieval applications, the interesting parts of a text are usually its nouns and noun phrases, as these describe the subjects and objects the text is about. Single nouns are easy to pick from a text (see the Part of Speech section), but noun phrases often tell more, as they combine the information of adjacent words. Machinese Phrase Tagger makes finding noun phrases easy: its noun phrase detection marks where a noun phrase starts, where it ends, and which words in between belong to it.

Take, for example, the sentence 'The bloody failure of the police state run by Ceausescu could have become a drawn-out civil war.' When it is analysed with Machinese Phrase Tagger, the default output format looks like the following:

0	3	The	the	@PREMOD	DET
4	6	bloody	bloody	@PREMOD	A NP-First
11	7	failure	failure	@NH	N NP-Internal
19	2	of	of	@POSTMOD	PREP NP-Internal
22	3	the	the	@PREMOD	DET NP-Internal
26	6	police	police	@PREMOD	N PL NP-Internal
33	5	state	state	@NH	N NP-Last
39	3	run	run	@MAIN	V PCP PERF
43	2	by	by	@PREMARK	PREP
46	9	Ceausescu	Ceausescu	@NH	N Prop NP-Single
56	5	could	could	@AUX	V IND PRES
62	4	have	have	@AUX	V INF
67	6	become	become	@MAIN	V PCP PERF
74	1	a	a	@PREMOD	DET
77	9	drawn-out	drawn out	@PREMOD	A NP-First
87	5	civil	civil	@PREMOD	A NP-Internal
93	3	war	war	@NH	N NP-Last
96	1	.	.	SB

In the output, you can see the following tags, which denote parts of noun phrase constructs:

NP-Single: a single noun
NP-First: the first word of a multi-word noun phrase
NP-Last: the last word of a multi-word noun phrase
NP-Internal: if a noun phrase has three or more words, the words between its first and last word are marked with this tag

By following these tags, you can extract the three noun phrases of the sentence above: 'bloody failure of the police state', 'Ceausescu', and 'drawn-out civil war'.
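The extraction can be sketched in Python as follows. The sketch assumes tab-separated default output with the noun phrase tags in the last column, and it joins the words of a phrase with single spaces instead of slicing the original text by offsets. The function name noun_phrases is hypothetical.

```python
def noun_phrases(lines):
    """Collect noun phrases by following the NP-* tags in the last column."""
    phrases, current = [], []
    for line in lines:
        cols = line.split("\t")
        tags = cols[-1].split() if len(cols) > 4 else []
        if "NP-Single" in tags:
            phrases.append(cols[2])
        elif "NP-First" in tags:
            current = [cols[2]]          # open a new multi-word phrase
        elif "NP-Internal" in tags and current:
            current.append(cols[2])
        elif "NP-Last" in tags and current:
            current.append(cols[2])      # close the phrase
            phrases.append(" ".join(current))
            current = []
    return phrases


# The analysis of the example sentence shown above.
SAMPLE = [
    "0\t3\tThe\tthe\t@PREMOD\tDET",
    "4\t6\tbloody\tbloody\t@PREMOD\tA NP-First",
    "11\t7\tfailure\tfailure\t@NH\tN NP-Internal",
    "19\t2\tof\tof\t@POSTMOD\tPREP NP-Internal",
    "22\t3\tthe\tthe\t@PREMOD\tDET NP-Internal",
    "26\t6\tpolice\tpolice\t@PREMOD\tN PL NP-Internal",
    "33\t5\tstate\tstate\t@NH\tN NP-Last",
    "39\t3\trun\trun\t@MAIN\tV PCP PERF",
    "43\t2\tby\tby\t@PREMARK\tPREP",
    "46\t9\tCeausescu\tCeausescu\t@NH\tN Prop NP-Single",
    "56\t5\tcould\tcould\t@AUX\tV IND PRES",
    "62\t4\thave\thave\t@AUX\tV INF",
    "67\t6\tbecome\tbecome\t@MAIN\tV PCP PERF",
    "74\t1\ta\ta\t@PREMOD\tDET",
    "77\t9\tdrawn-out\tdrawn out\t@PREMOD\tA NP-First",
    "87\t5\tcivil\tcivil\t@PREMOD\tA NP-Internal",
    "93\t3\twar\twar\t@NH\tN NP-Last",
    "96\t1\t.\t.\tSB",
]

print(noun_phrases(SAMPLE))
# ['bloody failure of the police state', 'Ceausescu', 'drawn-out civil war']
```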

Proper Noun Detection

It is often desirable to find names in a text. The solution is to use the information from Machinese Phrase Tagger's proper noun detection. The output formats include the proper noun information alongside the morphological information. Proper nouns are marked with the tag Prop, which the prose output describes as "proper noun". Combined with the noun phrase information, this makes it possible to pick up multi-word names, for example pairs of first and last names.
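One possible way to combine the two kinds of information is sketched below in Python. It assumes tab-separated default output and keeps a noun phrase as a name only when every word in it carries the tag Prop. This rule is an illustrative assumption rather than documented analyser behaviour, and the function name proper_names is hypothetical.

```python
def proper_names(lines):
    """Return noun phrases in which every word is tagged Prop."""
    names, current = [], []
    for line in lines:
        cols = line.split("\t")
        tags = cols[-1].split() if len(cols) > 4 else []
        is_prop = "Prop" in tags
        if "NP-Single" in tags:
            if is_prop:
                names.append(cols[2])
            current = []
        elif "NP-First" in tags:
            current = [(cols[2], is_prop)]
        elif ("NP-Internal" in tags or "NP-Last" in tags) and current:
            current.append((cols[2], is_prop))
            if "NP-Last" in tags:
                # Keep the phrase only if all of its words are proper nouns.
                if all(prop for _, prop in current):
                    names.append(" ".join(text for text, _ in current))
                current = []
    return names


# The analysis of "Isn't Dr. Spock here?" from the Tokenisation section.
SAMPLE = [
    "0\t2\tIs\tbe\t@MAIN\tV IND PRES",
    "2\t3\tn't\tnot\t@ADVL\tADV",
    "6\t3\tDr.\tdr.\t@NH\tN Abbr NP-Single",
    "10\t5\tSpock\tSpock\t@NH\tN Prop NP-Single",
    "16\t4\there\there\t@ADVL\tADV",
    "20\t1\t?\t?\tSB",
]

print(proper_names(SAMPLE))  # ['Spock']
```

Under the same assumption, two consecutive words tagged Prop with NP-First and NP-Last would be returned as one multi-word name.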
