Connexor Machinese Tokenizer

Machinese Tokenizer is a set of program components that performs basic text analysis tasks at very high speed and provides relevant information about words to volume-intensive applications.

Machinese Tokenizer splits raw text into word units and provides the possible base forms and word classes for each word. To keep the analysis fast, it does not disambiguate between different word senses. If your application needs this functionality, please look at the other Machinese analysers, which provide more sophisticated linguistic analysis. The simple analysis approach of Machinese Tokenizer should still suffice for many applications, for example to help search engines match inflected word forms as well.

For example, the sentence "This is a test." is analysed in the following way:

0       4       This    this    PRON
5       2       is      be      V
8       1       a       a       DET
10      4       test    test    N               test    V

The first column shows the token's position in the text (a character offset, counted from 0), the second gives the token's length in characters, and the third gives the token as it appears in the text. The remaining columns list the base form(s), each followed by a tag denoting the part of speech (PRON = pronoun, V = verb, DET = determiner, N = noun). If the word has multiple senses, as "test" does here (noun or verb), the analysis includes one base form and part-of-speech pair for each of them.

Users can also customize this output to omit columns whose information is not relevant to the application that uses the analysis.
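
To illustrate how an application might consume the default output shown above, here is a minimal Python sketch that parses each line into a small record. It is not part of Machinese Tokenizer: the names Token and parse_line are hypothetical, and the sketch assumes that columns are separated by whitespace, that tokens contain no spaces, and that no columns have been omitted through customization.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """One line of analysis output (hypothetical record type, not a Connexor API)."""
    offset: int                      # character position of the token in the text
    length: int                      # token length in characters
    text: str                        # the token as it appears in the text
    readings: List[Tuple[str, str]]  # (base form, part-of-speech tag) pairs


def parse_line(line: str) -> Token:
    """Parse one output line, assuming whitespace-separated columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns: {line!r}")
    rest = fields[3:]
    if len(rest) % 2 != 0:
        raise ValueError(f"unpaired base form / tag column: {line!r}")
    # Group the remaining columns two by two: base form, then its tag.
    readings = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return Token(int(fields[0]), int(fields[1]), fields[2], readings)


# The sample analysis of "This is a test." from the text above.
SAMPLE = """\
0       4       This    this    PRON
5       2       is      be      V
8       1       a       a       DET
10      4       test    test    N               test    V
"""

tokens = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

sentence = "This is a test."
for token in tokens:
    # The offset and length columns point back into the original text.
    surface = sentence[token.offset:token.offset + token.length]
    print(surface, token.readings)
```

Because the position and length columns refer back to the original text, an application can use them to highlight or replace tokens in place. A search index could store the base forms from the readings instead of the surface forms so that inflected forms match as well.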