What is a Natural Language Tagger?

Connexor's NLP tagger covers the basic tasks of natural language processing. It performs several of them in a single run to answer the following questions:

  • What is a word? What is a sentence?
  • How to recognize various forms of the same word?
  • What are the meaningful word compounds in the text?
  • What are the attributes of a word?

Recognizing a word is not always as simple a task as one might think. Sometimes a sequence of letters between whitespace characters is not a meaningful unit. Consider, for example, the sequence New York-based. Dividing it into the two tokens New and York-based would not make much sense and would cause trouble in every later stage of processing. This task is called tokenization.
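The New York-based example can be illustrated with a minimal Python sketch. It is not how the Connexor tagger works internally. It only assumes a small, hypothetical list of multiword names (MULTIWORD_NAMES) and uses a regular expression to keep such a name, together with any hyphenated suffix, as a single token.

```python
import re

# Hypothetical list of multiword names, for illustration only.
# A real tagger would rely on a much larger lexicon.
MULTIWORD_NAMES = ["New York", "San Francisco"]


def tokenize(text, names=MULTIWORD_NAMES):
    """Split text into tokens, keeping known multiword names (and any
    hyphenated suffix attached to them) together as one token."""
    # Try longer names first so that they win over shorter alternatives.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(
        r"(?:%s)\b(?:-\w+)*"   # multiword name, optionally followed by -suffix
        r"|\w+(?:-\w+)*"       # ordinary word, possibly hyphenated
        r"|[^\w\s]"            # any single punctuation character
        % alternatives
    )
    return pattern.findall(text)


print(tokenize("A New York-based firm."))
# ['A', 'New York-based', 'firm', '.']
```

Splitting the same text on whitespace would instead yield New and York-based as separate tokens, which is exactly the division described above as unhelpful.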

Recognizing a sentence is a small thing that makes life easier in two ways. First, the rest of the analysis works better when a meaningful processing unit has been recognized. Second, higher-level applications built on top of basic NLP perform better.
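As a rough illustration of sentence recognition, the following Python sketch splits text at a full stop, exclamation mark or question mark that is followed by whitespace and an uppercase letter, unless the preceding word is a known abbreviation. The abbreviation list (ABBREVIATIONS) is a hypothetical example, and the heuristic is a simplification rather than the method used by the tagger.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Naive sentence splitter based on punctuation and capitalization."""
    sentences = []
    start = 0
    # Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.start() + 1  # include the punctuation mark itself
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr. Smith" is not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith lives in New York. He works there."))
# ['Dr. Smith lives in New York.', 'He works there.']
```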

In most languages, words may have several different forms. In English the number of forms is quite limited: the word small has the inflected forms smaller and smallest. In Finnish, the corresponding word pieni has some 18,000 different inflected forms. Recognizing these forms is called morphological analysis or lemmatization, depending on whether it is used in a larger natural language parsing system or in a search application.
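For the English example, mapping inflected forms back to a base form can be sketched in a few lines of Python. The lexicon, the irregular-form table and the suffix list below are hypothetical and tiny. A hand-written table like this would not be practical for a language with thousands of forms per word, so the sketch only illustrates what lemmatization produces, not how a production analyzer is built.

```python
# Hypothetical mini-lexicon of base forms, for illustration only.
LEXICON = {"small", "big", "good"}

# Irregular forms that no suffix rule can handle.
IRREGULAR = {"better": "good", "best": "good"}

# Comparative and superlative suffixes; the longer one is tried first.
SUFFIXES = ["est", "er"]


def lemmatize(word):
    """Return the base form of an English adjective, or the lowercased
    word itself when no rule applies."""
    w = word.lower()
    if w in LEXICON:
        return w
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        if w.endswith(suffix):
            stem = w[: -len(suffix)]
            if stem in LEXICON:
                return stem
            # Undo consonant doubling: "bigger" -> "bigg" -> "big".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[:-1] in LEXICON:
                return stem[:-1]
    return w


for form in ["small", "smaller", "smallest", "Bigger", "best"]:
    print(form, "->", lemmatize(form))
```

In a search application, indexing the lemma instead of the surface form lets a query for small also find smaller and smallest.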

When one is creating an intelligent application, it is not about words. It is about things. Or it can be about ontologies and linked data spanning many domains. The NLP tagger should support this by producing a good base analysis that is easy to link to external semantic information. Consider, for instance, word compounds like New York and White House. Taken on their own, the words new and white do not carry any relevant information. The challenge is not limited to proper nouns. It also concerns rarely used expressions borrowed from other languages (like a priori) and ordinary compounds that a reader hardly notices: olive oil is not the same as crude oil. This task is called named entity recognition and noun phrase detection.
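The idea of treating compounds such as White House or olive oil as single units can be sketched as a greedy longest-match lookup over a token list. The compound lexicon (COMPOUNDS) below is a hypothetical stand-in built from the examples in the text. Real named entity recognition and noun phrase detection depend on linguistic analysis and not only on a fixed list, so this is an illustration of the output, not of the tagger's method.

```python
# Hypothetical compound lexicon, for illustration only.
COMPOUNDS = {
    ("new", "york"),
    ("white", "house"),
    ("olive", "oil"),
    ("crude", "oil"),
    ("a", "priori"),
}
MAX_LEN = max(len(compound) for compound in COMPOUNDS)


def merge_compounds(tokens):
    """Merge adjacent tokens that form a known compound into one unit,
    preferring the longest match at each position."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i : i + n])
            if candidate in COMPOUNDS:
                result.append(" ".join(tokens[i : i + n]))
                i += n
                break
        else:
            # No compound starts here: keep the single token.
            result.append(tokens[i])
            i += 1
    return result


print(merge_compounds(["The", "White", "House", "imports", "olive", "oil"]))
# ['The', 'White House', 'imports', 'olive oil']
```

Once merged, each unit can be used as a key into external semantic information, which is harder to do with the separate words white and house.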

Our tagging program is called Machinese Phrase Tagger.
The supported languages are English, German, French, Spanish, Italian, Dutch, Russian, Swedish, Danish, Norwegian and Finnish.