Tokenization, Stemming and Lemmatization

Aditya B
1 min read · Feb 2, 2021

Tokenization is the process of breaking a text into smaller units called tokens, most often words. The split can be made on any character, but the most common approach is to split on whitespace.
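To make this concrete, here is a minimal sketch using only Python's standard library. The sample sentence and the regular expression are illustrative assumptions, not a standard tokenizer; the point is only to show how splitting on whitespace differs from splitting on a slightly richer pattern.

```python
import re

# Sample sentence chosen for illustration only.
text = "Tokenization isn't hard, is it?"

# Splitting on whitespace keeps punctuation attached to the neighbouring word.
whitespace_tokens = text.split()

# An assumed alternative pattern: a run of word characters (optionally with
# one internal apostrophe part), or any single character that is neither a
# word character nor whitespace.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "hard," and "it?" stay as single tokens; the regex version separates the comma and the question mark into tokens of their own.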

Stemming is a crude, rule-based way of chopping off the end of a word to get its base form, and it often includes the removal of derivational affixes. A derivational affix is an affix by means of which one word is formed (derived) from another, for example -ness in "happiness". The derived word is often of a different word class from the original. The most common algorithm used for this purpose is Porter’s algorithm.
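The rule-based idea can be sketched with a toy suffix stripper. This is not Porter's algorithm, which applies several ordered phases of rules with extra conditions; the function name `toy_stem`, the suffix list and the minimum stem length of three are all assumptions made for illustration.

```python
def toy_stem(word: str) -> str:
    """Strip the first matching suffix from a small, hand-picked rule list.

    A deliberately crude illustration, not Porter's algorithm.
    """
    # Longer suffixes come first so they are tried before shorter ones.
    suffixes = ("ization", "ational", "ness", "ing", "ies", "ed", "s")
    word = word.lower()
    for suffix in suffixes:
        # Require at least three characters to remain so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for sample in ["bats", "running", "ponies", "happiness"]:
    print(sample, "->", toy_stem(sample))
```

Here "bats" becomes "bat", but "running" becomes "runn" and "ponies" becomes "pon", which shows how blind rule application can produce stems that are not real words.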

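Lemmatization uses a vocabulary and morphological analysis of the word, and normally aims to remove inflectional endings only, returning the dictionary form of the word (the lemma). An inflectional ending is a group of letters added to the end of a word to change its grammatical form, such as number or tense, without changing its core meaning or word class. For example, the ending -s turns "bat" into "bats".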
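The vocabulary-driven side of lemmatization can be sketched as a simple lookup. The `LEMMAS` table and the `toy_lemmatize` function below are hypothetical and tiny; a real lemmatizer relies on a full dictionary and usually on part-of-speech information as well.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemmas.
LEMMAS = {
    "bats": "bat",
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "ponies": "pony",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if the word is known, else the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)


for sample in ["bats", "running", "mice", "ponies"]:
    print(sample, "->", toy_lemmatize(sample))
```

Because the result comes from a vocabulary rather than from chopping characters, irregular forms such as "mice" and "ran" map to real words ("mouse", "run"), and unknown words are returned unchanged apart from lowercasing.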

Since stemming works from a fixed set of rules, the root it returns might not always be a valid English word. Lemmatization, on the other hand, reduces inflected words properly, ensuring that the root it returns is a valid English word.
