As discussed in the previous articles, the first step towards vectorization (converting text to numbers) is tokenization. So what is the next step?
After splitting sentences into words (tokenization), we want to reduce the words to their base or root form. This is what stemming means: reducing related forms of a word to a common “stem” or “root”.
To understand this concept better, think of a plant. A plant has a stem, leaves, flowers, etc. All the leaves are connected and flourish from the stem. The stem is the backbone of the plant and supports the various leaves and flowers. Eventually, all leaves can be traced back to the stem.
In a similar manner, there is a “stem” word that forms the basis for other, derived words. These derived words may have different usage and spellings, but they can all be traced back to the stem. The process of converting words into their “stem” form is known as stemming.
For example, consider the words “computer”, “computerization” and “computerize”. These words have different spellings and usage but can all be traced back to the same “stem”, which is “compute”. In each case a suffix (“r”, “rization” or “rize”) has been chopped off.
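The idea of tracing several words back to one stem can be sketched in a few lines of Python. This is only an illustration: the suffix list and the `toy_stem` function are hypothetical and hand-picked for these three words, and they are not how any real stemmer is implemented. Because this crude version only deletes characters, it lands on “comput” rather than “compute”.

```python
# Hypothetical suffix list, chosen by hand for this example only.
SUFFIXES = ("ization", "ize", "er")


def toy_stem(word, min_stem_len=3):
    """Repeatedly strip known suffixes until none of them applies."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            # Only strip if enough of the word would remain afterwards.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


words = ["computer", "computerization", "computerize"]
stems = {word: toy_stem(word) for word in words}
print(stems)
```

All three words end up with the same stem, which is the whole point of stemming: different surface forms are treated as one item.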
Now that we have a good understanding and intuition of what we mean by stemming, let us find the answer to a very logical question “why do we need stemming?”
For the computer to work with text, we convert the text into numbers. This is the process of vectorization. In a simple representation, every distinct word in the corpus becomes one dimension (column) of the resulting matrix. We stem words so that different forms of the same word, such as “compute”, “computes” and “computing”, collapse into a single entry. The fewer distinct words there are, the fewer dimensions the matrix we feed to the model has.
If the matrix is large, i.e. it has a column for each of a large number of distinct words, we run into a classical machine learning problem called the “curse of dimensionality”. To mitigate it, we want to keep the vector representation of the text as compact as possible. Stemming and lemmatization help us achieve this.
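To make the dimensionality argument concrete, the sketch below builds a vocabulary with and without stemming and compares the sizes. The helpers `build_vocabulary` and `crude_stem` are hypothetical names introduced for illustration, and `crude_stem` is a deliberately simplistic stand-in; in practice the `stem` method of a real stemmer would be passed in instead.

```python
def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the word.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_vocabulary(tokens, stem=None):
    """Return the sorted list of distinct tokens, optionally stemmed first."""
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    return sorted(set(tokens))


tokens = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]

raw_vocab = build_vocabulary(tokens)
stemmed_vocab = build_vocabulary(tokens, stem=crude_stem)

print(len(raw_vocab), raw_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```

Each distinct vocabulary entry corresponds to one column of the matrix, so going from 7 entries to 2 shrinks the matrix accordingly.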
Now, let’s look at how we can practically perform stemming on text data.
Stemming using the NLTK library
The NLTK library provides a convenient way for us to implement stemming. We will be covering two stemmers here. Let’s get into it.
1. Porter stemmer
The Porter stemmer is one of the oldest and simplest stemmers; the algorithm was published in 1980. It is rarely the first choice in production today, but it is a good stemmer for beginners to experiment with.
I suggest you now open a new Colab notebook or any IDE you prefer to code in.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
In the above code snippet, we are first importing the PorterStemmer from the NLTK library. In the next line, we are initializing the stemmer.
Now, let’s stem some words!
stemmer.stem("computer")
Output: “comput”
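As we can observe, the stemmer chops off the “er” at the end of the word “computer”, leaving “comput”. The Porter stemmer works by stripping suffixes according to a set of rules; it does not check whether the result is a dictionary word.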
stemmer.stem("cats")
Output: “cat”
For the above code, the word “cats” has been converted to “cat” by chopping off “s”. It converted the word from plural to singular.
stemmer.stem("traditional")
Output: “tradit”
The above code shows that the output of a stemmer can be an invalid word. By chopping off the suffix “ional”, the stemmer has converted “traditional” to “tradit”, which is not an English word. This is one of the drawbacks of using stemmers.
Now let us take a more involved example. Here we will run the PorterStemmer over a list of plural and otherwise inflected words (past tenses, “-ing” forms and so on) and reduce each of them to its stem.
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned',
           'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization',
           'traditional', 'reference', 'colonizer', 'plotted', 'having',
           'generously']

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
In the above code snippet, we first define a list of words and assign it to a variable called “plurals”. Next, we import and initialize the stemmer.
Then, using a list comprehension, we iterate over the list and apply the stemmer to each word.
Let’s check the output.
Output: caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener
As we can see in the output, many of the stems are not valid English words. For example, we would expect “flies” to become “fly”, but it is instead reduced to “fli”. The stemmer applies suffix-stripping rules and does not check the result against a dictionary.
2. Snowball stemmer
The Snowball stemmer, whose English variant is also known as Porter2, is an improvement over the Porter stemmer. It is often described as slightly more aggressive than the Porter stemmer. Another thing to note is that the Porter stemmer supports only English, whereas the Snowball stemmer supports multiple languages.
To check the languages supported by this stemmer let’s run the following code.
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)
Output: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
From the output we can see that Snowball Stemmer is a multi-lingual stemmer.
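One entry in that tuple, “porter”, is not a language: it selects the original Porter algorithm through the same `SnowballStemmer` interface, while “english” selects the newer Snowball English stemmer. The short sketch below, which assumes NLTK is installed, runs both on a single word; the variable names are only illustrative.

```python
from nltk.stem.snowball import SnowballStemmer

# "english" selects the Snowball English stemmer (also known as Porter2).
english_stemmer = SnowballStemmer("english")
# "porter" selects the original Porter algorithm, not a language.
porter_stemmer = SnowballStemmer("porter")

word = "generously"
english_stem = english_stemmer.stem(word)
porter_stem = porter_stemmer.stem(word)

print(english_stem)  # generous
print(porter_stem)   # gener
```

This makes it easy to switch between the two algorithms, or between languages, by changing a single string.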
Let’s apply the previous plural-singular example to this stemmer and observe the outputs.
from nltk.stem.snowball import SnowballStemmer

stemmer_2 = SnowballStemmer(language="english")
In the above snippet, first as usual we import the necessary packages. Here we are interested in the Snowball stemmer. Next, we initialize the stemmer. If you notice, here we are passing an additional argument to the stemmer called language and specifying English.
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned',
           'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization',
           'traditional', 'reference', 'colonizer', 'plotted', 'having',
           'generously']

singles = [stemmer_2.stem(plural) for plural in plurals]
print(' '.join(singles))
After defining the list of plural words, now we use the stemmer to stem each word in the list.
It’s time to check the output and compare the performance of this stemmer with the previous stemmer.
Output: caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous
As you can see, most of the words have been stemmed in the same way as the Porter stemmer. The word “generously” however has been stemmed to its base form “generous” correctly, as opposed to the Porter stemmer where it was stemmed to “gener”.
The above example suggests why the Snowball stemmer is generally considered an improvement over the Porter stemmer.
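Comparing two long output strings by eye is error-prone, so the comparison can also be automated. The sketch below does not re-run the stemmers; it simply reuses the word list and the two output strings quoted in this article and reports the words on which the stemmers disagree. The variable names are only illustrative.

```python
words = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned',
         'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization',
         'traditional', 'reference', 'colonizer', 'plotted', 'having',
         'generously']

# Outputs copied from the two runs shown in this article.
porter_output = ("caress fli die mule die agre own humbl size meet state "
                 "siez item tradit refer colon plot have gener")
snowball_output = ("caress fli die mule die agre own humbl size meet state "
                   "siez item tradit refer colon plot have generous")

porter_stems = porter_output.split()
snowball_stems = snowball_output.split()

# Keep only the words where the two stemmers produced different stems.
differences = [
    (word, porter_stem, snowball_stem)
    for word, porter_stem, snowball_stem
    in zip(words, porter_stems, snowball_stems)
    if porter_stem != snowball_stem
]

print(differences)  # [('generously', 'gener', 'generous')]
```

On this list the two stemmers differ on exactly one word out of nineteen, so a conclusion drawn from it should be treated as indicative rather than as a benchmark.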
In this article, we were introduced to Stemming and how it proves to be beneficial in data preprocessing. We also learned about two popular and widely used stemmers and saw how to use them. In the next article, we will learn about lemmatization.
Thanks for reading.