Lemmatization is the process wherein the context is used to convert a word to its meaningful base or root form. Now, let’s try to simplify the above formal definition to get a better intuition of Lemmatization.
The word “Lemmatization” is itself built on the base word “lemma”. In linguistics (a field of study on which NLP is based), a lemma is a meaningful base or root word that forms the basis for other words. For example, the lemma of the words “playing” and “played” is “play”.
In the previous article where we covered stemming, the base form of the word was called a “stem” and here in Lemmatization, it is called a “lemma”. So what’s the difference between a stemmed word and a lemmatized word?
Difference between stemming and lemmatization
As you may remember from the previous tutorial, many of the stemmed words we looked at didn’t make sense. They were invalid words. For example, the word “computer” was stemmed to “comput”. Even my spell checker flags “comput” as an invalid word.
Thus stemming may produce invalid words, whereas lemmatization always produces meaningful words. The reason is that a lemmatizer checks each word against a dictionary and returns its dictionary form.
Another difference between stemming and lemmatization is that in stemming the words are reduced to their “stems” using crude methods like chopping off “ing”, “ed”, “er”, etc. from the words. In contrast, when we lemmatize a word we are checking for the dictionary form of the word. This helps us to lemmatize the word “studies” to “study” correctly.
Let us take an example to understand how these two algorithms differ in their work. Suppose we want to reduce the word “better” to its root form. We know that the root form of “better” is “good”. A stemming algorithm may try to chop off affixes (suffixes and prefixes) of the word “better”, resulting in “bett”, “bet”, or simply “better”.
The lemmatization algorithm however will check the word “better” against a dictionary and it will sense that the word “better” actually has a base form called “good”. A stemming algorithm will not be able to return the root form in this way but a lemmatization algorithm will.
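To make this concrete, here is a minimal sketch comparing NLTK’s Porter stemmer with the WordNet lemmatizer (it assumes NLTK is installed and the WordNet data has been downloaded, something we set up properly a little later):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer chops off suffixes using rules; the lemmatizer looks words up in WordNet.
print(stemmer.stem("computer"))                 # 'comput' -- not a valid word
print(lemmatizer.lemmatize("computer"))         # 'computer'
print(stemmer.stem("studies"))                  # 'studi'
print(lemmatizer.lemmatize("studies"))          # 'study'
print(stemmer.stem("better"))                   # 'better' -- the stemmer cannot reach 'good'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'   -- dictionary lookup as an adjective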
This is essentially the difference between the stemming and the lemmatization algorithm.
Okay! Enough theory, let’s get coding.
Lemmatization in action
One of the most commonly used lemmatizers is the WordNet lemmatizer. Other widely used lemmatizers include the spaCy lemmatizer, the TextBlob lemmatizer, and the Gensim lemmatizer. Let’s start with the WordNet lemmatizer.
Using the WordNet lemmatizer
WordNet is a lexical database of the English language. Imagine it as a huge dictionary containing all prominent English words along with their meanings. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of related words. This helps in the lemmatization process.
WordNet can be downloaded and used, and the NLTK library provides us with an interface to do just that.
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer
Output:
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
In the above code snippet, we import NLTK and then download the WordNet database. On the third line, we import the WordNet lemmatizer. The output tells us that we are ready to perform lemmatization.
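Since the database is now on disk, we can also take a quick, optional peek inside it. For example, the word “play” belongs to several synsets, i.e. sets of words grouped together by meaning and part of speech:
from nltk.corpus import wordnet

# The first few synsets that contain the word "play"
print(wordnet.synsets("play")[:3])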
lemmatizer = WordNetLemmatizer()
sentence = "We are putting in the efforts to enhance our understanding of Lemmatization"
tokens = sentence.split()
print("tokens :", tokens)
In this piece of code, we first initialize the WordNet lemmatizer. Next, we define a sentence and assign it to a variable called sentence. Then we tokenize the sentence using the split() method. Finally, we print the tokens.
Let’s look at the output.
Output:
tokens : ['We', 'are', 'putting', 'in', 'the', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
The output shows our tokenized sentence. Now, let’s apply lemmatization to these tokens.
lemmatized_tokens = " ".join([lemmatizer.lemmatize(token) for token in tokens])
lemmatized_tokens
Here, we lemmatize each token in a list comprehension and use the join() method to join the lemmatized tokens back into a sentence.
Output:
'We are putting in the effort to enhance our understanding of Lemmatization'
Observing the output, we see that the word “efforts” has been converted to “effort”.
Even though there isn’t much difference between the input sentence and the lemmatized output, we see that every resulting word is a valid, meaningful word.
But the question arises: can we do better? Yes, we can! To do better, we will have to pass the part of speech (POS) of each word to the lemmatizer. Don’t worry about what parts of speech are for now; we’ll cover them in detail in the next article. Just remember that parts of speech are classes like nouns, verbs, etc. that make up a sentence.
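For a quick taste of why this matters, here is what the WordNet lemmatizer does when we hint the part of speech ourselves (its pos argument accepts 'n', 'v', 'a', and 'r' for noun, verb, adjective, and adverb):
# By default the WordNet lemmatizer treats every word as a noun.
print(lemmatizer.lemmatize("are"))           # 'are' -- left unchanged as a noun
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'  -- correct once we say it is a verb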
Let’s see how to compute the parts of speech.
First, let’s download the tagger. The tagger we will be using is the “averaged_perceptron_tagger”, which performs decently. On the next line, we pass our tokens to this tagger and observe the output.
Again, we will cover POS tagging in detail in the next article, so focus only on lemmatization here.
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(tokens)
pos_tags
Observe the output. We can see that the tokens are classified as nouns (NN), verbs (VBP), prepositions (IN), etc.
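If we wanted to stay within NLTK, one possible way to wire these tags into the WordNet lemmatizer is to map the Penn Treebank tags returned by nltk.pos_tag() onto WordNet’s POS constants. The small helper below is a rough sketch of that idea (it is not part of NLTK itself):
from nltk.corpus import wordnet

def to_wordnet_pos(treebank_tag):
    # Rough mapping from Penn Treebank tags to WordNet POS constants
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # fall back to noun, like the lemmatizer's default

print(" ".join(lemmatizer.lemmatize(token, to_wordnet_pos(tag)) for token, tag in pos_tags))
With this in place, a verb like “are” is reduced to “be” instead of being left untouched. It works, but it takes a bit of manual plumbing.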
Now that we know something about POS tags and how to compute them, let’s see how to lemmatize words more accurately. For that, we will use another library that makes our work easier for us.
Using the spaCy lemmatizer
Using the spaCy lemmatizer will make it easier for us to lemmatize words more accurately. We will take the same sentence we had taken previously and this time use spaCy to lemmatize.
First, let’s import the library and load the model.
import spacy
nlp = spacy.load('en')  # on spaCy 3.x and later, load 'en_core_web_sm' instead
Next, let’s pass our sentence to the model and lemmatize the tokens. One important thing to note here is that spaCy automatically takes care of tokenizing the sentence; we don’t have to tokenize the text manually.
doc = nlp("We are putting in the efforts to enhance our understanding of Lemmatization")
" ".join([token.lemma_ for token in doc])
In the above code snippet, we are passing the same sentence used previously to the spaCy model. The model processes the text and saves it as tokens in the variable “doc”. In the last line, we are lemmatizing the tokens and joining them together.
Let’s check the output.
Output:
'-PRON- be put in the effort to enhance -PRON- understanding of lemmatization'
If you compare this output with the previous one, you will see that “are” is lemmatized to “be”, “putting” to “put”, and “efforts” to “effort”. The pronouns “We” and “our” appear as “-PRON-”, which is the placeholder lemma that spaCy (version 2.x) uses for all pronouns. This is a major improvement, and it was possible because the spaCy model also took care of POS tagging for us behind the scenes, which is how “are” could be lemmatized to “be”.
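If you are curious about what happened behind the scenes, each spaCy token carries both the POS tag and the lemma the model assigned to it, so we can print them side by side:
# Inspect the POS tag and lemma spaCy assigned to each token
for token in doc:
    print(token.text, token.pos_, token.lemma_)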
Final thoughts
In this article, we went through the meaning of lemmatization and how it differs from stemming. We also worked through some practical examples and found that, in order for lemmatization algorithms to perform better, they also need to be given the parts of speech. In the next article, we will focus on parts of speech.
Thanks for reading.