
Finding Similarity using Word Mover’s distance for NLP


In a previous article, we discussed and implemented cosine similarity. With its help, we could tell whether two documents were similar, and how close or far they are from one another.

Word mover’s distance

Just like cosine similarity, word mover’s distance is a method of establishing similarity or dissimilarity between documents. It is often more appropriate than the cosine similarity we discussed previously, especially when working with word embeddings.

Word mover’s distance defines the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to travel to reach the embedded words of the other document.

Let’s look at what this definition means with an example. Consider two sentences, which we can treat as two documents.

Sentence 1: Obama speaks to the media in Illinois.
Sentence 2: President greets the press in Chicago.

If you observe the two sentences, you will notice that their meaning is the same. Suppose we used cosine similarity to compare them.

If we remove the stopwords in these two sentences (to, the, in), they share no words at all. Cosine similarity would therefore give a poor result and conclude that the two sentences are dissimilar, because it works on the principle that similar documents contain similar words.
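To make this concrete, here is a minimal sketch in plain Python (using bag-of-words counts and the stopword list from above hard-coded) showing that the cosine similarity of these two sentences collapses to zero once stopwords are removed:

```python
from collections import Counter
import math

def cosine_bow(doc1, doc2, stopwords={"to", "the", "in"}):
    """Cosine similarity between bag-of-words count vectors."""
    c1 = Counter(w for w in doc1.lower().split() if w not in stopwords)
    c2 = Counter(w for w in doc2.lower().split() if w not in stopwords)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

print(cosine_bow("Obama speaks to the media in Illinois",
                 "President greets the press in Chicago"))  # 0.0 -- no shared content words
```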

But we know that even though these sentences don’t share words, their meaning is the same. In word mover’s distance, the embeddings of the words are used. As we know, embeddings are mapped in an n-dimensional space, and words with similar meanings are located close together in that space.


Word mover’s distance calculates the distance between the words of the two documents in this n-dimensional space. You can visualize the words of the first document traveling to the words of the second; the total distance they travel determines whether the documents are similar or not.

If the words need to travel only a short distance, the documents are similar; if they must travel farther, the documents are dissimilar.

Please note that the words don’t actually travel from one document to another. Rather, using a distance measure, the Euclidean distance between embeddings, the word mover’s algorithm finds the minimum cumulative distance between the documents.
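The idea can be sketched with a toy example. The real algorithm solves an optimal-transport problem in which a word’s weight may be split across several target words; for equal-length documents with uniform weights it reduces to finding the cheapest one-to-one matching, which is small enough here to brute-force. The 2-D “embeddings” below are invented purely for illustration:

```python
import math
from itertools import permutations

# Hypothetical 2-D embeddings, made up so that related words sit close together.
emb = {
    "obama": (1.0, 0.9),    "president": (1.1, 1.0),
    "speaks": (3.0, 0.2),   "greets": (3.1, 0.3),
    "media": (5.0, 2.0),    "press": (5.1, 2.1),
    "illinois": (7.0, 4.0), "chicago": (7.2, 4.1),
    "nokia": (0.0, 9.0),    "company": (2.0, 9.5),
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def toy_wmd(doc1, doc2):
    """Cheapest one-to-one matching between two equal-length token lists,
    averaged per word (a simplification of the real optimal-transport WMD)."""
    return min(
        sum(euclidean(emb[a], emb[b]) for a, b in zip(doc1, perm))
        for perm in permutations(doc2)
    ) / len(doc1)

close = toy_wmd(["obama", "speaks", "media", "illinois"],
                ["president", "greets", "press", "chicago"])
far = toy_wmd(["obama", "speaks"], ["nokia", "company"])
print(close, far)  # the paraphrase pair travels far less than the unrelated pair
```

In the full algorithm each document is a normalized bag-of-words distribution and the matching allows fractional flows, but the intuition is the same: semantically close documents require little total travel.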

Implementing word mover’s distance using gensim

We will use gensim to implement word mover’s distance. We begin by importing the libraries and packages. You can import them using the code snippet below.

import gensim
from gensim.models import KeyedVectors
import os

We use the same pre-trained model we had used in a previous article on word embeddings.

In the code snippet below, the first line downloads the model from Amazon S3, the second changes into the download directory, the third lists the directory contents, and the last sets the model path.

!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
os.chdir("/root/input")
!ls -al
model_path = '/root/input/GoogleNews-vectors-negative300.bin.gz'

If the above code runs successfully, the downloaded file should appear in the directory listing. Please note that I am using Colab to run this code.

We now load the downloaded model into memory using the code below, passing the model_path we set earlier.

model = KeyedVectors.load_word2vec_format(model_path, binary=True)

Now let’s define the sentences we had seen before. The first two sentences are the ones we had seen and have the same meaning. We also define a new third sentence that is different in meaning from the other two sentences. We will try to find how similar or dissimilar these sentences are using the word mover’s distance.

sentence_1 = "Obama speaks to the media in Illinois"
sentence_2 = "President greets the press in Chicago"
sentence_3 = "Nokia is my favorite company"
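Note that gensim’s wmdistance treats each document as a sequence of tokens, so documents are often lowercased and split into word lists before comparison. A minimal preprocessing helper might look like the sketch below (the stopword list is purely illustrative):

```python
def preprocess(sentence, stopwords={"to", "the", "in", "is", "my"}):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in sentence.lower().split() if w not in stopwords]

print(preprocess("Obama speaks to the media in Illinois"))
# ['obama', 'speaks', 'media', 'illinois']
```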

In the code snippet below we use the function .wmdistance() to calculate the word mover’s distance. We first calculate the word mover’s distance between sentences 1 and 2.

word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

Let’s check the output.

1.1642040735998236

Now we calculate the word mover’s distance between the first and the last sentence.

word_mover_distance = model.wmdistance(sentence_1, sentence_3)
word_mover_distance

We check the output.

1.2450872484500934

Comparing the two outputs above, we observe that the word mover’s distance between the first and second sentences is lower than the distance between the first and third. The lower the distance, the more similar the sentences, so sentences 1 and 2 are more similar than sentences 1 and 3.

We now normalize the word embeddings and apply the word mover’s distance again. Normalizing gives a better-calibrated measure of similarity. Normalize the embeddings with the help of the code below.

model.init_sims(replace=True)
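Under the hood, init_sims(replace=True) L2-normalizes every word vector so that all words lie on the unit sphere (in gensim 4.0+ this method is deprecated in favor of fill_norms()/get_normed_vectors()). The normalization itself is just:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as init_sims does to every word vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

With unit-length vectors, distances reflect only the direction of each embedding, not its magnitude, which tends to track word frequency rather than meaning.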

Now that we have normalized the embeddings, we calculate the distance as previously and observe the output.

First, we calculate the word mover’s distance between sentences 1 and 2. Remember that these sentences have similar meanings.

word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

The output for the above code is as follows.

0.4277553083600646

We now measure the distance between sentences 1 and 3. Remember that these sentences do not have similar meanings.

word_mover_distance = model.wmdistance(sentence_1, sentence_3)
word_mover_distance

Output:
0.45212283373349177

Comparing the two outputs, the sentences with similar meanings (sentences 1 and 2) again have a lower word mover’s distance than the sentences with different meanings (sentences 1 and 3).

Here is the complete code block.

import gensim
from gensim.models import KeyedVectors
import os

!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
os.chdir("/root/input")
!ls -al
model_path = '/root/input/GoogleNews-vectors-negative300.bin.gz'

model = KeyedVectors.load_word2vec_format(model_path, binary=True)

sentence_1 = "Obama speaks to the media in Illinois"
sentence_2 = "President greets the press in Chicago"
sentence_3 = "Nokia is my favorite company"

word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

word_mover_distance = model.wmdistance(sentence_1, sentence_3)
word_mover_distance

model.init_sims(replace=True)

word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

word_mover_distance = model.wmdistance(sentence_1, sentence_3)
word_mover_distance

Final thoughts

In this article on word mover’s distance, we first saw what word mover’s distance is and how it differs from cosine similarity, then how it is calculated, and finally how to implement it to find whether two documents are similar.