In the previous two articles, we discussed two algorithms for converting text into mathematical representations. Once the text is in a suitable mathematical form, how can we tell whether two or more texts are similar or dissimilar? Similarity measures help us answer this.
The problem, then, is to determine whether two or more text documents are similar.
One way to approach this is to compare the words in the documents. If most of the words in two documents are the same, the documents are likely to be similar. If the documents share few words, they are likely to be dissimilar.
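To make the word-overlap idea concrete, here is a minimal sketch. The helper name shared_word_fraction and the three toy sentences are assumptions made for illustration only. The ratio it computes (shared distinct words divided by all distinct words) is commonly called the Jaccard index; it is not cosine similarity, which we build up to next.

```python
def shared_word_fraction(doc_a, doc_b):
    """Fraction of distinct words that occur in both documents (hypothetical helper)."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Shared distinct words divided by all distinct words seen in either document.
    return len(words_a & words_b) / len(words_a | words_b)

# Toy documents, made up for this illustration.
doc1 = "the cat sat on the mat"
doc2 = "the cat lay on the rug"
doc3 = "stock prices fell sharply today"

print(shared_word_fraction(doc1, doc2))  # shares "the", "cat", "on"
print(shared_word_fraction(doc1, doc3))  # shares no words
```

A simple overlap like this ignores how often each word occurs, which is one reason to move to a vector-based measure.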
Cosine similarity
We have learned about vectors and matrices. We know that vectors are one-dimensional arrays. They can also be drawn geometrically, and when we plot two vectors we can measure the angle between them.
One important thing to note about vectors in a geometrical representation is that they have both magnitude and direction. This means that we can measure the angle between two vectors and use it to decide whether the vectors are similar or not. How do we do that?
We take the cosine of the angle between the vectors. Recall from high school maths that the cosine of 0 degrees is 1, the cosine of 90 degrees is 0, and the cosine of 180 degrees is -1. We will need this information later on.
Cosine similarity is the cosine of the angle between two vectors. Its value always lies in the range -1 to +1. A value of +1 indicates that the vectors under consideration point in exactly the same direction, so they are perfectly similar. A value of -1 indicates that the vectors point in opposite directions. A value of 0 indicates that the vectors are perpendicular, so they have nothing in common.
For documents, this means that a cosine similarity close to +1 tells us the documents are very similar, while a value close to 0 tells us they have little in common. Vectors built from word counts never contain negative numbers, so for such vectors the score stays between 0 and +1.
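The three reference cases can be checked numerically. The sketch below is illustrative only: the helper name cosine and the small two-dimensional vectors are assumptions chosen so that the vectors point in the same direction, are perpendicular, or point in opposite directions.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors (illustrative helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine([1, 2], [2, 4])         # same direction, different length
orthogonal = cosine([1, 0], [0, 3])   # perpendicular vectors
opposite = cosine([1, 2], [-1, -2])   # opposite direction

# Expect values close to 1, 0 and -1 (up to floating-point rounding).
print(same, orthogonal, opposite)
```

Note that the first pair has different magnitudes but still scores close to 1: cosine similarity depends only on direction, not on length.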
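The formula for calculating cosine similarity is given by

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

In the above formula, A and B are two vectors with n components each. The numerator is the dot product (or scalar product) of these vectors, and the denominator is the product of their magnitudes. When we divide the dot product by the product of the magnitudes, we get the cosine of the angle between the vectors.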
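As a small worked example of this calculation, take two three-dimensional vectors. The vectors A = (1, 2, 2) and B = (2, 1, 2) are chosen here purely for illustration because their magnitudes come out as whole numbers.

```latex
\begin{aligned}
A \cdot B &= (1)(2) + (2)(1) + (2)(2) = 8 \\
\lVert A \rVert &= \sqrt{1^2 + 2^2 + 2^2} = \sqrt{9} = 3 \\
\lVert B \rVert &= \sqrt{2^2 + 1^2 + 2^2} = \sqrt{9} = 3 \\
\cos\theta &= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{8}{3 \times 3} = \frac{8}{9} \approx 0.889
\end{aligned}
```

A score of about 0.889 tells us these two vectors point in broadly the same direction.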
Cosine similarity is very useful in NLP for a lot of tasks. These tasks include Semantic Textual Similarity (STS), Question-Answering, document summarization, etc. It is a fundamental concept in NLP.
Cosine similarity using Python
Finding cosine similarity between two vectors
First, we implement the above-mentioned Cosine similarity formula using Python code. Then we’ll see an example of how we can use it to find the similarity between two vectors.
We first start with the imports.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
After importing, we define a function as follows. This cosine_similarity function takes two parameters, which are the two vectors to compare, and converts them into NumPy arrays.
After converting them into NumPy arrays, on the last line we implement the same formula we discussed in the theory section of this article. We first take the dot product of the two vectors and then divide it by the product of their magnitudes.
To compute the magnitude of a vector in Python, we square each of its elements, sum the squares, and take the square root. The two magnitudes are then multiplied together to form the denominator.
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))
Now that we have defined our function, we take two arrays as vectors and try to find the cosine similarity between them.
Define two arrays as follows. We define d1 and d2 as two vectors. Then we convert these vectors into NumPy arrays. On the last line, we print these arrays.
d1 = (5,0,5,6,3,9,8,7,5,6)
d2 = (3,0,2,4,6,9,8,5,2,1)
d1 = np.array(d1)
d2 = np.array(d2)
print(d1)
print(d2)
Output:
[5 0 5 6 3 9 8 7 5 6]
[3 0 2 4 6 9 8 5 2 1]
We now call the cosine similarity function we had defined previously and pass d1 and d2 as two vector parameters. This will give the cosine similarity between them. To calculate the cosine similarity, run the code snippet below.
cosine_similarity(d1, d2)
Output:
0.9074362105351957
On observing the output we come to know that the two vectors are quite similar to each other. As we had seen in the theory, when the cosine similarity is close to 1 it means the two vectors are very similar.
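As a sanity check, the same score can be reproduced in two other ways. The sketch below is an optional cross-check rather than part of the original walkthrough: it uses np.linalg.norm for the magnitudes and scikit-learn's own cosine_similarity, imported under the alias sk_cosine_similarity (an alias chosen here to avoid clashing with our function name).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

d1 = np.array([5, 0, 5, 6, 3, 9, 8, 7, 5, 6])
d2 = np.array([3, 0, 2, 4, 6, 9, 8, 5, 2, 1])

# Manual formula, using np.linalg.norm for the two magnitudes.
manual = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

# scikit-learn expects 2-D inputs (one row per vector) and returns a matrix,
# so we reshape each vector into a single row and read the only entry.
library = sk_cosine_similarity(d1.reshape(1, -1), d2.reshape(1, -1))[0, 0]

print(manual, library)
```

Both values should agree with the output of our own function up to floating-point rounding.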
Another important thing to note in this example is that both vectors have the same size. If you pass in vectors of two different lengths, NumPy raises a ValueError because the shapes cannot be aligned for the dot product. Hence, to calculate the cosine similarity, both vectors need to be of the same size.
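The size requirement is easy to see in isolation. The following minimal sketch uses two made-up arrays of lengths 3 and 4 and catches the error raised by np.dot, which is the first operation in our formula.

```python
import numpy as np

short_vector = np.array([1, 2, 3])
long_vector = np.array([1, 2, 3, 4])

try:
    np.dot(short_vector, long_vector)
    mismatch_error = None
except ValueError as err:
    # NumPy refuses to take the dot product of vectors with different lengths.
    mismatch_error = err
    print("ValueError:", err)
```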
Finding cosine similarity between documents in a corpus
We now define a text corpus, text, made up of three documents. We have chosen these documents from Wikipedia pages. If you read the documents you will find that the first and the second document belong to the same topic, Trigonometry, while the third one is on an unrelated topic.
So when we calculate the cosine similarity, we expect the cosine similarity score to be higher for documents one and two and less for other combinations.
text = (""" Trigonometry is a branch of mathematics that studies relationships between side lengths and angles of triangles The field emerged in the Hellenistic world during the 3rd century BC from applications""",
""" Driven by the demands of navigation and the growing need for accurate maps of large geographic areas trigonometry grew into a major branch of mathematics Bartholomaeus Pitiscus was the first""",
""" One of Los Angeles oldest continuing operating restaurants The Apple Pan is also notable as the basis for the popular Johnny Rockets restaurant chain Johnny Rockets founder Ronn Teitlebaum claimed""")
We next convert the corpus to a series object for easier computation. Then we define the same cosine similarity calculation function again.
corpus = pd.Series(text)
# Cosine Similarity Calculation
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))
We now want to convert our text corpus into a suitable mathematical representation. So for this purpose, we use the CountVectorizer here. Feel free to try and experiment by using TF-IDF vectorizer too.
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
So our bow_matrix is a sparse matrix that contains the representation we need.
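For readers who want to try the TF-IDF variant mentioned above, here is a minimal sketch. The three short documents in docs are made up for illustration and are not the article's corpus. It shows that TfidfVectorizer is a drop-in replacement for CountVectorizer and that both return SciPy sparse matrices of the same shape.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for this illustration.
docs = [
    "trigonometry is a branch of mathematics",
    "trigonometry studies angles of triangles",
    "the restaurant serves apple pie",
]

count_matrix = CountVectorizer().fit_transform(docs)   # raw word counts
tfidf_matrix = TfidfVectorizer().fit_transform(docs)   # TF-IDF weights

# Both results are sparse, with one row per document and one column per word.
print(issparse(count_matrix), count_matrix.shape)
print(issparse(tfidf_matrix), tfidf_matrix.shape)
```

Because only the weights change and not the vocabulary, the rest of the walkthrough works the same way with either matrix, although the similarity scores will differ.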
We now quickly check the words in our vocabulary.
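# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names_count = vectorizer.get_feature_names_out().tolist()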
feature_names_count
Output:
['3rd',
'accurate',
'also',
'and',
'angeles',
'angles',
'apple',
'applications',
'areas'
...
Now we take a look at the mathematical representation of the three documents, that is, of our whole corpus.
features_array_count = bow_matrix.toarray()
features_array_count
We quickly check the shape of our resulting matrix.
bow_matrix.shape
Output:
(3, 67)
Now, look at the code snippet below. We write a nested for loop. In the outer loop, we iterate over the documents to choose the first document vector to be given to the cosine similarity function. In the inner loop, we iterate over the remaining documents to choose the second document vector. The inner loop starts at i + 1, so each pair is compared only once and a document is never compared with itself.
Inside the inner loop, we print a message and call our previously defined cosine similarity function, passing the first and second document vectors selected by the two loop indices.
So basically, we get a cosine similarity score for every pair of distinct documents in the corpus. Run the code below to get the output.
for i in range(bow_matrix.shape[0]):
    for j in range(i + 1, bow_matrix.shape[0]):
        print("The cosine similarity between the documents ", i, "and", j, "is: ",
              cosine_similarity(bow_matrix.toarray()[i], bow_matrix.toarray()[j]))
We now look at the output.
The cosine similarity between the documents 0 and 1 is: 0.48782135766494206
The cosine similarity between the documents 0 and 2 is: 0.3119251469460218
The cosine similarity between the documents 1 and 2 is: 0.32101211891111664
The output is just as we had expected it to be. If you recall, the first two documents are on a similar topic, and the output shows that the cosine similarity score between documents 0 and 1 is higher than the score for either of the other two pairs.
This is because documents 0 and 1 share more words, so their count vectors point in more similar directions. The angle between them is smaller, and hence the score is higher than for the other document pairs.
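The nested loop can also be replaced by a single library call. The sketch below is one possible alternative, not the method used above: it relies on scikit-learn's cosine_similarity (imported under the alias sk_cosine_similarity, chosen here to avoid clashing with our own function) and on a small made-up corpus, docs, rather than the article's documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Toy corpus, made up for this illustration.
docs = [
    "trigonometry is a branch of mathematics",
    "trigonometry studies angles of triangles",
    "the restaurant serves apple pie",
]

bow = CountVectorizer().fit_transform(docs)

# Passing a single matrix compares every row with every other row.
# The sparse matrix is accepted directly, so no toarray() call is needed.
similarity_matrix = sk_cosine_similarity(bow)

print(similarity_matrix.round(3))
```

Entry [i, j] of the result is the score between documents i and j, so the matrix is symmetric and its diagonal holds each document's similarity with itself.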
Final thoughts
In this article, we have seen one of the most commonly used similarity measures in NLP, cosine similarity. We went over its mathematical basis, computed the similarity score between a pair of vectors using Python, and finally saw how to find the similarity scores between documents in a corpus.
Thanks for reading.