In the previous article, we learned about word embeddings and got a glimpse of the Word2Vec model. If you recall, we used a pre-trained model from Google, which was great, but what if you want to train your own Word2Vec model on your own dataset? That is what we will learn in this tutorial.
Ways to train a Word2Vec model
The Word2Vec model is an unsupervised method that uses a shallow, two-layer neural network as the basis of its architecture. A Word2Vec model can be trained using either of two approaches. They are as follows.
- The Skip-gram method
- The Continuous Bag-of-Words (CBOW) method
Let us discuss both these methods in detail.
The Skip-gram method
In the previous post, we discussed that word embeddings take the context, or neighborhood, of surrounding words into account. We now build on that knowledge: in the Skip-gram method of training a Word2Vec model, we predict the context words, taking the target word as input.
The target word is the word in a sentence whose relationship to the words around it we want to learn. Any words in its vicinity are called context words. In the sentence “A stitch in time saves nine”, we can treat “time” as the target word and the others as its context words.
How do we know how many words count as context? We define a parameter called window_size, which specifies how many words in the vicinity of the target word should be taken into consideration.
So the model essentially tries to predict the context words given a target word as input. This is what the model learns and refines during training, until it predicts the correct context words for a given target word.
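To make the Skip-gram setup concrete, here is a small sketch in plain Python that enumerates the (target, context) training pairs a window_size of 1 would produce. The helper skipgram_pairs is hypothetical and only for illustration; it is not part of Gensim.

```python
def skipgram_pairs(tokens, window_size):
    """Enumerate the (target, context) pairs the Skip-gram method
    derives from one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context words are the neighbors within window_size positions.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["A", "stitch", "in", "time", "saves", "nine"]
for target, context in skipgram_pairs(sentence, window_size=1):
    print(target, "->", context)
```

With window_size=1, the target “time” yields the pairs ("time", "in") and ("time", "saves"); a larger window_size would pull in more distant neighbors.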
The CBOW method
The CBOW method is the second method that is used for the training of the Word2Vec model. The working of this method is similar to the Skip-gram method.
The only difference between the Skip-gram method and the CBOW method is that here the model is given context words as input and tries to predict the target word. In the Skip-gram method, we passed the target word as input and tried to predict the context words.
So, in the CBOW method, the vector representations of the context words are passed to the model to predict the target word, and in the Skip-gram method, the vector representation of the target word is passed to the model to predict the context words.
In this article, we will train the Word2Vec model using the Skip-gram method.
Building the Word2Vec model
To build a Word2Vec model, we will use the same library we had used previously. That library is Gensim.
Begin by making sure that you have Gensim installed on your system. We then import the necessary libraries and packages.
!pip install gensim==4.0.1
from gensim.models import Word2Vec
After importing the libraries, next, we define a sample text document. This document contains a list of sentences which has already been tokenized.
sentences = [
    ["A", "language", "is", "a", "structured", "system", "of", "communication", "used", "by", "humans"],
    ["The", "scientific", "study", "of", "language", "is", "called", "linguistics"],
    ["Languages", "spoken", "in", "India", "belong", "to", "several", "language", "families"]
]
On printing out sentences, we see that it contains a list of sentences made up of tokenized words.
We now build a basic Word2Vec model using the data defined above. We call the Word2Vec class we had imported from Gensim and pass our sentences variable to it. Our Word2Vec model will be built on this data. As a second parameter, we pass min_count with a value of 1.
The min_count parameter specifies the minimum frequency a word must have to be included in the vocabulary. So when we set min_count to 1, we are telling the model to add every word that appears at least once in the data to the vocabulary.
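The effect of min_count can be previewed without training anything: counting raw token frequencies shows exactly which words a given threshold keeps. A small sketch in plain Python (note that tokens are case-sensitive, matching Gensim's behavior):

```python
from collections import Counter

sentences = [["A", "language", "is", "a", "structured", "system", "of",
              "communication", "used", "by", "humans"],
             ["The", "scientific", "study", "of", "language", "is",
              "called", "linguistics"],
             ["Languages", "spoken", "in", "India", "belong", "to",
              "several", "language", "families"]]

# Count raw token frequencies across all sentences.
counts = Counter(token for sentence in sentences for token in sentence)

# With min_count=1 every unique token survives; with min_count=2 only
# the tokens seen at least twice make it into the vocabulary.
vocab_min1 = [w for w, c in counts.items() if c >= 1]
vocab_min2 = [w for w, c in counts.items() if c >= 2]
print(len(vocab_min1))      # 24 unique tokens
print(sorted(vocab_min2))   # ['is', 'language', 'of']
```

This matches the vocabulary sizes we will see from Gensim below for min_count values of 1 and 2.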
model = Word2Vec(sentences, min_count=1)
On printing out the model, we come to know that it is a gensim Word2Vec model.
model output: <gensim.models.word2vec.Word2Vec at 0x7f49d0b7aa90>
We now look at the embedding of an individual word. We access the embedding of the word “scientific”, which is present in our data, by indexing into the model's wv attribute.
model.wv["scientific"]
The output is as follows.
From the above output, we see that our Word2Vec model provides a vector representation for each word in the data.
We now check the size of the vector that the model outputs for each word.
model.vector_size
Output: 100
We see that the default vector size that is set is 100. You can change this parameter according to your needs.
Now, we check the number of words in our vocabulary.
len(model.wv.index_to_key)
Output: 24
The output shows that our vocabulary consists of 24 unique words from the data we had given to the model.
We now print out our vocabulary.
model.wv.index_to_key
Output: ['language', 'is', 'of', 'families', 'several', 'a', 'structured', 'system', 'communication', 'used', 'by' ...
If you count the words in this vocabulary, you will find that there are 24, as we found previously.
Now, we tweak the parameters and observe the results.
First, we set min_count to 2, which means the model will only consider words that appear at least twice in the dataset.
Second, we change vector_size from the default of 100 to 300, so each word will be represented by a 300-dimensional vector.
model = Word2Vec(sentences, min_count=2, vector_size = 300)
After running the above code, we print out the embedding of the word “language”. Note that “language” is present three times in our data and is hence included in the vocabulary.
The output is as follows.
array([-1.78742412e-04, 7.88100588e-05, 1.70111656e-03, 3.00309109e-03, -3.10098333e-03, -2.37226952e-03, 2.15295702e-03, 2.99099600e-03, -1.67180935e-03, -1.25445763e-03, 2.46016821e-03, -5.11157501e-04, ...
Note that the output has 300 dimensions; I have included only a part of it here. In this model, each word in the document is represented by a vector of length 300.
We check the size of our vocabulary.
len(model.wv.index_to_key)
Output: 3
We check the vocabulary now.
model.wv.index_to_key
Output: ['language', 'of', 'is']
As you can see, our vocabulary now consists of only 3 words. Because min_count is set to 2, only these words were added to the vocabulary, since they appear at least twice in our data.
Finally, we check the vector size.
model.vector_size
Output: 300
In this article, we have continued learning about word embeddings. We learned about the ways we can train the Word2Vec models and we trained our own basic Word2Vec embedding model using the Gensim library.
Thanks for reading.