
Doc2Vec in Natural Language Processing

In the previous articles, we have seen how to generate vectors for words in the form of word embeddings. For that task, we had used the Word2Vec model. But what if we want to generate embeddings for a whole paragraph or the whole document?

Doc2Vec

Paragraph-level embeddings can be generated with the help of Doc2Vec. Unlike Word2Vec, which only learns word representations, Doc2Vec also learns a representation for the whole document. That is the main difference between Word2Vec and Doc2Vec.

Doc2Vec is an unsupervised model, just like Word2Vec. You can think of it as an extension of Word2Vec: in Doc2Vec, we train the model to predict words in the document.

We now look at how the document vectors are generated. There are two ways to build paragraph vectors.

The Distributed Memory Model of Paragraph Vectors (PV-DM)

This is one way to generate document- or paragraph-level vectors. If you recall, we learned about the CBOW (Continuous Bag of Words) method for training the Word2Vec model in the previous articles. The PV-DM method works just like the CBOW method.

Just like in CBOW, in PV-DM we try to predict a target word from the context words given as input.

[Image: PV-DM (Distributed Memory) architecture]

If you look at the image above, the context words given as input are “the”, “cat”, and “sat”, and the word the model predicts is “on”. The additional block you see is the paragraph ID, which also takes part in the prediction: its paragraph vector is concatenated with the context word vectors, and this combined representation is used to predict the target word.

Alternatively, we can take the average of the paragraph vector and the context word vectors and use this average to predict the target word.
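
To make these two combination modes concrete, here is a tiny NumPy sketch. This is not Gensim's actual implementation, and the names and sizes in it are made up purely for illustration; it only shows how averaging versus concatenating the paragraph vector with the context word vectors changes the input used to predict the target word.

import numpy as np

# Toy illustration only: an embedding size of 5 and the 3 context words "the cat sat".
rng = np.random.default_rng(0)
paragraph_vec = rng.normal(size=5)      # vector for the paragraph ID
context_vecs = rng.normal(size=(3, 5))  # vectors for the 3 context words

# Averaging: paragraph and word vectors must share the same size.
avg_input = np.mean(np.vstack([paragraph_vec, context_vecs]), axis=0)

# Concatenation: the combined input grows to (1 + 3) * 5 = 20 values.
concat_input = np.concatenate([paragraph_vec, *context_vecs])

print(avg_input.shape)     # (5,)
print(concat_input.shape)  # (20,)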

The Distributed Bag-of-Words Model of Paragraph Vectors (PV-DBOW)

This method is analogous to the Skip-gram training method used in the Word2Vec model. In the Skip-gram method, we use the target word as input to predict the context words.

The PV-DBOW method is analogous to the Skip-gram method, but here we use the paragraph or document vectors to predict the context words. This is illustrated in the image below.

[Image: PV-DBOW (Distributed Bag of Words) architecture]

As you can see in the image above, in this training method the paragraph or document vector alone is used as input to predict words from the paragraph, such as “the”, “cat”, “sat”, and “on”. The paragraph vector is thus trained by predicting words sampled from the paragraph itself.

Of the two methods mentioned above, PV-DBOW is the simpler and more memory-efficient one, while PV-DM usually gives better results. It is also recommended to combine the representations obtained from both models to get an even better result.
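
As a rough sketch of this recommendation, the snippet below trains one model of each type with Gensim (the library we use in the rest of this article) on its small common_texts sample corpus and simply concatenates the vectors the two models infer for the same document. The parameter values here are placeholders chosen only for illustration; the data preparation is explained in detail in the next section.

import numpy as np
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag the sample corpus (this step is explained in detail below).
docs = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]

# Train one PV-DM model (dm=1) and one PV-DBOW model (dm=0).
pv_dm = Doc2Vec(docs, dm=1, vector_size=5, min_count=1, epochs=40)
pv_dbow = Doc2Vec(docs, dm=0, vector_size=5, min_count=1, epochs=40)

# Combine the two representations by concatenating the inferred vectors.
tokens = ['user', 'interface', 'for', 'computer']
combined = np.concatenate([pv_dm.infer_vector(tokens), pv_dbow.infer_vector(tokens)])
print(combined.shape)  # (10,): one half from each model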

Building a Doc2Vec model

For building a Doc2Vec model we will be making use of the Gensim library. We begin by doing the necessary imports.

We install the latest version of Gensim, which is 4.0.1 as of the time of writing this article. Then we import common_texts, a small sample corpus provided by Gensim. On the last line, we import the Doc2Vec model and the TaggedDocument class.

!pip install gensim==4.0.1
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

First, let’s check the sample corpus. We print it out.

common_texts

# Output:
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

The corpus contains several small documents. Notice that these have already been tokenized. We will be working with this data.

The Doc2Vec model requires the input data to be in a certain format, which is why we imported the TaggedDocument class. It pairs each document with a tag, converting the corpus into a list of TaggedDocument objects.

In the code snippet below, we use a list comprehension with the enumerate function to loop over the corpus, getting each document together with its index, and pass both to TaggedDocument.

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
documents

Let us observe the format in the output that we feed to the Doc2Vec model.

[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
 TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]),
 TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]),
 TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]),
 TaggedDocument(words=['user', 'response', 'time'], tags=[4]),
 TaggedDocument(words=['trees'], tags=[5]),
 TaggedDocument(words=['graph', 'trees'], tags=[6]),
 TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]),
 TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

As you can see, the documents are now in the TaggedDocument format, where each entry holds the words and their tags. The tag of each document is simply its index in the corpus.
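
As a side note, the tags do not have to be integer indexes. Any unique strings work as well, and the trained document vectors can then be looked up by those strings; the snippet below is just an optional variant with a made-up doc_ prefix.

named_documents = [TaggedDocument(doc, [f'doc_{i}']) for i, doc in enumerate(common_texts)]
named_documents[0]  # same document as before, but tagged 'doc_0' instead of 0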

We now want to feed this data to the model and train a Doc2Vec model on this data. We do that by using the 2 lines of code given below.

In the first line, we pass the documents that we prepared in the previous step as a parameter to the model. We set vector_size to 5 to indicate that each document vector should have a size of 5, i.e. the document vector will contain 5 elements.

We set the min_count parameter to 1 so that words appearing at least once in the corpus are used, and we set workers to 4 to use 4 threads for the computation. Lastly, we set the epochs parameter to 40, i.e. the model will run for 40 iterations over the corpus. On the next line, we train the model.

# Passing documents to the constructor already builds the vocabulary and trains the model.
model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs=40)
# Calling train() explicitly runs additional training epochs on the same data.
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Now that our model is trained, we will check the vector_size of each document.

model.vector_size

# Output:
5

The size of each document vector is 5.

We check the number of documents used for training the model.

len(model.dv)

# Output:
9
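
Since each document was tagged with its index, we can also look up the vector the model learned for any training document directly from model.dv, for example for the first document:

model.dv[0]        # the 5-dimensional vector learned for document 0
model.dv[0].shape  # (5,)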

We now check the vocabulary size.

len(model.wv.index_to_key)

# Output:
12

We have 12 words in our vocabulary. We now check those words by printing them out.

model.wv.key_to_index

# Output:
{'computer': 9,
 'eps': 5,
 'graph': 1,
 'human': 11,
 'interface': 10,
 'minors': 4,
 'response': 7,
 'survey': 8,
 'system': 0,
 'time': 6,
 'trees': 2,
 'user': 3}

As you can see, these are the 12 words in our vocabulary. The numbers next to them are not their frequencies but their indices (Gensim assigns lower indices to more frequent words).
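
The trained word vectors themselves live in model.wv, just like in Word2Vec. For example, we can look up the vector of any vocabulary word, and the actual corpus frequency of a word is stored as its count attribute rather than in key_to_index:

model.wv['system']                       # the 5-dimensional word vector for 'system'
model.wv.get_vecattr('system', 'count')  # how many times 'system' appears in the corpus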

Now that we have trained our Doc2Vec model on a corpus, we will test it on a sentence that is not present in our training corpus. Note that most of its words are present in our vocabulary; words that are not (such as “for”) are simply ignored during inference. We expect the vector size of this new output to be 5, as we set while training.

vector = model.infer_vector(['user', 'interface', 'for', 'computer'])

Checking the output.

vector

# Output:
array([-0.01459823, -0.07689106,  0.0033133 ,  0.0939738 , -0.05971007], dtype=float32)

We see that the model works as expected and gives us a representation of the new document with the correct vector size, i.e. 5. Note that infer_vector performs a small training run of its own, so calling it again on the same tokens will return a slightly different (but similar) vector.
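
With the inferred vector in hand, we can also ask the model which training documents are closest to it. model.dv supports the usual most_similar lookup, where the inferred vector is passed inside a list:

model.dv.most_similar([vector], topn=3)  # list of (tag, cosine similarity) pairs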

Now we look at how to switch between the PV-DM and PV-DBOW methods. To use PV-DM for training, set the dm parameter to 1; to use PV-DBOW, set dm to 0.

In the code snippet below, we set vector_size to 50, min_count to 2 so that only words appearing at least twice in the data are added to the vocabulary, and epochs to 40. We also set the window parameter to 2, which determines how many context words around the target word are taken into consideration.

Continuing, we set dm to 1, so we use the PV-DM method for training. We set the learning rate alpha to 0.3 and min_alpha to 0.05, which means training starts with a learning rate of 0.3 that decays to 0.05 as training progresses.

Finally, we set dm_concat to 1, which means the document vector is concatenated with the context word vectors to predict the target word, as discussed in the theory section. When concatenation is not used (dm_concat=0), you can instead set dm_mean to 1 to take the mean of the context vectors, or to 0 to take their sum.

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05, dm_concat=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

We now apply the model to get the vectors of a new sentence.

vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

After running the above code block, we get an output as follows.

[ 0.04339316 -0.20673728 -0.11588623  0.01476944  0.1611296   0.06557784
  0.03722077 -0.2631285  -0.00124143  0.17288627 -0.11253629 -0.01180305
 -0.18481328 -0.16214329 -0.00626908 -0.15843567  0.04118129  0.08459863
 -0.04379971 -0.18484151  0.01508713  0.02935403 -0.08522391  0.02278906
 -0.23046787 -0.07823569 -0.26988807 -0.00495168  0.0727071  -0.16758423
  0.17579406 -0.03017909  0.05711632 -0.2073381   0.08806376 -0.14307384
  0.06064056 -0.11361335 -0.01560404  0.0335197   0.19779573 -0.00643344
  0.22183537 -0.22460902  0.0513732  -0.07903808  0.04093904 -0.00644936
  0.0055754  -0.14519435]

As you can see the vector length is 50. We have successfully built a Doc2Vec model that we can use to vectorize any document or paragraph.

Here is the code block for the complete code used in this article.

!pip install gensim==4.0.1
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

common_texts

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
documents

model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs = 40)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

model.vector_size
len(model.dv)
len(model.wv.index_to_key)
model.wv.key_to_index

vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
vector

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05, dm_concat=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

Final Thoughts

In this article, we covered what Doc2Vec is, how it differs from Word2Vec, and the two ways of training a Doc2Vec model, and finally we built a custom Doc2Vec model and trained it on a sample corpus.

Thanks for reading.