FastText in NLP

In previous articles, we discussed and built models for word embeddings and for document representations: Word2Vec models and Doc2Vec models, respectively. In this article, we discuss another approach to developing embeddings called FastText.

Intro to FastText

FastText is a free, open-source library provided by the Facebook AI Research (FAIR) team. It is a model for learning word embeddings, proposed by Bojanowski et al., researchers from Facebook.

If you recall, when discussing word embeddings we saw that there are two ways to train the model: the skip-gram model and the CBOW (Continuous Bag of Words) model. FastText was initially developed on top of the skip-gram method, but it now supports both skip-gram and CBOW.

Uses of FastText:

  • Very useful for finding semantic similarities
  • Can be trained on large datasets in minutes
  • Can be used for text classification

How does FastText vectorize text?

As we saw with the Word2Vec model, the model generates a vector for each word in the vocabulary. But if we test the trained model on a new sentence that contains words not present in the vocabulary, the model doesn’t know how to handle them.

So, the Word2Vec and Doc2Vec models we had seen previously rely heavily on the vocabulary they were trained on. Remember that the essence of training such an embedding model is that we first train it on a known corpus and then apply it to new text data.

Because Word2Vec and Doc2Vec rely on their training vocabulary, if the new text we want to vectorize contains words that were not present in that vocabulary, these models fail to vectorize the unseen words accurately. FastText overcomes this problem.

FastText does this by vectorizing each word as a combination of character n-grams. The key phrase to remember when working with FastText is character n-grams. If you don’t know what n-grams are, an n-gram is a contiguous sequence of n items (here, words) taken from a piece of text.

N-grams Example

For example, in the sentence “natural language processing”, the word bigrams (n = 2) are “natural language” and “language processing”.

Extending the same concept further, we can also understand what character n-grams mean. Just as n-grams work for words, character n-grams work for characters. Below are the two- and three-character n-grams for the word “language”:

la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge

As you can see, these are character n-grams: the same n-gram concept, applied to single characters instead of words. FastText generates representations for such character n-grams, and these in turn add up to form the embedding of a complete word. How is this helpful?
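A minimal sketch of extracting these character n-grams (note that the actual FastText implementation also wraps each word in `<` and `>` boundary markers before extracting n-grams, which we skip here for simplicity):

```python
def char_ngrams(word, n):
    # All contiguous character substrings of length n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "language"
print(char_ngrams(word, 2))  # ['la', 'an', 'ng', 'gu', 'ua', 'ag', 'ge']
print(char_ngrams(word, 3))  # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```

Interleaving the two lists reproduces the sequence shown above.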

Suppose we train the FastText model on a corpus, so the model is now familiar with a vocabulary. If we try to generate embeddings for a word that was absent from the vocabulary during training, the FastText model can still generate embeddings for the unseen word, because the word’s character n-grams were present in the vocabulary and the model has already captured that information.

Thus the biggest advantage of FastText over models such as Word2Vec is that, with the help of character n-grams, FastText can generate embeddings for sentences containing words not present in the training vocabulary, whereas other models fail to do so.
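To make the idea concrete, here is a toy sketch (not Gensim’s actual internals) of how an unseen word’s vector can be assembled from stored n-gram vectors; the random vectors here are stand-ins for the vectors a real model learns during training:

```python
import numpy as np

# Toy n-gram vector store, standing in for vectors learned during training.
rng = np.random.default_rng(42)
ngram_vectors = {}

def ngram_vec(gram, dim=5):
    # Look up an n-gram vector (lazily created here, purely for the sketch).
    if gram not in ngram_vectors:
        ngram_vectors[gram] = rng.normal(size=dim)
    return ngram_vectors[gram]

def word_vector(word, n=3, dim=5):
    # An unseen word's embedding is built by combining (here, averaging)
    # the vectors of its character n-grams.
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return np.mean([ngram_vec(g, dim) for g in grams], axis=0)

vec = word_vector("language")  # works even if "language" was never seen
print(vec.shape)               # (5,)
```

Because the n-gram vectors are shared across words, two unseen words with overlapping n-grams end up with related vectors.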

Building a FastText model

Since we have been working with the Gensim library so far and it has given us good results, we will continue working with Gensim. The library provides a way to build our FastText model, and we will be using the inbuilt dataset it offers.

We begin by importing the necessary libraries and datasets. On the first line, we install the gensim model. On the next line, we import the FastText model provided by Gensim. On the last line, we import the inbuilt dataset.

!pip install gensim==4.0.1
from gensim.models import FastText
from gensim.test.utils import common_texts

After importing, we print out the dataset that we will be working with. To do this, run the code below.

print(common_texts)

# Output:
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

This is the same dataset we had used in building the Doc2Vec model.

We now build a basic FastText model. We set vector_size to 5, so the dimension of our output vectors will be 5. We set the window parameter to 3; this is the sliding window that defines the neighboring words to take as context words.

The min_count parameter is set to 1, so words appearing at least once in the data will be added to the vocabulary.

model = FastText(vector_size=5, window=3, min_count=1)
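To see what min_count filters out, here is a hedged sketch of the rule (not Gensim’s actual internals): count every token, then keep only the words that appear at least min_count times.

```python
from collections import Counter

def vocab_with_min_count(corpus, min_count):
    # Count every token, then keep words seen at least min_count times.
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word for word, c in counts.items() if c >= min_count}

corpus = [['human', 'interface', 'computer'], ['graph', 'trees'], ['graph']]
print(vocab_with_min_count(corpus, 1))  # every word qualifies
print(vocab_with_min_count(corpus, 2))  # {'graph'}
```

With min_count=1, as in our model, no word is ever dropped.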

Having defined the model, we can now apply it to the data. In the code snippet below, on the first line, we apply the model to the data and build our vocabulary. On the last line, we train this model for 10 iterations on the dataset we had imported.

model.build_vocab(common_texts)
model.train(common_texts, total_examples=len(common_texts), epochs=10)

Now the model is trained on this sample dataset of ours, so it is ready to be applied to new data. But first, let us check the vocabulary that our model created. Do that by running the following code.

print(model.wv.index_to_key)

We check the output. Note that Gensim sorts the vocabulary by frequency, so the exact ordering may vary.

# Output:
['system', 'user', 'graph', 'human', 'interface', 'computer', 'survey', 'response', 'time', 'eps', 'trees', 'minors']

The output above shows the vocabulary our model created. You can observe all the unique words used in our sample dataset.

We now check the embeddings of the word “human”, which was present in our vocabulary.

model.wv['human']

# Output:
array([-0.03166138,  0.0232673 ,  0.01241681,  0.00036033,  0.02841444], dtype=float32)

As we can see in the output, the word is represented using a vector with five elements in it. These elements are of the floating-point type.

Our model has a most_similar method that allows us to find the words most similar to a given input relationship. Note that the words we are inputting here are all present in our vocabulary.

model.wv.most_similar(positive=['computer', 'interface'], negative=['human'])

We now check the output of the above code.

[('user', 0.7968782186508179),
 ('system', 0.17462214827537537),
 ('response', 0.10433417558670044),
 ('survey', 0.009605271741747856),
 ('trees', -0.0764053612947464),
 ('time', -0.13300469517707825),
 ('minors', -0.1392730176448822),
 ('eps', -0.2409365326166153),
 ('graph', -0.29175299406051636)]

We can observe that all the returned words are present in our vocabulary, and the most similar word matching the given relation is “user”.
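Under the hood, most_similar ranks vocabulary words by cosine similarity to a query vector built from the positive and negative inputs. Here is a simplified toy sketch with made-up 2-dimensional embeddings (real values come from training, and Gensim additionally unit-normalizes the vectors and averages them):

```python
import numpy as np

# Made-up 2-d embeddings, purely for illustration.
emb = {
    'computer':  np.array([1.0, 0.2]),
    'interface': np.array([0.8, 0.4]),
    'human':     np.array([0.1, 1.0]),
    'user':      np.array([1.5, -0.3]),
    'trees':     np.array([-0.5, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(positive, negative):
    # Add the positive vectors, subtract the negative ones, then rank the
    # remaining vocabulary by cosine similarity to that query vector.
    query = sum(emb[w] for w in positive) - sum(emb[w] for w in negative)
    candidates = [w for w in emb if w not in positive + negative]
    return sorted(candidates, key=lambda w: cosine(emb[w], query), reverse=True)

print(most_similar(['computer', 'interface'], ['human']))  # 'user' ranks first
```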

Now, as we had discussed in the theory part, FastText uses character n-grams to represent each embedding. We control this with the min_n and max_n parameters: min_n defines the minimum character n-gram length and max_n the maximum.

In the code snippet below, we set the character n-gram range from 1 to 5, so character n-grams in this range will be taken into consideration when generating embeddings.

model = FastText(vector_size=5, window=3, min_count=1, min_n=1, max_n=5)
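The min_n/max_n range can be sketched like this: for each length n from min_n to max_n, all character n-grams of that length are collected (again ignoring FastText’s `<`/`>` boundary markers for simplicity):

```python
def char_ngrams_range(word, min_n, max_n):
    # Collect all character n-grams for every length in [min_n, max_n].
    grams = []
    for n in range(min_n, max_n + 1):
        grams += [word[i:i + n] for i in range(len(word) - n + 1)]
    return grams

print(char_ngrams_range('gram', 1, 2))  # ['g', 'r', 'a', 'm', 'gr', 'ra', 'am']
```

A 5-letter word such as “human” yields 5 + 4 + 3 + 2 + 1 = 15 n-grams over the range 1 to 5.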

We build the vocabulary and train the model for 10 iterations on the same dataset.

model.build_vocab(common_texts)
model.train(common_texts, total_examples=len(common_texts), epochs=10)

Now comes the interesting part. Earlier we had seen that the FastText model can generate good embeddings for words that are not present in the vocabulary. So we now test the model on the word “rubber”, which was not present in the vocabulary.

model.wv['rubber']

# Output:
array([ 0.01833103, -0.02146882,  0.00600104, -0.03445043, -0.01658661], dtype=float32)

As we can see, the word we passed to the model is not present in the vocabulary we trained it on, yet the model still generates embeddings. Now, let us check how good these embeddings are by testing a relationship.

In the code snippet below, we pass a relationship to find the most similar words. The word “rubber” is not present in our vocabulary.

model.wv.most_similar(positive=['computer', 'human'], negative=['rubber'])

We now observe the output of the above code.

[('trees', 0.795038104057312),
 ('eps', 0.7793108820915222),
 ('minors', 0.24405993521213531),
 ('time', 0.16231966018676758),
 ('user', -0.04820769280195236),
 ('graph', -0.15672095119953156),
 ('survey', -0.20417729020118713),
 ('interface', -0.392148494720459),
 ('response', -0.6897363662719727),
 ('system', -0.8435081243515015)]

As we can see, the most similar word to “rubber” is “trees”, which is not bad for an unseen word. This shows that the FastText model performs well even on a small dataset like ours.

The complete code used in this article is given below for quick reference.

!pip install gensim==4.0.1
from gensim.models import FastText
from gensim.test.utils import common_texts

print(common_texts)

model = FastText(vector_size=5, window=3, min_count=1)
model.build_vocab(common_texts)
model.train(common_texts, total_examples=len(common_texts), epochs=10)

print(model.wv.index_to_key)
model.wv['human']
model.wv.most_similar(positive=['computer', 'interface'], negative=['human'])

model = FastText(vector_size=5, window=3, min_count=1, min_n=1, max_n=5)
model.build_vocab(common_texts)
model.train(common_texts, total_examples=len(common_texts), epochs=10)

model.wv['rubber']
model.wv.most_similar(positive=['computer', 'human'], negative=['rubber'])

Final thoughts

In this article, we discussed the shortcomings of other embedding models, were introduced to the FastText model and how it works, and finally saw how to build a FastText model. We trained this model on a dataset and tested it on unseen data, where it performed well.

Thanks for reading.