
Using fastText to build a Spelling Corrector

In the previous article on fastText, we saw how to build a fastText model. In this article, we will use the same fastText concepts to build a spelling corrector.

A spelling corrector

You have probably used a spelling corrector online or within an application. A spelling corrector is a very useful application of Natural Language Processing.

A spelling corrector is, at its core, software that notifies you of spelling errors in the words you type. This helps prevent mistakes in text that matters, such as an important email, where a spelling error suggests you were not focused on the task.

We are often so busy that we text or email hurriedly. We barely read what we are typing, because we just want to get the email over with and move on to other work. In such cases we are prone to making many errors, and good spelling correction software can save the day.

Hence, a lot of research and development has been going on to build better spelling correctors and algorithms.

Does spelling matter?

Let us look at a few reasons why correct spelling matters.

  • When your text is correctly spelled, it is much easier to read. Suppose someone reading a message you just sent finds a spelling error. They are now distracted by the error, which keeps them from focusing on the message you wanted to convey.
  • When there are many typing errors (typos) in your text, it makes you look less intelligent. Yes, you read that right. A reader may perceive you as less intelligent if your text has many spelling errors, because they may conclude that you don’t actually know how to spell those words.
  • Misspelled words may completely change the meaning of your text. When a text or email contains typing errors, the meaning of what you are trying to say may change. Even a small error in a sentence can change its meaning without you realizing it.
    • Take the example of the sentence “I feel good today”. If we change just one letter, the meaning changes to “I fell good today”.
    • Let’s take another sentence, “I love you too”. Again, if we change one word the whole meaning changes: “I love you two”.
  • When your text is free of spelling errors, it looks more professional. If what you type and send is error-free, it will come across as more professional and people will respect that. On the other hand, if your text is full of spelling errors, people will perceive you as less professional.

Keeping the above points in mind, we will create a spelling checker using fastText.

Building a spelling checker

The data we will be using comes from the Kaggle Jigsaw Toxic Comment Classification Challenge. You can read more about the competition and download the data at https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.

Once you have downloaded the data, we proceed by doing the necessary imports. The code block given below contains the required imports and libraries to build the spelling checker.

!pip install gensim==4.0.1
import nltk
import re
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from gensim.models import FastText
import io
import collections

After we have imported the libraries, we proceed to read the data.

In the code snippet below,

  • We first define two empty lists: one to store the individual words in our data and another to store the lines of data.
  • We then open the file, passing the relevant path on your system where your data is stored, and loop over it line by line.
  • Inside the loop, we strip each line, i.e. remove any surrounding whitespace, and append it to the data list.
  • On the last line, we split each line into words and extend the words list with them.

Note that append differs from extend in that append adds a single object to the end of the list, whereas extend iterates over an iterable and adds each of its elements separately. A short example after the file-reading code below illustrates the difference.

words = []
data = []
with io.open('/content/drive/MyDrive/Datasets/comments.txt', 'r') as file:
    for entry in file:
        entry = entry.strip()
        data.append(entry)
        words.extend(entry.split())
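As a quick illustration of the append/extend distinction (the two-line input here is made up for the example, not taken from the dataset):

lines = []
tokens = []
for entry in ["hello world", "good morning"]:
    lines.append(entry)            # append adds the whole string as one element
    tokens.extend(entry.split())   # extend adds each word as a separate element

print(lines)   # ['hello world', 'good morning']
print(tokens)  # ['hello', 'world', 'good', 'morning']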

We now check the data.

data[:10]

# Output:
['"Explanation',
 'Why the edits made under my username Hardcore Metallica Fan were reverted? They weren\'t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don\'t remove the template from the talk page since I\'m retired now.89.205.38.27"',
 "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
...

Now, we check the first ten elements of the words list.

words[:10]

# Output:
['"Explanation',
 'Why',
 'the',
 'edits',
 'made',
 'under',
 'my',
 'username',
 'Hardcore',
 'Metallica']

Now that we have checked the data, we look at the most common terms in our dataset. We do this using the code given below, where Counter from collections counts how often each term occurs in the data.

On the last line, we print the ten most common terms.

unique_words = collections.Counter(words)
unique_words.most_common(10)

# Output
[('the', 445892),
 ('to', 288753),
 ('of', 219279),
 ('and', 207335),
 ('a', 201765),
 ('I', 182618),
 ('is', 164602),
 ('you', 157025),
 ('that', 140495),
 ('in', 130244)]

As you can see, most of the words are stopwords, i.e. words that do not contribute much to the meaning of a sentence. We will have to clean our data to filter out these stopwords.

We now have to preprocess the data for our model to be effective. To do this, we define two functions: one to remove punctuation and generally clean the text, and another to remove stopwords.

Removing punctuation and general cleaning of the text.

The function takes the corpus as an argument. We initialize an empty list for the cleaned corpus and start a for loop over the corpus; within it, we initialize another list and start a second for loop over the words in the current row.

So basically, we are looping through every row in our corpus, and then through every word in the row. We use a regular expression to replace anything other than alphanumeric characters with whitespace, and we convert the word to lowercase.

Finally, we append the cleaned words to the inner list and join them back into a sentence, which we add to the cleaned corpus.

def text_clean(corpus):
    cleaned_corpus = []
    for row in corpus:
        qs = []
        for word in row.split():
            p1 = re.sub(pattern='[^a-zA-Z0-9]',repl=' ',string=word)
            p1 = p1.lower()
            qs.append(p1)
        cleaned_corpus.append(' '.join(qs))
    return cleaned_corpus

Preprocessing to remove stopwords.

Take a look at the code snippet below. The function takes one argument, the corpus we want to clean. We first define a list of wh-words; these may carry important information, so we don’t want to remove them. We then load the stopwords provided by NLTK.

We generate a set of these English-language stopwords and assign it to stop. We then loop over the wh_words and remove each of them from the set of stopwords. Finally, we use a list comprehension that iterates through each token in each row of the corpus and filters out the stopwords.

def stopwords_removal(corpus):
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.remove(word)
    corpus = [[x for x in x.split() if x not in stop] for x in corpus]
    return corpus

Now that we have defined our preprocessing functions, we apply them to the data. We first apply the text_clean function, then remove the stopwords, and finally join the tokens in each row back together into sentences.

corpus = text_clean(data)
cleaned_corpus = stopwords_removal(corpus)
final_cleaned_corpus = [' '.join(x) for x in cleaned_corpus]

We now get a glimpse of this preprocessed dataset.

final_cleaned_corpus[:5]

# Output:
['explanation',
 'why edits made username hardcore metallica fan reverted vandalisms closure gas voted new york dolls fac please remove template talk page since retired 89 205 38 27',
 'aww matches background colour seemingly stuck thanks talk 21 51 january 11 2016 utc',
 'hey man really trying edit war guy constantly removing relevant information talking edits instead talk page seems care formatting actual info',
 '']

Now that we have preprocessed and cleaned our dataset, we have one final thing to do before we feed the data to the model: we have to put our dataset into the format the fastText model expects.

In the code snippet below, we first define an empty list. We then loop over our data, keep only the non-empty lines, tokenize them by splitting on whitespace, and add the resulting token lists to the list we initialized.

preprocessed_data = []
for line in final_cleaned_corpus:
    if line != "":
        preprocessed_data.append(line.split())

After we have tokenized each sentence in our data, we are ready to build the model.

We build the model by instantiating the FastText class and passing in values for its parameters. We set vector_size to 300, so each word vector will have 300 dimensions. We set window to 3; this parameter defines the sliding window used to capture context words.

The min_count parameter sets how many times a word must appear to be added to the vocabulary; here it is 1, so every word is kept. The min_n and max_n parameters define the character n-gram range that is essential to the fastText model; a short sketch of what these n-grams look like follows the model definition below. If you don’t know or remember these terms, I suggest you read my previous article on fastText.

Remember, you can play around with these parameters and choose the values that best suit your needs; they are hyperparameters of the model.

model = FastText(vector_size=300, window=3, min_count=1, min_n=1, max_n=5)
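To get an intuition for what min_n and max_n control, here is a rough sketch of how fastText breaks a word into character n-grams. The char_ngrams function below is only an illustration written for this article, not gensim’s internal implementation; fastText pads each word with the boundary markers < and > before extracting n-grams.

def char_ngrams(word, min_n=1, max_n=5):
    # Pad the word with boundary markers, as fastText does internally.
    padded = '<' + word + '>'
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams('feel', min_n=3, max_n=4))
# ['<fe', 'fee', 'eel', 'el>', '<fee', 'feel', 'eel>']

It is these subword n-grams that later allow the model to build a vector for a misspelled word that never appeared in the training vocabulary.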

Next, we build the vocabulary from our dataset. Do this by running the following line of code.

model.build_vocab(preprocessed_data)

After we build the vocabulary, we check its length, i.e. the number of unique words in our vocabulary.

len(model.wv.index_to_key)

# Output:
182228

We see that there are a lot of words in our vocabulary.

Now, we train the model on this dataset. You can use the following code to do this. The train method takes three arguments here: the dataset we cleaned and processed, total_examples, which is the number of examples (sentences) in that dataset, and the epochs parameter.

Here we set epochs to 10, but this is a hyperparameter and you can experiment with different values. Train the model using the code below.

model.train(preprocessed_data, total_examples=len(preprocessed_data), epochs=10)

The training may take some time to complete.
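Since training takes a while, you may want to save the trained model so you do not have to retrain it in every session. Below is a minimal sketch using gensim’s standard save and load methods; the file name is just a placeholder.

# Save the trained model to disk (the file name here is a placeholder).
model.save('fasttext_spell_model.model')

# In a later session, load it back without retraining
# (FastText is already imported from gensim.models above).
model = FastText.load('fasttext_spell_model.model')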

Once the training is done, we can enter misspelled words and the model will try to suggest the correct ones.

We take some examples of misspelled words. First, we take the word “prspctive”. We check what the model outputs.

model.wv.most_similar('prspctive', topn=5)

# Output:
[('prospective', 0.9340896010398865),
 ('proactive', 0.9336438775062561),
 ('persepctive', 0.9228048324584961),
 ('predictive', 0.9170488119125366),
 ('pewrspective', 0.9166929721832275)]

We see that the model’s top suggestions are all close to the intended word. Let’s take another example, the word “comuter”, and check what the model outputs.

model.wv.most_similar('comuter', topn=5)

# Output:
[('commuter', 0.9519362449645996),
 ('computer', 0.9005268812179565),
 ('comptuer', 0.8920755982398987),
 ('computerjoe', 0.8904815912246704),
 ('couter', 0.8837856650352478)]

We see that the model again outputs reasonable suggestions, with “commuter” and “computer” as the top candidates.
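To wrap the lookup into something closer to a usable corrector, we can put it behind a small helper. This is only an illustrative sketch: correct_spelling is a hypothetical helper written for this article (not part of gensim), and it simply keeps vocabulary words whose similarity score clears a threshold.

def correct_spelling(model, word, topn=5, min_score=0.9):
    # If the word is already in the vocabulary, assume it is spelled correctly.
    if word in model.wv.key_to_index:
        return [word]
    # Otherwise, return the closest vocabulary words above the score threshold.
    suggestions = model.wv.most_similar(word, topn=topn)
    return [w for w, score in suggestions if score >= min_score]

print(correct_spelling(model, 'prspctive'))

You can tune topn and min_score, or filter the candidates further (for example by edit distance), depending on how strict you want the suggestions to be.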

The complete code used in this article is given below for quick reference.

!pip install gensim==4.0.1
import nltk
import re
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from gensim.models import FastText
import io
import collections

words = []
data = []
with io.open('/content/drive/MyDrive/Datasets/comments.txt', 'r') as file:
    for entry in file:
        entry = entry.strip()
        data.append(entry)
        words.extend(entry.split())

data[:10]

words[:10]

unique_words = collections.Counter(words)
unique_words.most_common(10)

def text_clean(corpus):
    cleaned_corpus = []
    for row in corpus:
        qs = []
        for word in row.split():
            p1 = re.sub(pattern='[^a-zA-Z0-9]',repl=' ',string=word)
            p1 = p1.lower()
            qs.append(p1)
        cleaned_corpus.append(' '.join(qs))
    return cleaned_corpus

def stopwords_removal(corpus):
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.remove(word)
    corpus = [[x for x in x.split() if x not in stop] for x in corpus]
    return corpus

corpus = text_clean(data)
cleaned_corpus = stopwords_removal(corpus)
final_cleaned_corpus = [' '.join(x) for x in cleaned_corpus]

final_cleaned_corpus[:5]

preprocessed_data = []
for line in final_cleaned_corpus:
    if line != "":
        preprocessed_data.append(line.split())

model = FastText(vector_size=300, window=3, min_count=1, min_n=1, max_n=5)
model.build_vocab(preprocessed_data)
len(model.wv.index_to_key)

model.train(preprocessed_data, total_examples=len(preprocessed_data), epochs=10)
model.wv.most_similar('prspctive', topn=5)
model.wv.most_similar('comuter', topn=5)

Final Thoughts

In this article, we saw what a spelling checker does and why correct spelling matters, and we then built a spelling checker using a fastText model.

Thanks for reading.