The Bag-of-Words or BoW approach is a fundamental topic in Natural Language Processing. It is a way to represent text as numbers.
In the introductory section of this NLP series, we discussed that computers are not designed to understand human languages naturally. For a computer to work conveniently and seamlessly with text data, we have to convert our text to numbers.
Once our text is converted into numbers, computers can work with it as well as apply complex mathematical computations to that data.
So far in this NLP series, we have also seen preprocessing steps such as tokenization, stemming, lemmatization, stopword removal, POS tagging, and NER. These steps help us prepare our text data for further processing. The next step is to convert our text into numbers. How do we do that?
For machine learning algorithms to understand our text, we have to build mathematical representations of it. One way to do this is the Bag-of-Words approach. Before diving into BoW, let's revisit two important mathematical concepts.
Primer on Vectors and Matrices
You are probably familiar with vectors and matrices from high-school mathematics. They are the mathematical data structures we mostly use to represent text data in NLP: after converting our text into a mathematical format, it is passed to the computer in the form of vectors and matrices.
A vector is a one-dimensional array of numbers in which each number is identified by its index. Vectors are typically written as a column enclosed in square brackets.
An example of a vector is

x = [x1, x2, x3]
In the above example, the vector x is a single-dimensional array with 3 elements x1, x2, and x3. This is similar to a single-dimensional NumPy array. When we convert a text to its mathematical form we say we have vectorized the text data.
Matrices are extensions of arrays. A matrix is nothing but a rectangular array of numbers wherein each number is identified by two indices. Matrices are also written using square brackets like vectors, but one major difference is that matrices contain rows and columns while vectors don't.
An example of a matrix is given as

A = [ a11  a12
      a21  a22
      a31  a32 ]
In the above example, matrix A consists of 3 rows and 2 columns.
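As a quick illustration, here is how a vector and a matrix can be created as NumPy arrays (the values are arbitrary and only for demonstration):

import numpy as np

# a vector: a one-dimensional array with 3 elements
x = np.array([2, 5, 7])
print(x.shape)   # (3,)

# a matrix: a two-dimensional array with 3 rows and 2 columns
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(A.shape)   # (3, 2)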
Okay, now that we have revisited the core concepts of vectors and matrices, the question is: how are they useful in NLP? Think of the corpus as one big matrix. A corpus consists of different documents, so each row of this matrix corresponds to a document vector. This concept will become clearer when we work through it practically.
How does the BoW approach work?
A simple intuition for BoW is to think of the approach as just taking the frequency or count of each word in a document. The BoW model takes a document, finds the frequency of each word in that document, and creates a new representation based on those counts.
A very important term to take note of here is vocabulary. In NLP, the vocabulary is the set of unique words in the corpus, i.e. every word in the corpus listed only once.
Let's take an example to understand how the BoW approach works. Suppose we have a text document that is part of a corpus containing other documents as well, and our document consists of many sentences. We create a vocabulary (the set of unique words) from our document, and each sentence is then represented as a vector whose elements are the frequencies of the vocabulary words in that sentence.
According to the BoW approach, the length of each vector is equal to the number of unique words in our document, and its elements are nothing but the counts of those words in the text.
So in this way, a text document is converted into a mathematical form i.e. to a vector by taking the counts of each word in the document. Doing this practically will solidify your understanding.
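Before we move on to scikit-learn, here is a minimal hand-rolled sketch of the idea. The two toy sentences below are made up purely for illustration:

# toy "document" made up of two sentences
sentences = ["the cat sat on the mat", "the dog sat on the log"]

# the vocabulary is the set of unique words, kept in a fixed (sorted) order
vocabulary = sorted(set(word for sentence in sentences for word in sentence.split()))
print(vocabulary)   # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

# each sentence becomes a vector of word counts over the vocabulary
for sentence in sentences:
    vector = [sentence.split().count(word) for word in vocabulary]
    print(vector)
# [1, 0, 0, 1, 1, 1, 2]
# [0, 1, 1, 0, 1, 1, 2]

Each position in these vectors corresponds to one vocabulary word, and the value is how many times that word occurs in the sentence.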
BoW using CountVectorizer from SKlearn
CountVectorizer is a useful tool provided by the scikit-learn or Sklearn library in Python. It helps us implement the BoW approach seamlessly.
First, we import the necessary libraries and packages, as shown in the code snippet below. On the last line, we import CountVectorizer.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
After importing the required libraries, we define a sample text to work with. I have chosen text from the Wikipedia page of Tesla, Inc. and assigned it to a variable called text.
text = (''' Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California.
Tesla's current products include electric cars, battery energy storage from
home to grid-scale, solar panels and solar roof tiles, as well as other related products and services.
''')
We now convert the text to a pandas Series object so that it is easier to work with. After that, we initialize the CountVectorizer we imported earlier. On the last line, we call the vectorizer's fit_transform method on our text data, which converts the text into its mathematical form.
corpus = pd.Series(text)
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
A quick check of the variable bow_matrix tells us that it is a sparse matrix whose elements are NumPy integers (numpy.int64). A sparse matrix is a matrix in which the majority of values are zero, so only the non-zero entries are stored explicitly. Since our corpus contains just one document, this matrix has a single row, i.e. a single document vector.
bow_matrix
Output:
<1x36 sparse matrix of type '<class 'numpy.int64'>' with 36 stored elements in Compressed Sparse Row format>
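As an aside, here is a small made-up illustration of what "sparse" means in practice, using SciPy's csr_matrix: only the non-zero entries are actually stored.

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3, 0, 0, 1]])   # a mostly-zero row
sparse = csr_matrix(dense)
print(sparse.nnz)        # 2 -- only the two non-zero values are stored
print(sparse.toarray())  # [[0 0 3 0 0 1]]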
Our vocabulary for the document is ready. We check the unique (non-repeated) words in our document by running the following code.
feature_names = vectorizer.get_feature_names()  # in scikit-learn 1.0+, use get_feature_names_out()
feature_names
The feature names are as follows.
Output:
['alto',
'american',
'an',
'and',
'as',
'based',
'battery',
'california',
'cars',
'clean',
'company',
...]
By observing the output you can see that these feature names are the tokens from our sample text and none of them are repeated. This is our vocabulary. What’s the size of our vocabulary?
len(feature_names)
Output:
36
The size of our vocabulary is 36. This means there are 36 unique words in our document.
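If you also want to know which column of the vector each word maps to, CountVectorizer exposes a vocabulary_ attribute, a dictionary that maps each word to its column index:

vectorizer.vocabulary_
# a dictionary along the lines of {'tesla': 31, 'inc': 18, 'is': 20, 'american': 1, ...}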
We now look at the mathematical representation of our text. To do this, we first convert the sparse matrix to a dense NumPy array and then print it out.
feature_array = bow_matrix.toarray()  # convert the sparse matrix to a dense NumPy array
feature_array
Output:
array([[1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1]])
The output above shows the counts of the 36 words in our vocabulary, in the same order as the feature names listed earlier. This is how our text is converted into a mathematical form that computers can work with.
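To make the vector easier to read, we can pair each feature name with its count (a small addition to the walkthrough above):

# pair each vocabulary word with its count in the document
dict(zip(feature_names, feature_array[0].tolist()))
# e.g. {'alto': 1, 'american': 1, 'an': 1, 'and': 3, 'as': 2, ...}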
We print the shape of the bow_matrix just to confirm that there are 36 elements in our vector.
print(bow_matrix.toarray().shape)
Output:
(1, 36)
One important thing to note about the example we have just gone through is that we converted the text directly into its mathematical representation using the BoW approach, without any prior preprocessing. We can also apply some preprocessing steps before vectorizing the text to reduce the size of our vocabulary, as shown in the sketch below.
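For example, a minimal way to shrink the vocabulary is to let CountVectorizer drop common English stopwords; this is just one option, and we could equally apply our own stemming or lemmatization before vectorizing.

# removing English stopwords before counting reduces the vocabulary size
vectorizer_sw = CountVectorizer(stop_words='english')
bow_sw = vectorizer_sw.fit_transform(corpus)
print(bow_sw.shape)   # fewer than 36 columns, since words like 'an', 'and', 'as' are dropped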
Limitations of the BoW method
- Since the model takes into account the counts of terms in the document, it may work well for documents with a limited vocabulary but won’t perform well for large documents.
- Some words occur rarely in a document but carry significant meaning. The BoW model will not be able to recognize such important words because their frequency is low.
- As the size of the corpus increases, the size of the vocabulary also increases, and this gives rise to the problem of the "curse of dimensionality".
Final Thoughts
In this article, we visited one of the building blocks of the NLP pipeline, i.e. vectorization using the Bag-of-Words model. We saw how this approach works and implemented it practically using the scikit-learn package. Thanks for reading.