
The Naive Bayes algorithm for NLP

In the previous article on Machine Learning, we discussed that the two ML algorithms most commonly used in Natural Language Processing are Naive Bayes and SVM. In this article, we will focus on the Naive Bayes algorithm and its implementation.

Intro to the Bayes theorem

The Naive Bayes algorithm is based on the Bayes theorem. So it is essential that we first get a good understanding of the Bayes theorem as it will help us to know how the Naive Bayes algorithm actually works.

The Bayes theorem is a mathematical formula used for calculating conditional probabilities. A probability, as you know, is simply the chance of an event happening, for example, the probability of rolling two dice and getting two fives.

Conditional probability, on the other hand, is the probability of an event given that another event has already occurred.

For example, if I want to find the probability that it will rain today, I'm finding an ordinary probability. But if I want to find the probability that it will rain today given that it's a little too humid, I'm finding a conditional probability.

P(A|B) = P(B|A) P(A) / P(B)

The above equation is the Bayes theorem. Let us break it down and look at its different parts.

Suppose we have two events A and B. We want to find the probability of event A occurring given that event B has already occurred. On the L.H.S or left-hand side of the equation, we have the conditional probability of event A occurring given that event B has already occurred.

This conditional probability on the L.H.S is also called the posterior probability. Coming to the R.H.S or right-hand side of the equation we have some terms in the numerator and denominator. The numerator itself consists of two terms.

The first term of the numerator is the conditional probability of event B occurring given event A has already occurred. The second term in the numerator is the probability of only event A occurring. This is called the prior probability. The term in the denominator is the probability of event B occurring.

An example to understand the Bayes theorem

Let us try to apply the formula discussed to a situation that would help us clearly understand the Bayes theorem.

We feel that the temperature is dropping and the humidity in the air is increasing, so is it going to rain? Let's use the Bayes theorem to find out. Here event A is that it rains and event B is that the temperature and humidity vary. Ideally, the temperature should be decreasing and the humidity should be increasing for a higher chance of rain.

P(Rain | Temperature and humidity vary) = P(Temperature and humidity vary | Rain) P(Rain) / P(Temperature and humidity vary)

We substitute these events into the Bayes theorem formula, as shown above. We want to know the probability that it will rain given that the temperature and humidity vary; this is the L.H.S of the equation.

On the R.H.S, in the numerator, we first have the probability that temperature and humidity vary given that it is raining. So this is the probability that temperature and humidity change whenever it rains and we can get this data from a meteorological department.

The second term in the numerator is the probability that it would rain. Note that this is not a conditional probability, we are just taking the probability that it would rain on any given day. In the denominator, we have the probability that the temperature and humidity vary on any given day.

When we substitute these values into the equation, we get the probability that it will rain given that the temperature and humidity vary. Since we couldn't measure this probability directly, the Bayes theorem lets us compute it from quantities we can measure.
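As a quick sanity check, here is the same calculation in Python with made-up numbers; the probabilities below are purely illustrative and not real meteorological data.

# Illustrative (made-up) probabilities, not real meteorological data
p_rain = 0.20              # P(A): it rains on any given day
p_vary_given_rain = 0.85   # P(B|A): temperature and humidity vary when it rains
p_vary = 0.30              # P(B): temperature and humidity vary on any given day

# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_rain_given_vary = p_vary_given_rain * p_rain / p_vary
print(round(p_rain_given_vary, 3))  # 0.567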

We can say that the Bayes theorem, and hence the Naive Bayes algorithm, lets us compute the posterior probability from prior probabilities, and this is exactly why the Naive Bayes algorithm does such a good job on classification problems. Hence, the Naive Bayes algorithm is used as a classifier.

The Naive Bayes algorithm

Now that we have seen what the Bayes theorem is and understood it with an example, we can focus on the Naive Bayes algorithm, which is a popular classification algorithm.

As we have seen, the Naive Bayes algorithm is based on the Bayes theorem. It is used mainly for classification tasks. In classification, we train the model on a labeled dataset, i.e. we show the model which class each example belongs to, and the model then learns to classify, or label, new unseen data.

Just as we saw with the Bayes theorem, the Naive Bayes algorithm classifies a new data point by computing its posterior probability using the Bayes theorem. To classify new data, the algorithm learns the prior and other probabilities from the training data and then applies the Bayes theorem to label the unseen data.
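To make the idea concrete, here is a minimal sketch of how such a classifier could score a document: multiply the class prior by the likelihood of each word under that class (treating the words as independent, which is the "naive" assumption discussed in the next section) and pick the class with the highest score. The priors, likelihoods, and words below are invented purely for illustration.

# Minimal sketch with invented numbers; real priors and likelihoods
# would be estimated from the training data.
priors = {"positive": 0.5, "negative": 0.5}
likelihoods = {
    "positive": {"great": 0.30, "poor": 0.02},
    "negative": {"great": 0.03, "poor": 0.25},
}

def classify(words):
    scores = {}
    for label in priors:
        score = priors[label]
        for word in words:
            # Small fallback probability for words not seen during training
            score *= likelihoods[label].get(word, 1e-6)
        scores[label] = score
    # The denominator P(B) is the same for every class, so we can ignore it
    return max(scores, key=scores.get)

print(classify(["great"]))  # positive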

What is so “naive” about Naive Bayes?

You may be wondering: the algorithm is based on the Bayes theorem, which explains the word "Bayes" in its name, but why is the word "naive" there? To answer this question we first have to look at how joint probabilities are calculated.

A joint probability is nothing but the probability of two events occurring simultaneously. As an example, take the probability of getting two fives when two dice are rolled. Here comes the interesting part. There are two formulas for calculating the joint probability.

The formula to calculate joint probabilities when

  • The two variables are independent is P(x, y) = P(x) P(y)
  • The two variables are dependent is P(x, y) = P(x|y) P(y)

So, the Naive Bayes model treats all its features or variables as independent of each other. It uses the first formula given above to calculate the joint probabilities that are required to solve the Bayes theorem. If you think about it, this assumption is actually pretty naive, because it is unlikely that all the features in a dataset are totally independent of one another.

Take the price of a house as an example. We know that location, vicinity, floor area, etc. all have an impact on the house price and are correlated with one another, yet the Naive Bayes model treats all features as independent, and hence it is called "naive". Nonetheless, it is an excellent classifier.
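As a tiny numerical illustration of the difference between the two formulas, here are both computed in Python with made-up probabilities.

# Made-up probabilities, for illustration only
p_x = 0.4           # P(x)
p_y = 0.5           # P(y)
p_x_given_y = 0.7   # P(x|y)

# If x and y are independent: P(x, y) = P(x) P(y)
joint_independent = p_x * p_y          # 0.20

# If x depends on y: P(x, y) = P(x|y) P(y)
joint_dependent = p_x_given_y * p_y    # 0.35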

Using the Naive Bayes algorithm to perform sentiment analysis

Sentiment analysis is the task of finding the polarity of a document, i.e. judging whether its tone is positive, negative, or neutral. Sentiment analysis is also called opinion mining or polarity detection.

For example, if we have a business and are interested in knowing how customers feel about a certain product or campaign, we can access the related Twitter data and with the help of sentiment analysis tools, we can see if a majority of the tweets are positive, negative or neutral.

There is no doubt that sentiment analysis is gaining popularity in the industry as it allows organizations to mine the opinions of a large group of users or potential customers in a cost-efficient way, just as we had discussed above.

Sentiment analysis is used in
  • Advertisement campaigns
  • Political campaigns
  • Stock analysis and more

The dataset we will be using to build a sentiment analyzer can be found at http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences. Once on the webpage, click the download data folder and download the zip file. After the download is complete, extract the files and you are good to go.

We begin with the imports.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

If you are on Colab, you can mount your Google Drive using the code given below.

from google.colab import drive
drive.mount('/content/drive')

Set the path to where you have stored the file on your system.

dataset_path = "/content/drive/MyDrive/Datasets/amazon_cells_labelled.txt"

We now read the dataset using the pandas read_csv function. This function reads delimited text data efficiently into a dataframe. Since the file is tab-separated, we use the tab as a separator, and because there is no header row, we do not set a header.

data = pd.read_csv(dataset_path, sep="\t", header=None)

After we have read the dataset, let’s check the first 10 rows using the head method.

data.head(10)

# Output: first 10 rows of the dataframe (review text and its 0/1 sentiment label)

As you can see, this is how the data looks. Note that each review has a label 0 or 1. 0 means it’s a negative sentiment and 1 means it’s a positive sentiment.
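Before moving on, it can also be helpful to check how balanced the two classes are. This step is optional; since we read the file without a header, the labels simply sit in the last column.

# Count how many reviews carry each sentiment label (0 or 1)
data.iloc[:,-1].value_counts()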

We want to separate the text and the labels so that we can work with the text easily and get it ready for modeling. Use the code below to extract the text.

X = data.iloc[:,0]
X

# Output:
0      So there is no way for me to plug it in here i...
1                            Good case, Excellent value.
2                                 Great for the jawbone.
3      Tied to charger for conversations lasting more...
4                                      The mic is great.
                             ... 

Use the code below to extract the sentiments.

y = data.iloc[:,-1]
y

# Output:
0      0
1      1
2      1
3      0
4      1
      ..

Now that we have segregated text and sentiments, we will prepare the text for further tasks.

We will vectorize the text using the count vectorizer. We also filter the stopwords in this step. We then fit this vectorizer onto our text data.

vectorizer = CountVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X)

After running the above step, we get a sparse matrix, which is the result of applying the vectorizer to our dataset. We now convert it into a dense array and take a look at it.

X_vec = X_vec.toarray()
X_vec

# Output:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
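If you want to see what the columns of this matrix correspond to, you can peek at the vocabulary the vectorizer learned. This step is optional; note that get_feature_names_out is the method name in recent scikit-learn versions (older versions expose get_feature_names instead).

# A few of the tokens the vectorizer learned, and the vocabulary size
print(vectorizer.get_feature_names_out()[:10])
print(len(vectorizer.vocabulary_))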

We now apply the TF-IDF transformation to the data. Note that we are applying it to the Bag of Words representation we created in the previous step; this gives us a better-weighted representation of the data.

tfidf = TfidfTransformer() # by default applies "l2" normalization
X_tfidf = tfidf.fit_transform(X_vec)
X_tfidf = X_tfidf.toarray()
X_tfidf

# Output:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
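Optionally, we can also look at the inverse document frequency weights the transformer learned; higher values correspond to rarer terms. The idf_ attribute is what scikit-learn's TfidfTransformer exposes for this.

# Learned IDF weight for each term in the vocabulary (higher = rarer)
print(tfidf.idf_[:10])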

Now, we need to split the dataset into train and test sets. We do this because we want to evaluate the performance of our trained model. Remember that we first train the model, i.e. let it learn on the training set, and then we test it on an unseen dataset, i.e. the test set. This train/test split is a simple form of validation known as the hold-out method.

To split the data we use the train_test_split function. We set test_size to 0.25, meaning that 75% of the data will be used for training and 25% for testing.

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, 
                                                    test_size = 0.25, 
                                                    random_state = 0)

We now instantiate the Naive Bayes model and fit the model on the training dataset. This is where our model learns.

clf = MultinomialNB()
clf.fit(X_train, y_train)

After the model is trained, we use it to predict the sentiments of data that is unseen by the model i.e. the test data. We save these predicted values in y_pred.

y_pred = clf.predict(X_test)

We can judge the performance of our model by computing the confusion matrix. This matrix is an indicator of how well the model performed on this task of sentiment analysis. Basically, we are comparing the predicted and actual or true values.

confusion_matrix(y_test, y_pred)

# Output:
array([[ 87,  33],
       [ 20, 110]])

For scikit-learn's confusion matrix,

  • The rows (vertical axis) represent the actual values
  • The columns (horizontal axis) represent the predicted values
  • The total number of correct predictions is obtained by summing the main diagonal.

So, our model predicted 107 (87 + 20) values as having a negative sentiment (0) out of which 87 were correctly predicted and 20 were incorrectly predicted. Our model also predicted 143 (33 + 110) values as having a positive sentiment (1) out of which 110 were correctly predicted and 33 were incorrectly predicted.

The accuracy of our model is 78.8% ((87 + 110) / 250, i.e. the sum of the main diagonal divided by the sum of all values in the matrix).
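Instead of summing the matrix entries by hand, we could also let scikit-learn compute the accuracy for us with accuracy_score.

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

# Output:
0.788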

The complete code used in this article is given below for quick reference.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

from google.colab import drive
drive.mount('/content/drive')

dataset_path = "/content/drive/MyDrive/Datasets/amazon_cells_labelled.txt"
data = pd.read_csv(dataset_path, sep="\t", header=None)
data.head(10)

X = data.iloc[:,0]
X

y = data.iloc[:,-1]
y

vectorizer = CountVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X)
X_vec = X_vec.toarray()
X_vec

tfidf = TfidfTransformer() # by default applies "l2" normalization
X_tfidf = tfidf.fit_transform(X_vec)
X_tfidf = X_tfidf.toarray()
X_tfidf

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, 
                                                    test_size = 0.25, 
                                                    random_state = 0)

clf = MultinomialNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

confusion_matrix(y_test, y_pred)

Final thoughts

In this article, we were first introduced to the Bayes theorem, then to the Naive Bayes model and finally, we built a sentiment analysis tool with the help of the Naive Bayes algorithm which gave us a pretty decent accuracy score.
Thanks for reading.