
Convolutional Neural Networks (CNNs) for NLP

In the previous articles, we saw how deep learning, and specifically an ANN, can be used for NLP. Now we move on to another deep learning architecture, the Convolutional Neural Network (CNN), and see how we can use it to solve NLP problems.

Convolutional Neural Networks

You have probably heard of CNNs, most likely in the context of computer vision and image recognition. CNNs perform amazingly well on those tasks, but we can apply them to NLP tasks as well.

Convolutional Neural Networks are a more advanced form of the neural networks we have seen so far. As the name suggests, these networks rely heavily on a mathematical operation called convolution. A CNN tries to capture the spatial relationships in the data.

Convolutional Neural Networks work well with images because images are made up of pixels, and the spatial relationships between neighboring pixels can easily be exploited. This helps the model make sense of the image. We will try to apply this concept to text and see how it works out. But first, let's understand the convolution operation.

The convolution operation

The convolution operation is the backbone of a CNN; the whole network is built around it. So let's understand what this operation is and how it is performed.

The image below shows the complete convolutional operation. We’ll look at it step by step to understand how it works.

[Image: the convolution operation]

We can think of the source pixels as part of an image we are working with. As you can see, the image is made up of a number of pixels. We focus on the green shaded 3×3 area in the top-left corner, whose center pixel has the value 6.

The 3×3 matrix labeled convolution filter is the filter we are applying to the image. This filter is also called a kernel; the kernel used here is the Sobel Gx kernel. You can see the values the kernel contains, and the convolution operation is written out at the top of the image.

The kernel is applied by taking the element-wise product of the values in the green shaded region and the corresponding kernel values, and then summing the results. In other words, we multiply the filter and the source pixel values element by element and add everything up. This is the convolution operation.

The first values in the filter and in the green shaded region are -1 and 3 respectively, so multiplying them gives -3. We then multiply the next pair of elements, and so on.

After all the element-wise multiplications, we sum the values. As we see, the destination pixel gets a final value of -3. This is the result of one convolution step. The 3×3 window is then shifted one pixel to the right and the computation is repeated.

This is basically how the convolution operation in a CNN takes place.
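
To make this concrete, here is a minimal NumPy sketch (not part of the original implementation) of a valid, stride-1 convolution using the Sobel Gx kernel described above; the 5×5 image values are arbitrary.

import numpy as np

# Sobel Gx kernel, as in the figure above
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

def convolve2d(image, kernel):
    # Slide the kernel over the image; at each position, multiply
    # element-wise with the covered patch and sum the products.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]    # 3x3 block of source pixels
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

image = np.arange(25).reshape(5, 5)              # arbitrary 5x5 "image"
print(convolve2d(image, kernel))                 # 3x3 output: it shrinks without padding

(Strictly speaking, this sketch computes cross-correlation, i.e. the kernel is not flipped, which is what deep learning frameworks implement under the name convolution.)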

Why do we need padding?

In the convolution operation above, the filter shifts one pixel to the right after each step. A problem appears as we approach the rightmost column of the image: the window can only be placed where it fits entirely inside the image, so the border pixels are covered by fewer filter positions than the interior pixels.

To elaborate, the leftmost column of the kernel above is [-1, -2, -1], and it is never applied to the rightmost column of the image, because the window stops as soon as its right edge reaches the image border. The border pixels are therefore under-sampled, and the output ends up smaller than the input. This is the reason why we need padding.

Padding basically means adding extra values, usually zeros, around the border of the image so that the kernel can be applied to every row and column. When we pad with zeros, we call it zero-padding. With padding, the convolution can be computed even for the outermost columns and values.

As you can see in the image below, we have zero-padded the image. Only because of the zero-padding can we compute a value for the corner pixel, as seen in the image.

[Image: zero-padding]

The benefit of zero-padding is that no pixel is under-sampled compared to the others: it enables the filter to find patterns everywhere in the image, irrespective of their position.

What are strides in a CNN?

A stride is a parameter we define for the model, and hence it is a hyperparameter. The stride defines by how many pixels the kernel is shifted at each step, so a stride of one means we move the kernel just one pixel at a time.

When the stride is small, successive windows overlap more and we capture more information from the data; when the stride is large, the output is downsampled more aggressively and we capture less.
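
As a rough sketch (not from the original article), the output size along one spatial dimension follows the standard formula out = floor((n + 2p - k) / s) + 1 for input size n, kernel size k, padding p on each side, and stride s:

def conv_output_size(n, k, p=0, s=1):
    # floor((n + 2*p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

print(conv_output_size(5, 3, p=0, s=1))   # 3: without padding the output shrinks
print(conv_output_size(5, 3, p=1, s=1))   # 5: zero-padding by 1 keeps the size ("same" padding)
print(conv_output_size(5, 3, p=1, s=2))   # 3: a larger stride downsamples the output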

Pooling in CNNs

So far, we have seen how the convolution operation is performed on the data and what its result looks like. After convolution, we usually apply pooling layers in a CNN.

Pooling is mainly applied to reduce the computation required while retaining as much information as possible. The pooling operation can be thought of as downsampling: we reduce the size of the data while keeping only the most relevant information.

Pooling helps prevent overfitting. Global pooling, in particular, produces a fixed-size output irrespective of the input or filter size, which is very helpful if the size of our input data is not consistent or if we apply kernels of different sizes.

Max pooling

This is the most common pooling technique. In max pooling, the maximum value within the pooling window is kept. Max pooling is shown in the image below.

[Image: max pooling]

Here, we use a stride of 2. Each orange box marks one position of the pooling window, and we keep the maximum value inside it: in the orange window, 20 is the maximum value, so it is added to the pooling output.

Average pooling

In average pooling, we take the average of all the values inside the pooling window. Average pooling is shown in the image below.

[Image: average pooling]

Here, when the orange window is applied, we take the average of all the values inside it and add that average to the final pooling output.

Sum pooling

In sum pooling, we add up all the values inside the pooling window and append that sum to the final pooling output. This is the least used of the three pooling methods.
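
To make the three variants concrete, here is a small NumPy sketch (illustrative only, with arbitrary values) that applies a 2×2 pooling window with a stride of 2, as in the figures above:

import numpy as np

x = np.array([[12, 20,  8,  1],
              [ 3,  0,  7,  9],
              [ 4,  6,  2,  5],
              [10, 11, 13, 14]])

def pool2d(x, size=2, stride=2, op=np.max):
    # Apply `op` (np.max, np.mean or np.sum) over each pooling window.
    rows = (x.shape[0] - size) // stride + 1
    cols = (x.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

print(pool2d(x, op=np.max))    # max pooling
print(pool2d(x, op=np.mean))   # average pooling
print(pool2d(x, op=np.sum))    # sum pooling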

The complete neural network

So far we have seen how a CNN applies the convolution operation with the help of filters, which hyperparameters we can tune, and what pooling is and how it helps. After all these steps, the CNN passes its output to an ANN.

The CNN passes the features it extracts from the data to a feedforward neural network: the output of its layers, in particular the output after the pooling layer, is flattened into a vector and fed to the feedforward network.
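
As a tiny illustration of this hand-off (a sketch, not the article's code), the pooled feature maps are simply reshaped into one long vector before being fed to the dense layers:

import numpy as np

pooled = np.random.rand(4, 4, 20)   # e.g. 4x4 pooled feature maps from 20 filters
vector = pooled.reshape(-1)         # flattened into a single vector
print(vector.shape)                 # (320,) -- this is what the feedforward layers receive

In the sarcasm model built later in this article, a GlobalMaxPooling1D layer plays this role, reducing each feature map to a single value.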

CNN for text data

Now we will see how we can use the CNN network to detect sarcasm in text. First, let’s see how the CNN model can be used while working with text data.

The image below shows the working of CNN for the textual classification task.

[Image: CNN for text classification]

In the image, we see text data being fed as input to the CNN. The inputs are the word embeddings of the words, and they are passed to the convolutional layers, which consist of multiple filters that perform the convolution operation.

This is then passed on to the max-pooling layer and finally to the fully connected layer. It is important to note that, unlike image data, text data is primarily sequential: when applying CNNs to text, we focus on a one-dimensional spatial relationship along the sequence.
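
To see what "one-dimensional" means here (a sketch, not the article's code): each sentence becomes a matrix of shape (sequence_length, embedding_dim), and a Conv1D filter spans the full embedding dimension while sliding only along the word axis.

import numpy as np

seq_len, emb_dim, kernel_size = 10, 300, 3
sentence = np.random.rand(seq_len, emb_dim)    # 10 word embeddings of size 300
filt = np.random.rand(kernel_size, emb_dim)    # one filter covers 3 consecutive words

# Valid 1D convolution with stride 1: one value per window of 3 consecutive words
feature_map = np.array([np.sum(sentence[i:i + kernel_size] * filt)
                        for i in range(seq_len - kernel_size + 1)])
print(feature_map.shape)                       # (8,)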

Using CNN for sarcasm detection

Sarcasm detection is the task of automatically identifying sarcasm in text. Sarcasm is very common today, and a lot of text from social media contains sarcasm, which is difficult for a computer to process and understand.

We will now see how a Convolutional Neural Network can be used to detect sarcasm. The dataset we will be using can be downloaded from https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection.

The dataset is open-sourced and contains news headlines that can be used for sarcasm detection.

We first begin with the imports. Follow the code below to import the necessary libraries and modules.

import pandas as pd
import numpy as np
import re
import json
import gensim
import math
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import KeyedVectors
import keras 
from keras.models import Sequential, Model 
from keras import layers
from keras.layers import Dense, Dropout, Conv1D, GlobalMaxPooling1D
import os

These are the modules and libraries we will be using. Now, let's read the data. If you're using Colab, you can mount your drive with the following code.

from google.colab import drive
drive.mount('/content/drive')

# Output:
Mounted at /content/drive

Now, to read the data, we define a function as follows. The function takes the file location as an argument, opens the file in read mode, and parses each line as a JSON object (the file is in JSON Lines format).

def parse_data(file):
    for l in open(file,'r'):
        yield json.loads(l)

We now call the function with the appropriate file path. Don’t forget to specify the path to the file in your system. We convert it to a list and on the next line, we convert it to a dataframe object.

data = list(parse_data('/content/drive/MyDrive/Datasets/Sarcasm_Headlines_Dataset_v2.json'))
df = pd.DataFrame(data)

Let’s get a glimpse of the data.

df.head()
[Image: output of df.head()]

We don't need the article_link column, so let's remove it using the pop method.

df.pop('article_link')

# Output:
0        https://www.theonion.com/thirtysomething-scien...
1        https://www.huffingtonpost.com/entry/donna-edw...
2        https://www.huffingtonpost.com/entry/eat-your-...
3        https://local.theonion.com/inclement-weather-p...
4        https://www.theonion.com/mother-comes-pretty-c...
                               ...                        
28614    https://www.theonion.com/jews-to-celebrate-ros...
28615    https://local.theonion.com/internal-affairs-in...
28616    https://www.huffingtonpost.com/entry/andrew-ah...
28617    https://www.theonion.com/mars-probe-destroyed-...
28618    https://www.theonion.com/dad-clarifies-this-no...
Name: article_link, Length: 28619, dtype: object

Now, let’s see what we have in our dataframe. We use the head method again.

df.head()
[Image: output of df.head() after removing article_link]

We want to preprocess our data now.

In the code snippet below, we define a function to perform general cleaning. Inside the function, we iterate over each row in the corpus and, with a nested loop, over each word in the row.

We apply a regular expression to each word to strip out everything except letters, convert the word to lowercase, collect the cleaned words in a list, and join them back into a string. The cleaned rows are returned as a Pandas Series.

def text_clean(corpus):
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            # keep only letters, then lowercase
            p1 = re.sub(pattern='[^a-zA-Z]', repl=' ', string=word)
            p1 = p1.lower()
            qs.append(p1)
        cleaned_rows.append(' '.join(qs))
    return pd.Series(cleaned_rows)

In the code snippet below, we define our second preprocessing function, for stopword removal. In the function, we keep some wh-words, since they can carry important information, then load the stopwords provided by NLTK and filter the remaining ones out of our data.

def stopwords_removal(corpus):
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.remove(word)
    corpus = [[word for word in sentence.split() if word not in stop] for sentence in corpus]
    return corpus

The third function we define is for lemmatizing the text, i.e. converting each word into its base (lemma) form. In the function, we first initialize the lemmatizer and then, using a list comprehension, map each word to its corresponding lemma.

def lemmatize(corpus):
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(word, pos='v') for word in sentence] for sentence in corpus]
    return corpus

Now, we call the functions defined above on our data. Refer to the code snippet below to perform the preprocessing. First, we clean the data using the text_clean function, passing in the headline column, which contains the headlines we want to classify.

Then, we remove stopwords using the stopwords_removal function and perform lemmatization using the lemmatize function. Lastly, we join the tokens of each headline back together to get final_cleaned_corpus, and print its first 10 entries.

corpus = text_clean(df['headline'])
corpus = stopwords_removal(corpus)
corpus = lemmatize(corpus)
final_cleaned_corpus = [' '.join(x) for x in corpus]

final_cleaned_corpus[:10]
['thirtysomething scientists unveil doomsday clock hair loss',
 'dem rep totally nail congress fall short gender racial equality',
 'eat veggies 9 deliciously different recipes',
 'inclement weather prevent liar get work',
 'mother come pretty close use word stream correctly',
 'white inheritance',
 '5 ways file tax less stress',
 'richard branson global warm donation nearly much cost fail balloon trip',
 'shadow government get large meet marriott conference room b',
 'lot parent know scenario']

As you can see, the data is now clean: it is free of punctuation marks, symbols and stopwords, and all the words have been converted to their lemmas. We now want to vectorize this data, i.e. convert it into a mathematical representation.

We will make use of the pre-trained word2vec model that we worked with previously. Use the code below to download and load the model. On the first line of the snippet below, we download the model archive.

The second line is useful if you are working in Colab: we change our working directory to the folder where the downloaded model is present. We then set the path to the downloaded model and, on the last line, load it.

!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

os.chdir("/root/input")
!ls -al

model_path = '/root/input/GoogleNews-vectors-negative300.bin.gz'
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

Now that our pre-trained model is loaded, we will use it to vectorize our data. We set two parameters. The first one is MAX_LENGTH, which we set to 10: if a headline is longer than 10 words, only its first 10 words are used.

The second parameter is VECTOR_SIZE, the dimensionality of our word vectors. Note that if a headline is shorter than the maximum length we specified, it will be padded so that all headlines yield the same number of vectors.

MAX_LENGTH = 10
VECTOR_SIZE = 300

We now define a function to vectorize the data. It converts each headline into a list of word vectors, truncating headlines longer than MAX_LENGTH and padding shorter ones with zero vectors, so that every headline yields exactly MAX_LENGTH vectors.

def vectorize_data(data):
    
    vectors = []
    
    padding_vector = [0.0] * VECTOR_SIZE
    
    for i, data_point in enumerate(data):
        data_point_vectors = []
        count = 0
        
        tokens = data_point.split()
        
        for token in tokens:
            if count >= MAX_LENGTH:
                break
            if token in model:              # skip words missing from the word2vec vocabulary
                data_point_vectors.append(model[token])
            count = count + 1
        
        if len(data_point_vectors) < MAX_LENGTH:
            to_fill = MAX_LENGTH - len(data_point_vectors)
            for _ in range(to_fill):
                data_point_vectors.append(padding_vector)
        
        vectors.append(data_point_vectors)
        
    return vectors

We now apply the function to our preprocessed data.

vectorized_headlines = vectorize_data(final_cleaned_corpus)

Now, we will quickly validate that each headline has exactly 10 vectors. The code below prints the index of any headline that does not.

for i, vec in enumerate(vectorized_headlines):
    if len(vec) != MAX_LENGTH:
        print(i)

When we run the code, there is no output, which means every headline was converted to exactly 10 vectors.

We will now split the data into train and test sets. To do this, follow the steps given below. First, we compute the index that marks 70% of the data; everything before it will be the training set.

train_div = math.floor(0.7 * len(vectorized_headlines))
train_div

We now use this index to build our train and test sets.

X_train = vectorized_headlines[:train_div]
y_train = df['is_sarcastic'][:train_div]
X_test = vectorized_headlines[train_div:]
y_test = df['is_sarcastic'][train_div:]

Let’s print the size of each of them.

print('The size of X_train is:', len(X_train), '\nThe size of y_train is:', len(y_train),
      '\nThe size of X_test is:', len(X_test), '\nThe size of y_test is:', len(y_test))

# Output:
The size of X_train is: 20033 
The size of y_train is: 20033 
The size of X_test is: 8586 
The size of y_test is: 8586

We have now split the data into train and test sets. Before passing the data to the CNN, we have to make sure it is in the correct shape for the model to work with; we use the reshape function provided by NumPy.

X_train = np.reshape(X_train, (len(X_train), MAX_LENGTH, VECTOR_SIZE))
X_test = np.reshape(X_test, (len(X_test), MAX_LENGTH, VECTOR_SIZE))
y_train = np.array(y_train)
y_test = np.array(y_test)

We now start building the Convolutional Neural Network. First, we define the model parameters, which will be used when building and training the model. You can see that our CNN uses 20 filters and a kernel size of 10.

FILTERS=20
KERNEL_SIZE=10
HIDDEN_LAYER_1_NODES=50
HIDDEN_LAYER_2_NODES=25
DROPOUT_PROB=0.35
NUM_EPOCHS=50
BATCH_SIZE=100

Let's use Keras to build the CNN and define the layers. Since we are working with text, we use a one-dimensional convolution layer with the stride set to 1. Notice the different layers we are adding: the convolution layer, the pooling layer, the fully connected layers, the dropout layers, etc.

model = Sequential()

model.add(Conv1D(FILTERS,
                 KERNEL_SIZE,
                 padding='same',
                 strides=1,
                 activation='relu', 
                 input_shape = (MAX_LENGTH, VECTOR_SIZE)))
model.add(GlobalMaxPooling1D())
model.add(Dense(HIDDEN_LAYER_1_NODES, activation='relu'))
model.add(Dropout(DROPOUT_PROB))
model.add(Dense(HIDDEN_LAYER_2_NODES, activation='relu'))
model.add(Dropout(DROPOUT_PROB))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

# Output:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d (Conv1D)              (None, 10, 20)            60020     
_________________________________________________________________
global_max_pooling1d (Global (None, 20)                0         
_________________________________________________________________
dense (Dense)                (None, 50)                1050      
_________________________________________________________________
dropout (Dropout)            (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 25)                1275      
_________________________________________________________________
dropout_1 (Dropout)          (None, 25)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 26        
=================================================================
Total params: 62,371
Trainable params: 62,371
Non-trainable params: 0
_________________________________________________________________
None

These are the layers our CNN consists of. We can see that the model has about 62k parameters in total. We now compile the model with the code snippet given below.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Let’s train our model.

training_history = model.fit(X_train, y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)

# Output:
Epoch 1/50
201/201 [==============================] - 44s 4ms/step - loss: 0.6457 - accuracy: 0.6015
Epoch 2/50
201/201 [==============================] - 1s 4ms/step - loss: 0.4423 - accuracy: 0.8007
Epoch 3/50
201/201 [==============================] - 1s 4ms/step - loss: 0.3291 - accuracy: 0.8694
...few rows are hidden...
Epoch 48/50
201/201 [==============================] - 1s 4ms/step - loss: 0.0105 - accuracy: 0.9966
Epoch 49/50
201/201 [==============================] - 1s 4ms/step - loss: 0.0136 - accuracy: 0.9963
Epoch 50/50
201/201 [==============================] - 1s 4ms/step - loss: 0.0074 - accuracy: 0.9974

Finally, let’s check the model accuracy.

loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

# Output:
Testing Accuracy:  0.7622

The complete code used in this article is given below for quick reference.

import pandas as pd
import numpy as np
import re
import json
import gensim
import math
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import KeyedVectors
import keras 
from keras.models import Sequential, Model 
from keras import layers
from keras.layers import Dense, Dropout, Conv1D, GlobalMaxPooling1D
import os

from google.colab import drive
drive.mount('/content/drive')

def parse_data(file):
    for l in open(file,'r'):
        yield json.loads(l)

data = list(parse_data('/content/drive/MyDrive/Datasets/Sarcasm_Headlines_Dataset_v2.json'))
df = pd.DataFrame(data)

df.head()

df.pop('article_link')

df.head()

def text_clean(corpus):
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            # keep only letters, then lowercase
            p1 = re.sub(pattern='[^a-zA-Z]', repl=' ', string=word)
            p1 = p1.lower()
            qs.append(p1)
        cleaned_rows.append(' '.join(qs))
    return pd.Series(cleaned_rows)

def stopwords_removal(corpus):
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.remove(word)
    corpus = [[word for word in sentence.split() if word not in stop] for sentence in corpus]
    return corpus

def lemmatize(corpus):
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(word, pos='v') for word in sentence] for sentence in corpus]
    return corpus

corpus = text_clean(df['headline'])
corpus = stopwords_removal(corpus)
corpus = lemmatize(corpus)
final_cleaned_corpus = [' '.join(x) for x in corpus]

final_cleaned_corpus[:10]

!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

os.chdir("/root/input")
!ls -al

model_path = '/root/input/GoogleNews-vectors-negative300.bin.gz'
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

MAX_LENGTH = 10
VECTOR_SIZE = 300

def vectorize_data(data):
    
    vectors = []
    
    padding_vector = [0.0] * VECTOR_SIZE
    
    for i, data_point in enumerate(data):
        data_point_vectors = []
        count = 0
        
        tokens = data_point.split()
        
        for token in tokens:
            if count >= MAX_LENGTH:
                break
            if token in model:              # skip words missing from the word2vec vocabulary
                data_point_vectors.append(model[token])
            count = count + 1
        
        if len(data_point_vectors) < MAX_LENGTH:
            to_fill = MAX_LENGTH - len(data_point_vectors)
            for _ in range(to_fill):
                data_point_vectors.append(padding_vector)
        
        vectors.append(data_point_vectors)
        
    return vectors

vectorized_headlines = vectorize_data(final_cleaned_corpus)

for i, vec in enumerate(vectorized_headlines):
    if len(vec) != MAX_LENGTH:
        print(i)

train_div = math.floor(0.7 * len(vectorized_headlines))
train_div

X_train = vectorized_headlines[:train_div]
y_train = df['is_sarcastic'][:train_div]
X_test = vectorized_headlines[train_div:]
y_test = df['is_sarcastic'][train_div:]

print('The size of X_train is:', len(X_train), '\nThe size of y_train is:', len(y_train),
      '\nThe size of X_test is:', len(X_test), '\nThe size of y_test is:', len(y_test))

X_train = np.reshape(X_train, (len(X_train), MAX_LENGTH, VECTOR_SIZE))
X_test = np.reshape(X_test, (len(X_test), MAX_LENGTH, VECTOR_SIZE))
y_train = np.array(y_train)
y_test = np.array(y_test)

FILTERS=20
KERNEL_SIZE=10
HIDDEN_LAYER_1_NODES=50
HIDDEN_LAYER_2_NODES=25
DROPOUT_PROB=0.35
NUM_EPOCHS=50
BATCH_SIZE=100

model = Sequential()

model.add(Conv1D(FILTERS,
                 KERNEL_SIZE,
                 padding='same',
                 strides=1,
                 activation='relu', 
                 input_shape = (MAX_LENGTH, VECTOR_SIZE)))
model.add(GlobalMaxPooling1D())
model.add(Dense(HIDDEN_LAYER_1_NODES, activation='relu'))
model.add(Dropout(DROPOUT_PROB))
model.add(Dense(HIDDEN_LAYER_2_NODES, activation='relu'))
model.add(Dropout(DROPOUT_PROB))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

training_history = model.fit(X_train, y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)

loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Final thoughts

In this article, we first saw what a CNN is, how it works and what its various components are, and then we built a sarcasm detector with the help of a CNN model, which gives us pretty good accuracy.

Thanks for reading.