
Using BERT as an Embedder

We will use the same base model as before, but instead of training our own embedding layer we will use BERT’s embedding layer. We will not train BERT’s weights; we will use it as a fixed vector representation for our words and see how much it improves our model.

You can use this approach whenever you have your own NLP model: just add a frozen BERT layer as the embedding step at the beginning of it and you get a more powerful, more efficient model.

We will follow the same procedure as before for importing the libraries and cleaning the text.

Step 1: Import Dependencies

We import libraries for data manipulation, math, regular expressions, and text processing, and install the packages needed for BERT and its tokenizer.

import numpy as np
import math
import re
import pandas as pd
from bs4 import BeautifulSoup
import random

from google.colab import drive

!pip install bert-for-tf2
!pip install sentencepiece

To keep the code lighter, we import the Keras layers module directly, along with TensorFlow, TensorFlow Hub, and the bert package.

try:
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert

Step 2: Data Preprocessing

Run the code below to mount Google Drive so that we can access the folders on our drive.

drive.mount('/content/drive')

Let’s load the CSV file into the data variable and give names to the columns of the data frame.

cols = ['sentiment', 'id', 'date', 'query', 'user', 'text']

data = pd.read_csv('/content/drive/MyDrive/BERT/testdata.manual.2009.06.14.csv',
                   header=None,
                   names=cols,
                   engine='python',
                   encoding='latin1')
data.head()

The output of data.head() shows the six columns: sentiment, id, date, query, user, and text.

Let’s drop the id, date, query, and user columns, as they are of no use here.

data.drop(['id', 'date', 'query', 'user'], axis=1, inplace=True) 

The data frame is now a lighter version of the previous one, containing only the sentiment and text columns.

data.head()

Step 3: Data Cleaning

After loading the data, let’s clean the text:

  • First, let’s remove digits from the text and replace it with whitespace
  • We will find the URL from the text and replace it with whitespace
  • We will remove the punctuations from the text and replace them with whitespace
  • We will remove the extra whitespace from the text

The full cleaning function is below; run this code.

def clean_tweet(tweet):
  tweet = BeautifulSoup(tweet, 'lxml').get_text()           # strip HTML tags
  tweet = re.sub(r"@[A-Za-z0-9]+", ' ', tweet)              # remove @mentions
  tweet = re.sub(r"https?://[A-Za-z0-9./]+", ' ', tweet)    # remove URLs
  tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)              # keep only letters and basic punctuation
  tweet = re.sub(r" +", ' ', tweet)                         # collapse multiple spaces
  return tweet
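
As a quick sanity check, we can run the function on a made-up tweet (a hypothetical example, not taken from the dataset):

sample = "@someuser I loved it!! Check https://t.co/xyz 10/10"   # hypothetical tweet
print(clean_tweet(sample))  # the mention, the URL and the digits are replaced, extra spaces are collapsed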

We store the cleaned text in data_clean.

data_clean = [clean_tweet(tweet) for tweet in data.text]

Our labels are not consecutive: the data uses 0 for negative and 4 for positive. So we map the 4s to 1 to get standard 0/1 labels.

data_labels = data.sentiment.values
data_labels[data_labels == 4] = 1
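
As a quick check, we can look at the unique label values after the remapping. Note that, depending on the file you use, there may also be neutral tweets labelled 2, which you would want to filter out first; this is an assumption to verify against your own data.

print(np.unique(data_labels))  # expect [0 1] if only negative and positive tweets are present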

Step 4: Tokenization

We need to create a BERT layer to get access to the tokenizer’s metadata (such as the vocabulary file and the casing). Let’s create our first BERT layer by calling TensorFlow Hub, where pre-trained models are stored, through hub.KerasLayer.

The URL of the BERT model shows its configuration: L=12 layers, H=768 hidden units, and A=12 attention heads.

FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)

We retrieve the path to the vocabulary file from the layer and store it in a variable named vocab_file.

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()

The next attribute tells us whether the model lowercases the text or not; we then pass both to the tokenizer.

do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

Since we are doing single-sentence classification, our new encode_sentence function simply wraps the tokenized sentence with the [CLS] and [SEP] tokens.

def encode_sentence(sent):
    return ["[CLS]"] + tokenizer.tokenize(sent) + ["[SEP]"]
data_inputs = [encode_sentence(sentence) for sentence in data_clean]
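
A quick check on a hypothetical sentence shows what the encoding looks like (the exact WordPiece sub-tokens depend on the vocabulary):

print(encode_sentence("I loved this movie!"))   # hypothetical example
# something like ['[CLS]', 'i', 'loved', 'this', 'movie', '!', '[SEP]']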

BERT needs three inputs for each sentence:

  • the token ids, which map each token to its index in the BERT vocabulary
  • the input mask, which is 1 where there is a real token and 0 where there is padding
  • the segment ids, a sequence of zeros and ones, where 0 indicates that a token belongs to the first sentence and 1 that it belongs to the second

Step 5: Dataset Creation

These are the inputs for the BERT layer. The first function, get_ids, returns the vocabulary ids of the tokens. The second, get_mask, returns the padding mask: 1 when a token is a real token and 0 when it is a padding token. The third, get_segments, tells us whether a token belongs to the first sentence or the second one.

def get_ids(tokens):
    return tokenizer.convert_tokens_to_ids(tokens)

def get_mask(tokens):
    return np.char.not_equal(tokens, "[PAD]").astype(int)

def get_segments(tokens):
    seg_ids = []
    current_seg_id = 0
    for tok in tokens:
        seg_ids.append(current_seg_id)
        if tok == "[SEP]":
            current_seg_id = 1-current_seg_id # convert 1 into 0 and vice versa
    return seg_ids
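
Here is what the three functions return for a short, hypothetical tokenized sentence (the exact ids depend on the vocabulary file):

sample_tokens = ["[CLS]", "great", "movie", "[SEP]"]   # hypothetical example
print(get_ids(sample_tokens))       # the vocabulary ids of the four tokens
print(get_mask(sample_tokens))      # [1 1 1 1] -> all real tokens; padding is only added later by padded_batch
print(get_segments(sample_tokens))  # [0, 0, 0, 0] -> a single sentence, so every token is in segment 0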

Each sentence now comes with three input sequences instead of one. Next, we pair each tokenized sentence with its label and its length, shuffle so that positive and negative examples are mixed, sort by length (which makes padding more efficient), and keep only the sentences longer than 7 tokens in sorted_all.

data_with_len = [[sent, data_labels[i], len(sent)]
                 for i, sent in enumerate(data_inputs)]
random.shuffle(data_with_len)
data_with_len.sort(key=lambda x: x[2])
sorted_all = [([get_ids(sent_lab[0]),
                get_mask(sent_lab[0]),
                get_segments(sent_lab[0])],
               sent_lab[1])
              for sent_lab in data_with_len if sent_lab[2] > 7]

A list is iterable, so a lambda returning it can be used as the generator for a tf.data.Dataset.

all_dataset = tf.data.Dataset.from_generator(lambda: sorted_all,
                                             output_types=(tf.int32, tf.int32))

We compute the number of batches and store it in NB_BATCHES, and reserve one tenth of them for testing in NB_BATCHES_TEST.

Before splitting, we shuffle the batched dataset, because right now the shortest sentences are at the beginning and the longest at the end. Since our dataset is not big, we can set the shuffle buffer size to exactly the number of batches, which is the best way to do it here.

We then split the batches into a train set and a test set.

BATCH_SIZE = 32
# each element is a (3, seq_len) tensor of ids/mask/segments plus a scalar label
all_batched = all_dataset.padded_batch(BATCH_SIZE, padded_shapes=((3, None), ()))

NB_BATCHES = math.ceil(len(sorted_all) / BATCH_SIZE)
NB_BATCHES_TEST = NB_BATCHES // 10
all_batched = all_batched.shuffle(NB_BATCHES)  # shuffle returns a new dataset, so reassign it
test_dataset = all_batched.take(NB_BATCHES_TEST)
train_dataset = all_batched.skip(NB_BATCHES_TEST)
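
To convince ourselves that the shapes are right, we can peek at one batch; this is just a sanity check under the pipeline defined above.

for tokens, labels in train_dataset.take(1):
    print(tokens.shape, labels.shape)  # expect (batch_size, 3, seq_len) and (batch_size,)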

Step 6: Model Building

Let’s create a tokenized sentence and try to call the BERT layer. We build the three inputs that BERT needs (the ids, the mask, and the segment values), add a batch dimension to each, and pass them to bert_layer.

my_sent = ["[CLS]"] + tokenizer.tokenize("Roses are red.") + ["[SEP]"]
bert_layer([tf.expand_dims(tf.cast(get_ids(my_sent), tf.int32), 0),
            tf.expand_dims(tf.cast(get_mask(my_sent), tf.int32), 0),
            tf.expand_dims(tf.cast(get_segments(my_sent), tf.int32), 0)])
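
The layer returns two tensors. A small check of their shapes (a sketch, assuming the L-12, H-768 model loaded above) makes the difference between them clear.

pooled_output, sequence_output = bert_layer(
    [tf.expand_dims(tf.cast(get_ids(my_sent), tf.int32), 0),
     tf.expand_dims(tf.cast(get_mask(my_sent), tf.int32), 0),
     tf.expand_dims(tf.cast(get_segments(my_sent), tf.int32), 0)])
print(pooled_output.shape)    # (1, 768): one vector for the whole sentence
print(sequence_output.shape)  # (1, seq_len, 768): one vector per token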

The first output is used for sentence-level classification and the second one for token-level tasks. The model below is the same DCNN as before; the only addition is the embed_with_bert function, which passes the ids, mask, and segment inputs to the frozen BERT layer and returns the per-token embeddings (the second output).

class DCNNBERTEmbedding(tf.keras.Model):
    
    def __init__(self,
                 nb_filters=50,
                 FFN_units=512,
                 nb_classes=2,
                 dropout_rate=0.1,
                 name="dcnn"):
        super(DCNNBERTEmbedding, self).__init__(name=name)
        
        self.bert_layer = hub.KerasLayer(
            "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
            trainable=False)
        
        self.bigram = layers.Conv1D(filters=nb_filters,
                                    kernel_size=2,
                                    padding="valid",
                                    activation="relu")
        self.trigram = layers.Conv1D(filters=nb_filters,
                                     kernel_size=3,
                                     padding="valid",
                                     activation="relu")
        self.fourgram = layers.Conv1D(filters=nb_filters,
                                      kernel_size=4,
                                      padding="valid",
                                      activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=FFN_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if nb_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=nb_classes,
                                           activation="softmax")
    
    def embed_with_bert(self, all_tokens):
        _, embs = self.bert_layer([all_tokens[:, 0, :],
                                   all_tokens[:, 1, :],
                                   all_tokens[:, 2, :]])
        return embs
    
    def call(self, inputs, training):
        x = self.embed_with_bert(inputs)

        x_1 = self.bigram(x) # (batch_size, seq_len-1, nb_filters)
        x_1 = self.pool(x_1) # (batch_size, nb_filters)
        x_2 = self.trigram(x) # (batch_size, seq_len-2, nb_filters)
        x_2 = self.pool(x_2) # (batch_size, nb_filters)
        x_3 = self.fourgram(x) # (batch_size, seq_len-3, nb_filters)
        x_3 = self.pool(x_3) # (batch_size, nb_filters)
        
        merged = tf.concat([x_1, x_2, x_3], axis=-1) # (batch_size, 3 * nb_filters)
        merged = self.dense_1(merged)
        merged = self.dropout(merged, training)
        output = self.last_dense(merged)
        
        return output

Step 7: Training

Let’s set the hyperparameters and the other settings related to training.

NB_FILTERS = 100
FFN_UNITS = 256
NB_CLASSES = 2

DROPOUT_RATE = 0.2

BATCH_SIZE = 32
NB_EPOCHS = 5

Let’s instantiate the model with these parameter values.

Dcnn = DCNNBERTEmbedding(nb_filters=NB_FILTERS,
                         FFN_units=FFN_UNITS,
                         nb_classes=NB_CLASSES,
                         dropout_rate=DROPOUT_RATE)

Now we compile the model with the appropriate loss function, optimizer, and accuracy metric.

if NB_CLASSES == 2:
    Dcnn.compile(loss="binary_crossentropy",
                 optimizer="adam",
                 metrics=["accuracy"])
else:
    Dcnn.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam",
                 metrics=["sparse_categorical_accuracy"])

We will create a checkpoint directory on the drive. We need checkpoints because, once training is done, we want to get back the trained weights and reuse them later.

checkpoint_path = "./drive/MyDrive/projects/BERT/ckpt_bert_embedding/"
ckpt = tf.train.Checkpoint(Dcnn=Dcnn)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=1)
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest Checkpoint restored!")

We will create a custom callback class to pass to the fit function so that it runs some code at the end of each epoch; here we use it to save a checkpoint.

class MyCustomCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        ckpt_manager.save()
        print("Checkpoint saved at {}.".format(checkpoint_path))

Now we will use a callback for our fit method.

Dcnn.fit(train_dataset,
         epochs=NB_EPOCHS,
         callbacks=[MyCustomCallback()])

Step 8: Evaluation

Let’s see the results by running the code below.

results = Dcnn.evaluate(test_dataset)
print(results)

Finally, we will write a prediction function that takes a sentence as input and prints whether the predicted sentiment is negative or positive.

def get_prediction(sentence):
    tokens = encode_sentence(sentence)

    input_ids = get_ids(tokens)
    input_mask = get_mask(tokens)
    segment_ids = get_segments(tokens)

    inputs = tf.stack(
        [tf.cast(input_ids, dtype=tf.int32),
         tf.cast(input_mask, dtype=tf.int32),
         tf.cast(segment_ids, dtype=tf.int32)],
         axis=0)
    inputs = tf.expand_dims(inputs, 0)

    output = Dcnn(inputs, training=False)

    sentiment = math.floor(output*2)  # the sigmoid output is in [0, 1): values below 0.5 map to 0, values of 0.5 and above map to 1

    if sentiment == 0:
        print("Output of the model: {}\nPredicted sentiment: negative.".format(
            output))
    elif sentiment == 1:
        print("Output of the model: {}\nPredicted sentiment: positive.".format(
            output))

Let’s call the function with an input sentence.

get_prediction("This movie was pretty interesting.")