Introduction to Transformers and BERT for NLP

So far we have covered several NLP architectures, including ANNs, CNNs, RNNs, and their variants. But transformers have shown tremendous potential and are steadily replacing these well-known architectures that we have discussed so far.

In this article, we get an introduction to and an overview of the transformer architecture and how it works.

The transformer

The transformer is widely considered the state of the art in Natural Language Processing. It is a deep learning model, i.e. it builds on the neural network architectures we have discussed in the previous few articles.

As we just discussed in the introduction, the transformer has replaced the RNN and LSTM models for various tasks. Recent NLP models such as BERT, GPT, T5, etc. are based on the transformer architecture.

The basic idea behind the transformer is the encoder-decoder architecture. We will dive into what this means and how it works in detail.

The main problem with RNNs and LSTMs was that they failed to capture long-term dependencies: they could not capture the context of words that appear far apart, perhaps in earlier sentences or even earlier paragraphs.

As the distance between words increased, these models failed to capture the relationship between them efficiently. The transformer was introduced to overcome this issue, and it is currently the most sophisticated model for solving NLP problems.

Transformer mechanism

The transformer uses a mechanism called attention. Note that although the transformer is a deep learning architecture, it does not use recurrence as we saw in RNNs and LSTMs; it gets rid of recurrence entirely. Recurrence is the mechanism through which RNNs connect their layers through time.

More specifically, transformers use a variation of the attention mechanism called self-attention.
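To make self-attention concrete, here is a minimal sketch of scaled dot-product self-attention using numpy. The function and variable names (`self_attention`, `Wq`, `Wk`, `Wv`) are illustrative choices, not part of any library; this is a toy single-head version under the assumption of random, untrained projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X          : (seq_len, d_model) token embeddings
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token scored against every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # context-aware token representations

# Toy example: 4 tokens, embedding size 8, attention size 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)       # (4, 4): one new representation per token
```

Because every token attends to every other token directly, the distance between two words no longer matters, which is exactly how the transformer sidesteps the long-term dependency problem.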

Basic architecture of the transformer

The image below shows the basic architecture of the transformer.

[Image: basic architecture of the transformer]

The transformer is made up of two blocks, the encoder and the decoder, which work as a team. In the image, the transformer is shown performing machine translation, i.e. translating text from one language to another.

Here, a sentence from English is being translated to French.

Components of transformer

The transformer architecture consists of various components. These components are broadly divided into encoder and decoder.

The encoder structure consists of embedding, positional encoding, multi-head attention, and feedforward layers. The encoder side of the transformer is a stack of such encoder layers. The output of the final encoder layer is then given to the decoder.
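Since attention itself ignores word order, the positional encoding layer injects order information by adding a position-dependent signal to each embedding. Below is a sketch of the sinusoidal scheme used in the original transformer; the function name `positional_encoding` is an illustrative choice.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dimensions use sine, odd use cosine."""
    pos = np.arange(seq_len)[:, None]     # (seq_len, 1) token positions
    i = np.arange(d_model)[None, :]       # (1, d_model) embedding dimensions
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)  # (10, 16): one encoding vector per position
```

In practice this matrix is simply added to the token embeddings before they enter the first attention layer, so identical words at different positions get distinct inputs.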

The decoder structure consists of output embedding, positional encoding, and a stack of decoder layers (masked multi-head attention, multi-head attention over the encoder output, and a feedforward network). Together, these components make up the transformer architecture.

Introduction to BERT

BERT stands for Bidirectional Encoder Representations from Transformers. We have already seen the basic architecture of the transformer model, which consists of an encoder and a decoder; BERT is simply the encoder portion of that architecture.

So, the idea of BERT is to generate encoder representations (the mathematical equivalent of text data) for the language we are working with. We are not training the BERT model for any specific task; instead, we focus on generating the best and most efficient encodings of text.

Meaning of the term “Bidirectional”

But what does the word “bidirectional” in the name signify? It signifies that BERT reads text in both directions, i.e. left-to-right as well as right-to-left, so each word's representation is conditioned on the context both before and after it.

We usually read English from left to right, and earlier language models processed text the same way. BERT, by contrast, looks at both directions simultaneously, which enables it to generate superior encodings.

Once these encodings are generated, we say that BERT is pre-trained, and we can then fine-tune it for specific downstream tasks.

Final Thoughts

In this introductory article, we saw what transformers are, explored their architecture and components, and were finally introduced to BERT, which is nothing but the Bidirectional Encoder Representations from Transformers.