
Long Short-Term Memory (LSTMs) for NLP

In the previous post, we were introduced to Recurrent Neural Networks. In this post, we will build on that knowledge and look at an important variant of the RNN called the LSTM, and see how it solves some of the problems RNNs face.

Intro to LSTMs

Long Short-Term Memory networks, or LSTMs for short, are a type of Recurrent Neural Network. LSTMs are mostly used to process sequences of data such as speech and video, but they can also process single data points such as images.

If you recall our discussion of RNNs in the previous post, we saw that RNNs face issues such as vanishing and exploding gradients during training. LSTMs largely overcome these issues, and we will see how they do so.

The most important feature of LSTMs is that they can capture long-term dependencies effectively.

For example, suppose the last word of a sentence depends on the first word. An RNN will struggle to learn this relationship because of vanishing gradients, but since LSTMs have an explicit memory, they are able to capture it.

LSTMs are used to perform tasks such as

  • Handwriting recognition
  • Speech recognition
  • Anomaly detection
  • Processing and making predictions on time series data

An LSTM cell

LSTM cells are the building blocks of the LSTM architecture. Understanding the cell will help us understand how the full LSTM model works.

We discussed earlier that an LSTM has memory and hence can capture long-term dependencies in text, i.e. it can associate the current word with words that appeared much earlier, even in previous sentences. This is done with the help of the LSTM cells.

In order to remember this information, the LSTM cell uses gates. Gates are the underlying structure that actually gives LSTMs their memory. Gates have the job of keeping relevant information in memory and discarding it when not needed.

The below image shows an LSTM cell.

[Image: An LSTM cell]

The LSTM cell has three gates, namely the forget gate, the input gate, and the output gate. As you can see, the input fed to the LSTM cell is a combination of the current input signal and the previous hidden state (from time step t - 1). This is similar to the input of an RNN.

The concatenated input then passes through the cell and into the different gates. The gates are nothing but small feedforward neural network layers followed by activation functions. They decide exactly what part of the input should be stored in the cell's memory, how much of it, and when it should be discarded.

We will now focus on each gate individually to understand the complete working of the LSTM.

Forget gate

This is the first gate that will be encountered by our data as it enters the LSTM cell. It is the job of the forget gate to decide how much of the information from the memory should be discarded. This is a very important job in the LSTM network. Why?

Consider this: if we keep adding information to memory and have no mechanism for discarding it, the memory fills up with irrelevant details and performance suffers. Apart from that, consider the following sentence.

Ram went to school while Shyam was playing with his friends. 

In the above sentence, the LSTM first needs to remember Ram, but then the focus shifts to Shyam, so the network should now forget the information about Ram. This job of forgetting is done with the help of the forget gate.

Our network should keep track of dependencies, but as soon as a newer, more relevant dependency (Shyam in the above example) arrives, the LSTM should discard the previous dependency and focus only on the newer one. The forget gate helps the LSTM do just this.

In the LSTM cell image, you can see that after the concatenated input enters the cell, first it passes through the forget gate which consists of a sigmoid activation function. If you recall, the sigmoid activation gives an output in the range of 0 to 1.

Here, when the sigmoid function outputs 0 for an element, it means that part of the past memory should be discarded completely. When the output is 1, it indicates that the LSTM should retain that part fully; values in between retain a corresponding fraction.

This signal from the gate is then multiplied element-wise with the memory cell state, as shown in the image and denoted by a blue circle. This multiplication retains only the relevant information from the past.
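To make this concrete, here is a minimal NumPy sketch of what the forget gate computes for a single time step. All of the weights, biases, and states below are random placeholders rather than parameters from a trained model, and the layout follows the standard LSTM formulation rather than any particular library.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Placeholder sizes and random values, for illustration only
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # forget gate weights
b_f = np.zeros(hidden_size)                                         # forget gate bias
h_prev = rng.standard_normal(hidden_size)                           # previous hidden state h_(t-1)
x_t = rng.standard_normal(input_size)                               # current input
c_prev = rng.standard_normal(hidden_size)                           # previous cell state (memory)

concat = np.concatenate([h_prev, x_t])   # concatenated input fed to the cell

# Forget gate: a value between 0 (discard) and 1 (keep) for each memory element
f_t = sigmoid(W_f @ concat + b_f)

# Element-wise multiplication with the old memory keeps only the relevant parts
c_after_forget = f_t * c_prev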

Input gate

This is the second gate in the network that the input data encounters. The job of the input gate is to decide what new information needs to be remembered at the current step and how much of it should be stored in memory.

To understand what the input gate does, let’s take an example of a sentence as shown below.

Sachin was an excellent player. Virat is another excellent player. 

In the example sentence, the LSTM first remembers Sachin. Then in the next sentence, LSTM with the help of the forget gate forgets about Sachin and now focuses on Virat. The input gate does this job of focusing on Virat and making sure that LSTM now remembers him.

As you can see in the image of an LSTM cell above, the input gate has two activation functions. Both of these functions contribute to the complete working of the input gate. Let’s focus on these functions now.

The first activation function in the gate is the sigmoid. As you know, the sigmoid outputs values in the range 0 to 1. So, the sigmoid layer looks at the input data and produces a value between 0 and 1 for each element of it.

When the sigmoid function outputs 1, it indicates that the LSTM should remember everything from the input information. Similarly, when it outputs 0, it indicates that nothing from the input is worth remembering.

Now, moving on to the second function used in the gate, the tanh activation. The tanh activation gives an output in the range -1 to 1. Remember that, as with the sigmoid, this is a neural network layer with a tanh activation, not just a bare tanh function.

What the tanh layer does is propose candidate values, computed from the current input and the previous hidden state, with which the memory cell can be updated. In other words, it suggests what new information could be written into the memory.

The outputs of the two activation functions are multiplied element-wise (each element of one vector with the corresponding element of the other). This product is then added to the memory vector, thereby updating the information in the memory cell.

These operations are denoted in the image by blue circles: one circle marks the element-wise multiplication of the two activation outputs, and another marks where the result is combined with the memory cell.
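Continuing the same NumPy sketch from the forget gate section (it reuses sigmoid, concat, rng, f_t, and c_prev defined there), the input gate and the tanh candidate layer could look like this; again, all weights are random placeholders.

# Input gate: how much of the candidate information to write into memory
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
i_t = sigmoid(W_i @ concat + b_i)       # values between 0 and 1

# Candidate values proposed by the tanh layer, in the range -1 to 1
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))
b_c = np.zeros(hidden_size)
c_tilde = np.tanh(W_c @ concat + b_c)

# Element-wise product of the two outputs, added to the (partially forgotten) memory
c_t = f_t * c_prev + i_t * c_tilde      # updated cell state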

Output gate

This is the final gate in the LSTM cell architecture. The job of the output gate is to understand and decide which bits of information in the current step should be sent as output.

So, essentially, of all the information held in the cell at the current time step, the output gate decides what, and how much, should be sent out as the output of the LSTM cell.

Firstly, in the LSTM image above, we can see that the input is passed through a sigmoid function. Again, this is a neural network layer followed by the sigmoid activation, although only the sigmoid is shown in the diagram.

The sigmoid function squashes the values into the range 0 to 1. Secondly, the cell state that has just been updated by the forget and input gates (old memory discarded, new information written) is passed through a tanh activation.

The tanh activation brings the output into the range -1 to 1. This tanh output is then multiplied element-wise with the output of the sigmoid function, and the result is the output of the LSTM cell; it is also passed on as the hidden state for the next time step.
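Finishing the same sketch (reusing concat and the updated cell state c_t from above), the output gate and the new hidden state can be written as:

# Output gate: how much of the updated memory to expose as the cell's output
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)
o_t = sigmoid(W_o @ concat + b_o)

# Hidden state: squashed cell state, gated element-wise by the output gate.
# h_t is the output of the cell and is also passed on to the next time step.
h_t = o_t * np.tanh(c_t)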

We have seen the architecture and working of the LSTM cell which is the building block of the LSTM network.

How does LSTM solve the Vanishing Gradient problem?

We have seen that RNNs, especially deep RNNs, face the problems of vanishing and exploding gradients. LSTMs largely avoid the vanishing gradient problem, but why?

Let’s look at the basic backpropagation path in the LSTM cell first. This path is shown in the image below with the help of a red line.

[Image: Backpropagation path through the LSTM cell, highlighted in red]

Two factors that contribute to vanishing gradients are the weights in the network and the derivatives of the activation functions.

In a network, if the weights or the derivatives of the activation functions are less than 1, the repeated multiplication during backpropagation may cause the gradients to vanish. On the other hand, if the weights or derivatives are greater than 1, they may contribute to the exploding gradient problem.

In the LSTM, let’s first talk about weights. The effective weight in the recurrence of the LSTM is the forget gate activation, as shown by the red line in the image. The forget gate uses a sigmoid function, so this activation cannot exceed 1, which prevents exploding gradients.

When the forget gate is open, its activation is close to 1, so vanishing gradients are not a problem either. As for activation functions, in the recurrence of the LSTM (the cell state path) the activation is effectively the identity function, whose derivative is 1.

Since both the derivative of the activation function and the effective weights stay close to 1 and never exceed 1, LSTMs are very effective at preventing the vanishing and exploding gradient problems. Hence, they can learn long-term dependencies successfully.
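As a purely illustrative back-of-the-envelope check (not part of any LSTM implementation), you can see how multiplying many per-step factors smaller than 1 drives a gradient towards zero, factors larger than 1 blow it up, while factors close to 1, which is what an open forget gate provides, keep it in a usable range:

steps = 50  # number of time steps the gradient is propagated back through

for factor in [0.5, 1.5, 0.99]:
    # The gradient through the recurrence behaves roughly like a product
    # of one such factor per time step.
    print(factor, "->", factor ** steps)

# 0.5  -> ~8.9e-16  (vanishes)
# 1.5  -> ~6.4e+08  (explodes)
# 0.99 -> ~0.61     (stays in a usable range)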

Using LSTMs to build a text generator

Having understood the theory of LSTMs and why they work so much better than plain RNNs for NLP, we now look at an application: we will build a text generator.

So what is text generation? Text generation is an area of research and application where we provide a model with some text and it generates more text, essentially by repeatedly predicting the next word.

Text generators can suggest the next word as we type on our phones, and they can be used to generate stories, music, lyrics, and so on. Here we will generate text using an LSTM.

The data we will use for this activity can be downloaded from https://data.world/promptcloud/hotels-on-makemytrip-com. It is data about hotels listed on MakeMyTrip. Download, extract, and place the data in the folder you will be working in.

Let’s begin by importing the necessary libraries and modules.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import pandas as pd
import numpy as np
import re
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Embedding

After importing the libraries, we now read the dataset using the Pandas read_csv function. Make sure that you set the correct path to the file depending upon where it is saved in your system. We also print the first few rows of our dataset using the head method.

data = pd.read_csv('/content/drive/MyDrive/Datasets/makemytrip_com-travel_sample.csv')

data.head(5)

Output:

[Image: First few rows of the hotel dataset]

Take your time to check all the columns in the dataset and see what they include. We observe that there is information about a lot of hotels in different Indian cities.

Let us do some data exploration and analysis. We will check how many hotels there are per city in the dataset. We do that by calling value_counts on the desired column, here city.

data.city.value_counts()

# Output:
NewDelhiAndNCR    1163
Goa               1122
Mumbai             543
Jaipur             534
Bangalore          512
                  ... 
Chitradurga          1
Sankri               1
Vengurla             1
Bijapur              1
Balangir             1
Name: city, Length: 770, dtype: int64

We see that the top 3 cities with the most hotels are New Delhi & NCR, Goa, and Mumbai. In this text generator, we will focus on generating descriptions for hotels located in Mumbai.

So, we first have to filter out the hotels that are located in Mumbai from our dataset.

We define a list with Mumbai in it; you can add more cities to this list. Then, using the city column, we keep only the rows where the location is Mumbai, using isin for the comparison. Basically, we are keeping all hotels based in Mumbai and discarding the others.

array = ['Mumbai']

data = data.loc[data['city'].isin(array)]
data

We quickly check the dataset to confirm that only hotels based in Mumbai remain.

data.head(5)

Output:

[Image: First few rows of the Mumbai-only hotel data]

We see from the output that our code was indeed correct as we now have the hotel data related only to Mumbai city.

As we discussed, we are interested in generating hotel descriptions, so we will work with the column named hotel_overview, which contains an overview of each hotel: its location, facilities, and so on. We select just this column.

data = data.hotel_overview
data

# Output:
294      Nestled in Mumbai, a city with strong historic...
309      3 km from Chhatrapati Shivaji International Ai...
321      Location Hotel Royal Garden is situated on Juh...
334      City Guest House is a beautiful property locat...
1238     Sai Residency Hotel is situated in the City of...
                               ...                        
18074                                                  NaN
19802                                                  NaN
19806    Hotel Rama Krishna is located in Mumbai, the m...
19808    Hotel Astoria is one of the leading budget pro...
19809    The most renowned accommodation service provid...
Name: hotel_overview, Length: 543, dtype: object

As you can see, we have assigned the contents of the desired column back to the variable we were working with (it is now a Series rather than a DataFrame). Next, we want to drop the NaN (null) values present in this data. Null values provide no benefit and would hinder the model, so we drop them.

The code to do this is shown below. With the help of dropna() we can drop all the null values. Let us apply it to our dataset above.

data = data.dropna()
data

# Output:
294      Nestled in Mumbai, a city with strong historic...
309      3 km from Chhatrapati Shivaji International Ai...
321      Location Hotel Royal Garden is situated on Juh...
334      City Guest House is a beautiful property locat...
1238     Sai Residency Hotel is situated in the City of...
                               ...                        
18015    |Hotel Alfa Grand, a corporate budget hotel th...
18057    |Hotel Oasis Fort is one of the beautiful and ...
19806    Hotel Rama Krishna is located in Mumbai, the m...
19808    Hotel Astoria is one of the leading budget pro...
19809    The most renowned accommodation service provid...
Name: hotel_overview, Length: 490, dtype: object

If you observe, we previously had 543 rows but after applying dropna() we now have 490 rows in total. So, there were 53 null rows present in our data that we have now successfully dropped.

We now move on to data cleaning and preprocessing. First, we define a function for stopword removal. In the code snippet below, we load the English stopwords and create a set out of them (a set contains only unique values and makes membership checks fast). Then, inside the function, we use a list comprehension to remove the stopwords from the text.

stop = set(stopwords.words('english'))
def stopwords_removal(data_point):
    data = [x for x in data_point.split() if x not in stop]
    return data

Next, we define a function that will help us clean the data. In this function we use regular expressions to keep only alphabetic characters and to drop single-character tokens, we convert the text to lowercase, and we remove stopwords. We also build a list of the unique words in each description, which we will use in the upcoming steps.

def clean_data(data):
    cleaned_data = []
    all_unique_words_in_each_description = []
    for entry in data:
        # Keep only alphabetic characters, then drop single-character tokens
        entry = re.sub(pattern='[^a-zA-Z]', repl=' ', string=entry)
        entry = re.sub(r'\b\w{0,1}\b', repl=' ', string=entry)
        entry = entry.lower()
        entry = stopwords_removal(entry)
        cleaned_data.append(entry)
        # Collect the unique words of this description for the vocabulary
        unique = list(set(entry))
        all_unique_words_in_each_description.extend(unique)
    return cleaned_data, all_unique_words_in_each_description

A third helper function is shown below. It creates a set (unique values only) from the list of words passed to it and returns both the set and its length, i.e. our vocabulary and the vocabulary size.

def unique_words(data):
    # Build the vocabulary from the list of words passed in
    unique_words = set(data)
    return unique_words, len(unique_words)

We now apply these functions to the data we have been working with. First, we apply clean_data to the dataframe and we get two lists. Then, we apply unique_words to the list that we got previously.

cleaned_data, all_unique_words_in_each_description = clean_data(data)
unique_words, length_of_unique_words = unique_words(all_unique_words_in_each_description)

Let’s get an overview of the data that we have now cleaned. We print out the first element (hotel description) from the cleaned_data list. This list contains all the hotel descriptions of Mumbai hotels.

cleaned_data[0]

# Output:
['nestled',
 'mumbai',
 'city',
 'strong',
 'historical',
 'links',
 'wonderful',
 'british',
 'architecture',
 'museums',
 'beaches',
 'places',
...

We also check the unique words in our data. As the name suggests, this set contains all the unique words.

unique_words

# Output:
{'liquors',
 'landscape',
 'kalamboli',
 'sanghralaya',
 'capacities',
 'new',
 'normal',
 'running',
 'pao',
 'eighteen',
 'design',
...

What we see in the output is basically the vocabulary of our dataset. Let’s also check the vocabulary size, i.e. the number of unique words.

length_of_unique_words

# Output
3395

We are now interested in creating a mapping from words to indices and a mapping from indices to words. Basically, we want to assign an index to every word in our vocabulary so that, given an index, we can look up the word, and given a word, we can look up its index.

To complete this task, we define a function as shown in the code snippet below. The function takes the words of our vocabulary and builds two dictionaries. Using enumerate in a for loop, each word is paired with an index: one dictionary maps words to their indices and the other maps indices back to their words.

def build_indices(unique_words):
    word_to_idx = {}
    idx_to_word = {}
    for i, word in enumerate(unique_words):
        word_to_idx[word] = i
        idx_to_word[i] = word
    return word_to_idx, idx_to_word

We make use of the function we defined above on the list we have been working with. The function returns two dictionaries. Refer to the code below.

word_to_idx, idx_to_word = build_indices(unique_words)

Let’s quickly check out these dictionaries one by one. The first one is word_to_idx where we have a word and its corresponding index.

word_to_idx

# Output:
{'palatable': 0,
 'aristocracy': 1,
 'waterfront': 2,
 'things': 3,
 'supreme': 4,
 'hatole': 5,
 'dwellers': 6,
 'idle': 7,
 'spotlessly': 8,
 'land': 9,
 'functional': 10,
 'mod': 11,
 'trident': 12,
...

The second one is idx_to_word, where we have an index and its corresponding word.

idx_to_word

# Output:
{0: 'palatable',
 1: 'aristocracy',
 2: 'waterfront',
 3: 'things',
 4: 'supreme',
 5: 'hatole',
 6: 'dwellers',
 7: 'idle',
 8: 'spotlessly',
 9: 'land',
 10: 'functional',
 11: 'mod',
 12: 'trident',
...

We now build our training corpus using the following block of code. Basically, for each description we generate sub-sequences that grow by one word at a time; for every such sequence, the model will try to predict the last word.

Take for example the sequences below.

  • hotels, cheap
  • hotels, cheap, scenery

The model takes sequence 1 and tries to predict its last word, i.e. cheap. Then, for the second sequence, it tries to predict the word scenery by looking at the words hotels and cheap.

The prepare_corpus function takes two inputs: the cleaned data (the hotel descriptions) and the word_to_idx dictionary. It generates sequences whose length increases by 1, just as in the examples above.

We also convert the sentences to their index values by using the dictionary we had created in the previous steps.

def prepare_corpus(corpus, word_to_idx):
    sequences = []
    for line in corpus:
        tokens = line
        for i in range(1, len(tokens)):
            # Take the first i + 1 tokens of the description as one training sequence
            i_gram_sequence = tokens[:i+1]
            i_gram_sequence_ids = []

            # Convert each token in the sequence to its vocabulary index
            for j, token in enumerate(i_gram_sequence):
                i_gram_sequence_ids.append(word_to_idx[token])

            sequences.append(i_gram_sequence_ids)
    return sequences

Let’s apply this function to our data that has been ready for processing.

sequences = prepare_corpus(cleaned_data, word_to_idx)

Now that we have applied the function and created the training corpus let’s check out the sequence with the maximum length.

max_sequence_len = max([len(x) for x in sequences])
max_sequence_len

# Output:
308

The sequence with the maximum length is 308.

Let’s also quickly verify that the sequences look just as in the example above. We check the first and second sequences in the training data. Remember that we converted the words to indices, so we expect indices rather than words.

print(sequences[0])
print(sequences[1])

# Output:
[893, 2139]
[893, 2139, 2591]

From the output, we see that the size of the sequence keeps increasing by 1.

If we look up the indices obtained above in the idx_to_word dictionary, we get the words of the first few sequences.

print(idx_to_word[893])
print(idx_to_word[2139])
print(idx_to_word[2591])

# Output:
nestled
mumbai
city

As we see in the output, these are the words corresponding to the indices we looked up. So, the first sequence is "nestled, mumbai" and the second is "nestled, mumbai, city".

Running the code below gives the total number of sequences. Our model will train on all of these sequences, predicting the last word of each one.

len(sequences)

# Output:
51836

So we are working with 51,836 sequences in total, roughly 52K.

Before building and training the model, we have one more thing to do. We need to pad the data. But why do we need to perform padding?

The model expects inputs of a fixed size, so every training sample we pass in must have the same length. That is why we pad the data: we take the length of the longest training sample and pad the shorter ones until all are equal in length.

This can be done with the code snippet below. The function pads all sequences to the same length, then splits each padded sequence into the input X (all words except the last) and the target y (the last word), which is one-hot encoded over the vocabulary. We then apply this function to our processed data.

def build_input_data(sequences, max_sequence_len, length_of_unique_words):
    # Pad every sequence with zeros at the front so they all have the same length
    sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))
    X = sequences[:, :-1]  # all words except the last one form the input
    y = sequences[:, -1]   # the last word is the prediction target
    y = np_utils.to_categorical(y, length_of_unique_words)  # one-hot encode the target
    return X, y

X, y = build_input_data(sequences, max_sequence_len, length_of_unique_words)
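As a quick, optional sanity check, the shapes should line up with the numbers we saw earlier: 51,836 sequences, inputs of length max_sequence_len - 1 = 307, and one-hot targets over the 3,395-word vocabulary.

print(X.shape)  # expected: (51836, 307)
print(y.shape)  # expected: (51836, 3395)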

We can now create the LSTM model that will generate text, as shown in the code block below. We define a function that builds the model: we start with an embedding layer, then add an LSTM layer, a dropout layer, and finally a dense output layer.

We compile the model with the categorical_crossentropy loss and the Adam optimizer.

def create_model(max_sequence_len, length_of_unique_words):
    model = Sequential()
    # Input length is max_sequence_len - 1 because the last word of each sequence is the target
    model.add(Embedding(length_of_unique_words, 10, input_length=max_sequence_len - 1))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    # Softmax over the vocabulary gives a probability for each candidate next word
    model.add(Dense(length_of_unique_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

In the code snippet below, we create the model and print the summary.

model = create_model(max_sequence_len, length_of_unique_words)
model.summary()

# Output:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 307, 10)           33950     
_________________________________________________________________
lstm (LSTM)                  (None, 128)               71168     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 3395)              437955    
=================================================================
Total params: 543,073
Trainable params: 543,073
Non-trainable params: 0

We see that the model contains around 543,000 (roughly 5 lakh) trainable parameters.

Let’s train the model. We train it for 100 epochs.

model.fit(X, y, batch_size = 512, epochs=100)

# Output:
Epoch 1/100
102/102 [==============================] - 26s 52ms/step - loss: 7.3385
Epoch 2/100
102/102 [==============================] - 5s 51ms/step - loss: 6.5646
Epoch 3/100
102/102 [==============================] - 5s 51ms/step - loss: 6.5169
Epoch 4/100
102/102 [==============================] - 5s 51ms/step - loss: 6.4240
Epoch 5/100
102/102 [==============================] - 5s 52ms/step - loss: 6.3452
...

Our model is now trained. Let’s define a function to generate text. The function below repeatedly cleans the current seed text, converts it into a padded sequence of indices, predicts the next word, and appends that word to the seed text.

def generate_text(seed_text, next_words, model, max_seq_len):
    for _ in range(next_words):
        # Clean and index the current seed text exactly like the training data
        # (note: words that are not in the training vocabulary will raise a KeyError)
        cleaned_data = clean_data([seed_text])
        sequences = prepare_corpus(cleaned_data[0], word_to_idx)
        # Keep the longest sequence and pad it to the model's input length
        sequences = pad_sequences([sequences[-1]], maxlen=max_seq_len-1, padding='pre')
        # Predict the index of the most likely next word and look up the word
        predicted = model.predict_classes(sequences, verbose=0)
        output_word = idx_to_word[predicted[0]]
        seed_text = seed_text + " " + output_word

    return seed_text.title()
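One caveat: Sequential.predict_classes has been removed in recent TensorFlow/Keras releases, so the call above may raise an AttributeError on newer versions. If that happens, an equivalent replacement (a small sketch using the same variables as in the function above) is to take the argmax of model.predict yourself:

# Replacement for model.predict_classes(sequences, verbose=0) on newer Keras versions
probabilities = model.predict(sequences, verbose=0)   # shape: (1, vocabulary size)
predicted = np.argmax(probabilities, axis=-1)         # index of the most likely next word
output_word = idx_to_word[predicted[0]]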

Let’s use this function to generate text. We want the model to generate 30 words based on the input provided.

print(generate_text("in Mumbai there we need", 30, model, max_sequence_len))

# Output:
In Mumbai There We Need Hotel Located Distance Km Sahar International Airport Km Vile Parle 
Railway Station Km Bus Stand Popular Tourist Spots Like Gateway India Km Haji Ali Dargah Km Mahalakshmi Temple Km Mount

We just provided some input and the model generated the rest of the text. That is cool! You can try it with other seed sentences too; your exact output may differ since training involves randomness.

The complete code used in this article is given below for quick reference.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
import pandas as pd
import numpy as np
import re
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Embedding

data = pd.read_csv('/content/drive/MyDrive/Datasets/makemytrip_com-travel_sample.csv')
data.head(5)

data.city.value_counts()

array = ['Mumbai']
data = data.loc[data['city'].isin(array)]
data

data = data.hotel_overview
data

data = data.dropna()
data

stop = set(stopwords.words('english'))
def stopwords_removal(data_point):
    data = [x for x in data_point.split() if x not in stop]
    return data

def clean_data(data):
    cleaned_data = []
    all_unique_words_in_each_description = []
    for entry in data:
        entry = re.sub(pattern='[^a-zA-Z]',repl=' ',string = entry)
        entry = re.sub(r'\b\w{0,1}\b', repl=' ',string = entry)
        entry = entry.lower()
        entry = stopwords_removal(entry)
        cleaned_data.append(entry)
        unique = list(set(entry))
        all_unique_words_in_each_description.extend(unique)
    return cleaned_data, all_unique_words_in_each_description

def unique_words(data):
    unique_words = set(data)
    return unique_words, len(unique_words)

cleaned_data, all_unique_words_in_each_description = clean_data(data)
unique_words, length_of_unique_words = unique_words(all_unique_words_in_each_description)

cleaned_data[0]
unique_words
length_of_unique_words

def build_indices(unique_words):
    word_to_idx = {}
    idx_to_word = {}
    for i, word in enumerate(unique_words):
        word_to_idx[word] = i
        idx_to_word[i] = word
    return word_to_idx, idx_to_word

word_to_idx, idx_to_word = build_indices(unique_words)
word_to_idx
idx_to_word

def prepare_corpus(corpus, word_to_idx):
    sequences = []
    for line in corpus:
        tokens = line
        for i in range(1, len(tokens)):
            i_gram_sequence = tokens[:i+1]
            i_gram_sequence_ids = []

            for j, token in enumerate(i_gram_sequence):
                i_gram_sequence_ids.append(word_to_idx[token])

            sequences.append(i_gram_sequence_ids)
    return sequences

sequences = prepare_corpus(cleaned_data, word_to_idx)

max_sequence_len = max([len(x) for x in sequences])
max_sequence_len

print(sequences[0])
print(sequences[1])

print(idx_to_word[893])
print(idx_to_word[2139])
print(idx_to_word[2591])

len(sequences)

def build_input_data(sequences, max_sequence_len, length_of_unique_words):
    sequences = np.array(pad_sequences(sequences, maxlen = max_sequence_len, padding = 'pre'))
    X = sequences[:,:-1]
    y = sequences[:,-1]
    y = np_utils.to_categorical(y, length_of_unique_words)
    return X, y

X, y = build_input_data(sequences, max_sequence_len, length_of_unique_words)

def create_model(max_sequence_len, length_of_unique_words):
    model = Sequential()
    model.add(Embedding(length_of_unique_words, 10, input_length=max_sequence_len - 1))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(length_of_unique_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = create_model(max_sequence_len, length_of_unique_words)
model.summary()

model.fit(X, y, batch_size = 512, epochs=100)

def generate_text(seed_text, next_words, model, max_seq_len):
    for _ in range(next_words):
        cleaned_data = clean_data([seed_text])
        sequences= prepare_corpus(cleaned_data[0], word_to_idx)
        sequences = pad_sequences([sequences[-1]], maxlen=max_seq_len-1, padding='pre')
        predicted = model.predict_classes(sequences, verbose=0)
        output_word = idx_to_word[predicted[0]]
        seed_text = seed_text + " " + output_word
          
    return seed_text.title()

print(generate_text("in Mumbai there we need", 30, model, max_sequence_len))

Final Thoughts

In this article, we have seen what LSTMs are, how they work, how they overcome the challenges faced by RNNs, and finally, we saw how to build a text generator using LSTMs.