Tokenization is the process of breaking down documents or sentences into chunks called tokens. These tokens are mostly words, characters, or numbers, but they can also be extended to include punctuation marks, symbols, and at times even emoticons.
This is the first step we need to take to build a vocabulary. If you are wondering what a vocabulary means in the context of NLP, it is nothing but the set of unique words in our documents. So a vocabulary consists of every word that appears in our documents, with no repetition.
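As a quick illustration of this idea (the sentence here is just a made-up example), a vocabulary can be built in Python by splitting a document into words and deduplicating them with a set:

```python
# A small "document" to build a vocabulary from.
document = "the capital of Japan is Tokyo and Tokyo is a large city"

# Lowercase and split on whitespace, then deduplicate with a set.
tokens = document.lower().split()
vocabulary = set(tokens)

print(sorted(vocabulary))
# ['a', 'and', 'capital', 'city', 'is', 'japan', 'large', 'of', 'the', 'tokyo']
```

Note that "Tokyo" appears twice in the document but only once in the vocabulary.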
Tokenization is the fundamental first step in almost any text-preprocessing activity.
Think of tokenization as a segmentation technique. You segment a large piece into smaller pieces. In a similar manner, here you are taking large chunks of text and are trying to break them down into smaller meaningful chunks.
A word is a token in a sentence, and a sentence is a token in a paragraph.
This brings us to another important point and the point is that each token will carry some meaning. Each token will contribute to the meaning of the overall text. This point will become clearer when we go through some practical examples.
Let us take a sentence as an example to see how this works out.
“The capital of Japan is Tokyo”
If we want to split the sentence into its constituent words, we would expect the output to be something like this.
“The”, “capital”, “of”, “Japan”, “is”, “Tokyo”
Implementing Tokenization in Python
Open a new Colab notebook or fire up your favorite IDE. If you are new to Colab and don’t know how to use Colab, I would suggest that you read the previous tutorial before moving forward. It will prove beneficial.
In the code below, we use Python's built-in split() method.
sentence = "The capital of Japan is Tokyo"
sentence.split()
In the code snippet above, we are calling the split() method on the string. By default, it splits the string on whitespace, i.e. " ".
The output is as follows
Output: ['The', 'capital', 'of', 'Japan', 'is', 'Tokyo']
As we can see, the sentence is split into its constituent words. This is tokenization using the basic split function. Each token is carrying a specific meaning.
However, there is a drawback with using the split() method to tokenize text. Consider the following sentence.
sentence = "I'm going to travel to Tokyo"
sentence.split()

Output: ["I'm", 'going', 'to', 'travel', 'to', 'Tokyo']
Observe the output.
As you can see, the word "I'm" is a contraction of the two words "I" and "am", joined by an apostrophe. But when the sentence is split, "I'm" comes through unchanged as a single token. The split() method doesn't break up the contraction and returns it as it is, which is not what we expect.
Thus, the split() method falls short whenever the text contains apostrophes and contractions.
Tokenization using the NLTK library
Now let us utilize the NLTK library and see how we can tokenize text with it. In the previous section, we were introduced to this library. NLTK has several tokenizers. We will be using two of them here.
First, we will use a tokenizer called the Treebank tokenizer.
1. Using the Treebank tokenizer
This is a very popular tokenizer. Here, words are split mostly based on punctuation. This overcomes the shortcoming of the split() method by separating punctuation from words.
Let’s tokenize a sentence.
from nltk.tokenize import TreebankWordTokenizer
sentence = "I'm going to fly to Tokyo and the ticket cost me around $1000"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)
In the above code snippet, we first import the Treebank tokenizer provided by the NLTK library. Then we define our sentence and initialize the tokenizer. Finally, we tokenize the sentence. Let’s take a look at the output.
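Running the same snippet end to end should produce output along these lines (note how the contraction and the currency symbol are handled):

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "I'm going to fly to Tokyo and the ticket cost me around $1000"
tokenizer = TreebankWordTokenizer()

# The Treebank rules split the contraction ("I'm" -> "I", "'m")
# and separate the currency symbol from the number.
print(tokenizer.tokenize(sentence))
# ['I', "'m", 'going', 'to', 'fly', 'to', 'Tokyo', 'and', 'the',
#  'ticket', 'cost', 'me', 'around', '$', '1000']
```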
As you can see from the output, it does a better job than the basic split() method by correctly separating punctuation and contractions. This is an improvement.
2. Using the TweetTokenizer
Now, let us look at an interesting tokenizer provided by the NLTK package called the TweetTokenizer. It is used to preprocess mainly social media texts and tweets.
How is the TweetTokenizer useful?
You must have noticed on Twitter or any other social media platform that people tag each other using their social media handles and they also use emoticons, hashtags, etc. To help us parse such complex texts and make things more understandable, the TweetTokenizer plays an important role.
Let us go through a practical example to get things into perspective.
@elton_landers the new tech launched this season is dopeeeeeee!!! :-D #techlover #newseason <3
Let us try to parse the above sentence using the TweetTokenizer.
from nltk.tokenize import TweetTokenizer
sentence = "@elton_landers the new tech launched this season is dopeeeeeee!!! :-D #techlover #newseason <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(sentence)
In the above code snippet, we have imported the TweetTokenizer from the NLTK library. Then, we have assigned our sentence to a variable. Next, we initialized the TweetTokenizer we imported in step one.
Finally, we tokenize the sentence. Let’s check the output.
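Running the snippet should give output along these lines:

```python
from nltk.tokenize import TweetTokenizer

sentence = "@elton_landers the new tech launched this season is dopeeeeeee!!! :-D #techlover #newseason <3"
tokenizer = TweetTokenizer()

# Handles, hashtags, and emoticons survive as single tokens;
# the run of exclamation marks becomes separate tokens.
print(tokenizer.tokenize(sentence))
# ['@elton_landers', 'the', 'new', 'tech', 'launched', 'this', 'season',
#  'is', 'dopeeeeeee', '!', '!', '!', ':-D', '#techlover', '#newseason', '<3']
```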
As we can observe in the output, the TweetTokenizer tokenizes the sentence into separate tokens such as emoticons, handles, etc.
This is a great tokenizer when working with social media data.
Now, let us look at what else this tokenizer can do. There is a parameter available to us known as the “strip_handles” parameter. By setting this parameter to True, we can remove handles from the tweets.
If you notice the word "dopeeeeeee" in the sentence, the letter "e" is repeated several times. Letters are sometimes repeated to show enthusiasm and excitement on social media. We can clip the word and trim the repeated letters by setting the parameter "reduce_len=True".
sentence = "@elton_landers the new tech launched this season is dopeeeeeee!!! :-D #techlover #newseason <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(sentence)
As we can notice in the code snippet above, we have set the "strip_handles" and "reduce_len" parameters to True. Let us check the output.
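With both parameters set, the run should produce output along these lines:

```python
from nltk.tokenize import TweetTokenizer

sentence = "@elton_landers the new tech launched this season is dopeeeeeee!!! :-D #techlover #newseason <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

# The handle is removed, and runs of repeated letters are
# capped at three characters ("dopeeeeeee" -> "dopeee").
print(tokenizer.tokenize(sentence))
# ['the', 'new', 'tech', 'launched', 'this', 'season', 'is',
#  'dopeee', '!', '!', '!', ':-D', '#techlover', '#newseason', '<3']
```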
If you compare the above output with the previous one, you will notice that the handle is stripped and the word with repeating letters is clipped: by default, reduce_len trims any run of repeated letters down to three characters.
In this article, we have gone through the meaning and importance of tokenization. Then, with the help of the NLTK library, we have seen how to tokenize text efficiently by utilizing two of its tokenizers: the Treebank tokenizer and the TweetTokenizer.
Thanks for reading.