
Sentiment Analysis with spaCy

Let’s recall what Natural Language Processing (NLP) is: it is broadly defined as the automatic manipulation of natural language, such as speech and text, by software.

What is sentiment analysis?

Sentiment analysis is the use of Natural Language Processing (NLP), Machine Learning (ML), or other data analysis techniques to analyze text data and extract insights from it.

Sentiment analysis is the process of detecting positive, negative, or neutral sentiment in text.

Let’s see where sentiment analysis is used:

  • Measuring customer satisfaction
  • Gauging brand reputation
  • Monitoring social media conversations
  • Evaluating chatbot reactions
  • Finding insights in feedback forms
  • Improving decision-making processes

Let’s look at the customer satisfaction example in more detail. When a customer writes about a product, a human can read that feedback and understand whether it is positive or negative, or at least get a general idea.

But to automate and speed up this process, we use sentiment analysis. With sentiment analysis we can automatically analyze 4000+ reviews about a product. Doesn’t that sound amazing?

Types of Sentiment Analysis

There are different types of sentiment analysis. We will discuss four important types and their popular use cases.

1. Fine-Grained Sentiment Analysis

Fine-grained sentiment analysis gives you a detailed understanding of customer feedback: we can obtain a precise result in terms of the polarity of the input. The labels of a review will be:

  • Very positive
  • Positive
  • Neutral
  • Negative
  • Very negative

So now we can interpret the result on a five-star scale: 5 stars maps to very positive, 4 stars to positive, and so on.
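To make that mapping concrete, here is a minimal illustrative sketch in Python (the star values and label strings are assumptions for illustration, not from any particular library):

# Illustrative mapping from a five-star rating to fine-grained labels
STAR_TO_LABEL = {
    5: "very positive",
    4: "positive",
    3: "neutral",
    2: "negative",
    1: "very negative",
}

def rating_to_sentiment(stars: int) -> str:
    """Translate a 1-5 star rating into a fine-grained sentiment label."""
    return STAR_TO_LABEL[stars]

print(rating_to_sentiment(4))  # positive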

2. Intent Analysis

Intent analysis is a deeper analysis than basic sentiment analysis: here we determine whether the text is a suggestion, a query, or a complaint. This lets us capture the true intent behind the data.

Let’s take an example to understand this clearly: if you run a social media page for a company and someone tags your page, intent analysis can help you find out whether the tags are genuine or whether some people are just messing around. This helps improve social engagement and customer experience.

3. Emotion Detection

Emotion detection is one of the most commonly used forms of sentiment analysis, where we detect the emotion behind the text. The aim is to find out whether a given sentence is happy, sad, angry, or frustrated.

This type of analysis is helpful when we have feedback data and need to analyze whether the product is doing well in the market.

4. Aspect-Based Sentiment Analysis

When we need to analyze a specific aspect of the data, we use aspect-based sentiment analysis. For example, say you launch a product in the market and want to find out which aspect of the product customers liked the most; the aspect could be packaging, price, etc.

Steps to perform sentiment analysis

To perform sentiment analysis, we

  • first need to gather the data
  • then clean the data

Cleaning textual data for Sentiment Analysis

Here we are taking fine-grained sentiment analysis as our example, where we need to classify the labels of movie reviews. The dataset we are going to use is the Sentiment Analysis on Movie Reviews dataset, which you can access on Kaggle.

We start the sentiment analysis by importing the necessary libraries: pandas for data exploration, Plotly for visualization, spaCy for text preprocessing, and scikit-learn for the model. Execute the commands below.

import pandas as pd
import plotly.express as px
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Load spaCy's small English model; nlp() is called during preprocessing
# below, so a model has to be loaded first
nlp = spacy.load('en_core_web_sm')

Importing the dataset and analyzing it to get a basic understanding.

df = pd.read_table('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip')
df[:2]

The data contains 4 features: PhraseId, SentenceId, Phrase, and Sentiment.

Let’s see the information about the dataset by using the info() function of pandas.

df.info()

Let’s see the distribution of sentiment using the value_counts() function of pandas. For each phrase, the model has to predict a sentiment label from 0 to 4. The sentiment labels are:

  • 0 –> Negative
  • 1 –> Less negative
  • 2 –> Neutral
  • 3 –> Less positive
  • 4 –> Positive

df.Sentiment.value_counts()

Let’s visualize the features first.

We will compute the length of each phrase using the len() function and plot a histogram of the result.

df['Length'] = df['Phrase'].apply(len)
px.histogram(df, x='Length', nbins=50)

Most phrase lengths lie in the range of 10 to 20 characters, but some go up to around 250.
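If you want to confirm this with numbers rather than the chart, an optional quick check is to summarize the new Length column:

# Optional: summary statistics (min, quartiles, max) for the phrase lengths
df['Length'].describe()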

Let’s see how the sentiment labels are distributed across our text.

fig = px.histogram(df, x='Sentiment')
fig.update_layout(bargap=0.2)

Now we will create a function, preprocess(), to clean the text and lemmatize the tokens. The cleaning steps are as follows.

First, we remove the digits from the text with the help of isdigit() and join the remaining characters into the nonum variable.

nonum = ''.join([i for i in text if not i.isdigit()])

Next, we pass the text through the nlp object, lowercase every token, and lemmatize it. In spaCy 2.x all pronouns share the placeholder lemma "-PRON-", so for those tokens we keep the lowercased word itself rather than the lemma. The results are stored in tokenized_list.

tokenized_list = nlp(nonum)
# Lemmatizing each token and converting each token into lowercase
tokenized_list = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokenized_list ]   

After removing digits and lemmatizing the tokens, we filter tokenized_list, keeping only the tokens that are neither stop words nor punctuation.

tokenized_list = [ word for word in tokenized_list if word not in stop_words and word not in punctuations ]

Putting all the preprocessing code together:

# stop words and punctuation to filter out
punctuations = string.punctuation
stop_words = STOP_WORDS


def preprocess(text):
    # Removing the digits
    nonum = ''.join([i for i in text if not i.isdigit()])
    # Creating our Doc object, which carries the linguistic annotations
    tokenized_list = nlp(nonum)
    # Lemmatizing each token and converting it to lowercase;
    # pronouns ("-PRON-" in spaCy 2.x) are kept as their lowercased form
    tokenized_list = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokenized_list]
    # Removing stop words and punctuation
    tokenized_list = [word for word in tokenized_list if word not in stop_words and word not in punctuations]
    # Return a preprocessed list of tokens
    return tokenized_list
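One compatibility caveat: the "-PRON-" placeholder lemma exists only in spaCy 2.x; in spaCy 3.x pronouns have real lemmas, so the check above never triggers there. If you are on spaCy 3.x, here is a sketch of an equivalent function that detects pronouns via the POS tag instead (an adaptation, not part of the original walkthrough):

# spaCy 3.x-friendly variant: detect pronouns via the POS tag,
# since the "-PRON-" placeholder lemma was removed in spaCy 3
def preprocess_v3(text):
    nonum = ''.join([i for i in text if not i.isdigit()])
    doc = nlp(nonum)
    # Keep pronouns as their lowercased surface form; lemmatize everything else
    tokens = [word.lemma_.lower().strip() if word.pos_ != "PRON" else word.lower_
              for word in doc]
    # Drop stop words and punctuation
    return [word for word in tokens if word not in stop_words and word not in punctuations]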

Let’s take a sample string to check whether the function works properly.

preprocess("Hello1 ! I am a good boy") 

# Output
['hello', 'good', 'boy']

Now that all the cleaning steps are in place, we will initialize the MultinomialNB() classifier and a TfidfVectorizer that uses our preprocess() function as its tokenizer. We then fit the whole pipeline on the Phrase and Sentiment columns.

classifier = MultinomialNB()
tfidf_vector = TfidfVectorizer(tokenizer = preprocess)

# Create pipeline 
pipe = Pipeline([('tfidf',tfidf_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(df['Phrase'],df['Sentiment'])
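As a quick sanity check before the full evaluation, we can ask the fitted pipeline to label a phrase of our own (the phrase below is made up; the output is one of the 0-4 classes described earlier):

# Predict the sentiment class (0-4) for a new, made-up phrase
print(pipe.predict(["A heartfelt and wonderful film"]))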

Now we will predict labels for the phrases and print a classification report.

Pred = pipe.predict(df['Phrase'])
print(classification_report(df['Sentiment'], Pred)) 
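Note that the report above is computed on the same data the model was trained on, so it reflects training performance rather than generalization. A more honest evaluation would hold out a test set before fitting; here is a minimal sketch using scikit-learn's train_test_split (not part of the original walkthrough):

from sklearn.model_selection import train_test_split

# Hold out 20% of the phrases for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    df['Phrase'], df['Sentiment'], test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))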

Complete code

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import pandas as pd
import plotly.express as px

# Load the small English model used by preprocess()
nlp = spacy.load('en_core_web_sm')

df = pd.read_table(r'D:\Blogs\Internshala\train.tsv')
df[:2]

# stop words and punctuation to filter out
punctuations = string.punctuation
stop_words = STOP_WORDS

def preprocess(text):
    # Removing the digits
    nonum = ''.join([i for i in text if not i.isdigit()])
    # Creating our Doc object, which carries the linguistic annotations
    tokenized_list = nlp(nonum)
    # Lemmatizing each token and converting it to lowercase;
    # pronouns ("-PRON-" in spaCy 2.x) are kept as their lowercased form
    tokenized_list = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokenized_list]
    # Removing stop words and punctuation
    tokenized_list = [word for word in tokenized_list if word not in stop_words and word not in punctuations]
    # Return a preprocessed list of tokens
    return tokenized_list

# classifier
classifier = MultinomialNB()
tfidf_vector = TfidfVectorizer(tokenizer = preprocess)

# Create pipeline 
pipe = Pipeline([('tfidf',tfidf_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(df['Phrase'],df['Sentiment'])

# predict
Pred = pipe.predict(df['Phrase'])
print(classification_report(df['Sentiment'], Pred)) 

Conclusion

We saw the different types of sentiment analysis and worked through a fine-grained sentiment analysis example with spaCy and scikit-learn.