
Twitter Sentiment Analysis using NLP and Machine Learning

  • amalpappalilramesh
  • May 16, 2022
  • 3 min read

1. Introduction

The main objective of this blog is to train several machine learning models and predict the sentiment (positive/negative) of a series of Twitter tweets using the most accurate one. The high-level steps are shown below:



[Figure: high-level workflow]

Modules Used:


To develop this, we need to import some libraries. For this blog I am using a Google Colab notebook, which uses Python as the programming language.

Hence, step 1 is to import the necessary libraries:



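The import cell is only shown as a screenshot in the original post; a typical set for this workflow might look like the following (the exact list may differ):

```python
# A typical import cell for this workflow (the exact list in the
# original notebook may differ):
import re                # regex-based tweet cleaning
import numpy as np       # numeric helpers
import pandas as pd      # loading and manipulating the dataset

# The notebook also pulls in NLP and modelling libraries, e.g.:
# from nltk.tokenize import TweetTokenizer
# from sklearn.svm import LinearSVC
```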

2. Load the Dataset:

Now let us have a look at the dataset in use. The following figure gives a description of the dataset. The dataset is a collection of real Twitter tweets stored in a CSV file, and the following code loads it into a pandas DataFrame.


[Figure: dataset description]
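The loading code is only shown as an image, so the file name and column names below are stand-ins; an in-memory CSV keeps the sketch self-contained:

```python
import io
import pandas as pd

# A two-row stand-in for the real CSV; the actual file name and
# column names come from the original dataset and may differ.
csv_data = io.StringIO(
    "Text,Sentiment\n"
    "I love this phone,positive\n"
    "worst service everrr http://t.co/x,negative\n"
)
df = pd.read_csv(csv_data)   # in the notebook: pd.read_csv('<your file>.csv')
print(df.shape)              # (2, 2)
```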

The dataset has the following columns:


[Table: dataset columns]


Here is some sample data from the dataset:

[Table: sample rows from the dataset]

Now let us take a quick look at the distribution of tweets labelled negative and positive:


[Figure: distribution of positive and negative tweets]
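The counting itself can be sketched with pandas' `value_counts` (the tweets below are made up for illustration):

```python
import pandas as pd

# Made-up labels standing in for the real Sentiment column.
df = pd.DataFrame(
    {"Sentiment": ["positive", "negative", "negative", "positive", "negative"]}
)
counts = df["Sentiment"].value_counts()
print(counts)
# negative    3
# positive    2
```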

3. Data Preprocessing:

This section deals with preprocessing the data. A few methods are used to preprocess the data before we pass it to the NLP pipeline.


1. Check NaN - Check for null values in the dataset and drop them all. The following code does this:

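The original code is only an image; a minimal sketch of the drop-nulls step, on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Text": ["good day", np.nan, "bad day"]})
df = df.dropna().reset_index(drop=True)  # drop rows with any null values
print(len(df))  # 2
```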

2. Lowercasing - Convert all tweets to lowercase


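A minimal sketch of the lowercasing step (the column name is assumed from the dataset description):

```python
import pandas as pd

df = pd.DataFrame({"Text": ["LOVE It", "So BAD"]})
df["Text"] = df["Text"].str.lower()  # normalize case across all tweets
print(df["Text"].tolist())  # ['love it', 'so bad']
```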

3. Removal of repeating-characters - It has been identified that there are several repeating characters (e.g. "soooo") in the dataset. We need to remove them all:


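One common way to do this is to collapse any character repeated three or more times down to two occurrences; the exact rule in the original notebook is not visible, so this is an assumption:

```python
import re
import pandas as pd

def collapse_repeats(text):
    # Collapse 3+ repeats of a character to two, e.g. "soooo" -> "soo".
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

df = pd.DataFrame({"Text": ["soooo goooood", "fine"]})
df["Text"] = df["Text"].apply(collapse_repeats)
print(df["Text"].tolist())  # ['soo good', 'fine']
```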


4. Removal of URLs/links - The tweets also contain several irrelevant links and URLs that we won't need for the prediction, so we remove them too:


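A sketch of the URL-stripping step; the regex below is a typical choice, not necessarily the one used in the original notebook:

```python
import re
import pandas as pd

def remove_urls(text):
    # Strip http(s) links and www-style URLs (hypothetical pattern).
    return re.sub(r"(https?://\S+|www\.\S+)", "", text).strip()

df = pd.DataFrame({"Text": ["check this http://t.co/abc now", "plain tweet"]})
df["Text"] = df["Text"].apply(remove_urls)
print(df["Text"].tolist())
```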

5. Cleaning the Numbers


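The number-cleaning code is only an image; one plausible version simply deletes digit runs:

```python
import pandas as pd

df = pd.DataFrame({"Text": ["won 100 dollars in 2022", "no digits here"]})
# Delete digit runs; the blog's exact pattern is not shown.
df["Text"] = df["Text"].str.replace(r"\d+", "", regex=True)
print(df["Text"].tolist())  # ['won  dollars in ', 'no digits here']
```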



6. Perform Tokenization - with TweetTokenizer


One of the important steps here is tokenizing the tweets. We will use NLTK's TweetTokenizer, as shown below:




Insert the tokenized text into the Tokenized_Text column:


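A sketch of both steps together; the tokenizer options (`preserve_case`, `reduce_len`) are assumptions, since the original cell is only an image:

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps Twitter-specific tokens (handles, hashtags,
# emoticons) intact, unlike a plain whitespace split.
tok = TweetTokenizer(preserve_case=False, reduce_len=True)

df = pd.DataFrame({"Text": ["@user loved it #happy :)"]})
df["Tokenized_Text"] = df["Text"].apply(tok.tokenize)
print(df["Tokenized_Text"][0])
# ['@user', 'loved', 'it', '#happy', ':)']
```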

7. Lemmatization

After tokenization, another major tweet-preprocessing step is lemmatization. The following block of code performs it:


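The usual choice here is NLTK's `WordNetLemmatizer`, which needs the `wordnet` corpus downloaded. As a self-contained stand-in, this sketch applies a couple of naive suffix rules just to illustrate the shape of the step; real lemmatization is vocabulary-aware and far more accurate:

```python
def naive_lemma(token):
    # Naive suffix stripping as an illustration only; with NLTK this
    # would be WordNetLemmatizer().lemmatize(token).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["loved", "cats", "running", "it"]
print([naive_lemma(t) for t in tokens])  # ['lov', 'cat', 'runn', 'it']
```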


Now let us plot the words that appear in negative tweets:


[Figure: most frequent words in negative tweets]
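The frequencies behind such a plot can be computed with a `Counter` (the token lists below are made up):

```python
from collections import Counter

# Token lists standing in for the tokenized negative tweets.
negative_tokens = [["bad", "service", "bad"], ["worst", "service"]]
counts = Counter(t for tweet in negative_tokens for t in tweet)
print(counts.most_common(2))  # [('bad', 2), ('service', 2)]
```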



8. Text feature extraction and model generation


Once data cleaning is done, we can go ahead with model generation. The following steps need to be performed:

1. Split the data into training and test sets



We create three train/test set pairs using the columns 'Text', 'Tokenised_Text' and 'Lemmatised_Text'.
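The split, feature extraction and model fitting can be sketched as follows. The corpus is made up, and the specific choices (`TfidfVectorizer`, `LinearSVC`, `test_size`) are assumptions, since the actual code is only shown as images; the post does report an SVM as its best model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Tiny made-up corpus; the notebook repeats this for each of the
# 'Text', 'Tokenised_Text' and 'Lemmatised_Text' columns.
texts = ["love it", "great phone", "hate it", "worst phone",
         "really love this", "really hate this"]
labels = ["positive", "positive", "negative", "negative",
          "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

vec = TfidfVectorizer()                      # bag-of-words TF-IDF features
model = LinearSVC()                          # linear support vector machine
model.fit(vec.fit_transform(X_train), y_train)
preds = model.predict(vec.transform(X_test))
print(list(preds))
```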



4. MODEL SUMMARY

· A total of 11 models were tried out

· Different combinations of preprocessing steps were used

· The highest accuracy was achieved by model M8, an SVM model, at 87%




[Table: accuracy comparison of the 11 models]


Summary:


[Table: best-performing model configuration]


· Out of the 11 different models, the SVM with this combination of preprocessing steps achieved the highest accuracy

· SVM takes a geometric approach: it finds the maximum-margin hyperplane that separates the classes

· The model copes well with high-dimensional feature spaces, which suits sparse text features


Test Predictions on the Test Dataset:

Now that we have identified the best model, we will apply it to predict the sentiment of each tweet in the test dataset.


[Figure: sentiment predictions on the test dataset]


The results show that:

· SVMs consistently achieve good performance on text categorization tasks

· SVMs substantially and significantly outperform the other methods tried here

· SVMs eliminate the need for feature selection

· SVMs are more robust than the other methods compared




5. Final Prediction:

Copy the predictions into the Sentiment column of the dataset and export it as a CSV file.


The Sentiment column holds the sentiment predicted by the model.

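A sketch of the export step; writing to an in-memory buffer keeps it self-contained, and the file name in the comment is hypothetical:

```python
import io
import pandas as pd

df = pd.DataFrame({"Text": ["love it", "hate it"]})
df["Sentiment"] = ["positive", "negative"]   # predictions from the model

# In the notebook this would be df.to_csv('predictions.csv', index=False).
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # Text,Sentiment
```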




