
Twitter Sentiment Analysis using NLP and Machine Learning

  • amalpappalilramesh
  • May 16, 2022
  • 3 min read

1. Introduction

The goal of this blog is to train several machine learning models and use the most accurate one to predict the sentiment (positive or negative) of a series of tweets. The high-level steps are shown below:




Modules Used:


To build this, we need to import a few libraries. For this blog I am using a Google Colab notebook, with Python as the programming language.

Step 1 is therefore to import the necessary libraries:




2. Load the Dataset:

Now let us look at the dataset, which is a collection of real tweets. The following figure gives a description of it. The dataset is stored in a CSV file, and the following code loads it into a DataFrame in Python.
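A minimal sketch of the loading step, assuming pandas. The file name `tweets.csv` mentioned in the comment is hypothetical, and a small in-memory sample stands in for the real file so that the snippet runs on its own. The `Text` column name appears later in this post; using `Sentiment` as the name of the label column is an assumption.

```python
import io

import pandas as pd

# Stand-in for the real CSV file. In the notebook this would be something
# like pd.read_csv("tweets.csv"), where the file name is hypothetical.
sample_csv = io.StringIO(
    "Text,Sentiment\n"
    "I love this phone,positive\n"
    "Worst service ever,negative\n"
)

df = pd.read_csv(sample_csv)

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.head())            # first few rows
```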


The dataset has the following columns:




Here is a sample of the data from the dataset:


Now let us take a quick look at how the tweets are distributed between the negative and positive labels:
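One way to inspect this distribution, sketched with pandas `value_counts`. The toy rows and the `Sentiment` label column name are assumptions used only for illustration.

```python
import pandas as pd

# Toy data standing in for the real tweets (illustrative only).
df = pd.DataFrame({
    "Text": ["great day", "awful traffic", "so happy", "i hate mondays", "nice one"],
    "Sentiment": ["positive", "negative", "positive", "negative", "positive"],
})

counts = df["Sentiment"].value_counts()                # absolute counts per label
shares = df["Sentiment"].value_counts(normalize=True)  # fraction per label

print(counts)
print(shares.round(2))
```

If matplotlib is installed, `counts.plot(kind="bar")` turns the same counts into a bar chart.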


3. Data Preprocessing:

This section deals with preprocessing the data. A few methods are used to clean the tweets before we pass them to the NLP steps.


1. Check NaN - Check for null values in the dataset and drop them all. The following code does this:
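A minimal sketch of the null check, assuming pandas and toy data. The column names are placeholders for the real ones.

```python
import pandas as pd

# Toy data with missing values in both columns (illustrative only).
df = pd.DataFrame({
    "Text": ["good morning", None, "bad coffee"],
    "Sentiment": ["positive", "negative", None],
})

# Count missing values per column before dropping anything.
print(df.isnull().sum())

# Drop every row that has at least one missing value and renumber the rows.
df = df.dropna().reset_index(drop=True)
print(len(df))
```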


2. Lowercasing - Convert all tweets to lowercase.



3. Removal of repeating characters - The dataset contains many words with repeated characters. We need to remove the repeats.
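A possible way to do this with a regular expression. This sketch shortens runs of three or more identical characters to two, so that ordinary double letters (as in "good") survive. That threshold is an assumption, and `collapse_repeats` is a hypothetical helper name.

```python
import re


def collapse_repeats(text: str) -> str:
    """Shorten any run of 3 or more identical characters to exactly two."""
    # (.) captures one character, \1{2,} matches two or more further copies.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


print(collapse_repeats("sooooo happyyyy!!!"))
print(collapse_repeats("good"))
```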




4. Removal of URLs/links - The tweets also contain several links and URLs that are irrelevant to the prediction, so we remove them too.
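A small sketch of link removal using Python's `re` module. The pattern covers `http://`, `https://` and `www.` links only, which is an assumption about what the tweets contain; `remove_urls` is a hypothetical helper name.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")


def remove_urls(text: str) -> str:
    """Replace every link with a space, then tidy up the whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


print(remove_urls("check this https://t.co/abc123 now"))
```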



5. Cleaning the numbers - Remove numeric characters from the tweets.
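Assuming that cleaning here means stripping digits, a sketch could look like this. `remove_numbers` is a hypothetical helper name.

```python
import re


def remove_numbers(text: str) -> str:
    """Replace every run of digits with a space, then tidy up the whitespace."""
    without_digits = re.sub(r"\d+", " ", text)
    return " ".join(without_digits.split())


print(remove_numbers("won 3 games in 2022"))
```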





6. Perform Tokenization - with TweetTokenizer


An important step here is tokenizing the tweets. We will use the TweetTokenizer method, as shown below:




Insert the tokenized text into the Tokenized_Text column:
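A minimal sketch of these two steps, assuming NLTK and pandas are installed. `TweetTokenizer` is used with its default settings, which may differ from the notebook's; the toy tweets are illustrative, and the column name follows the spelling used in this step.

```python
import pandas as pd
from nltk.tokenize import TweetTokenizer

# Toy tweets, already lowercased by the earlier preprocessing (illustrative only).
df = pd.DataFrame({
    "Text": ["i love this phone :) #happy", "worst service ever"],
})

# TweetTokenizer keeps emoticons and hashtags together as single tokens.
tokenizer = TweetTokenizer()

# Store the list of tokens for each tweet in a new column.
df["Tokenized_Text"] = df["Text"].apply(tokenizer.tokenize)

print(df["Tokenized_Text"].tolist())
```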



7. Lemmatization

After tokenization, another major preprocessing step is lemmatization. The following block of code performs lemmatization:



Now let us plot the words that appear in negative tweets:
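The original plot is not reproduced here. As a rough sketch, the word counts that such a plot is built from can be computed with `collections.Counter`. The toy token lists and labels are assumptions.

```python
from collections import Counter

# (tokens, label) pairs standing in for the preprocessed tweets (illustrative only).
tweets = [
    (["worst", "service", "ever"], "negative"),
    (["love", "this", "phone"], "positive"),
    (["service", "be", "slow", "again"], "negative"),
]

# Count every token that occurs in a tweet labelled negative.
negative_words = Counter(
    token
    for tokens, label in tweets
    if label == "negative"
    for token in tokens
)

print(negative_words.most_common(3))
```

These counts can then be passed to whichever plotting library is preferred, for example as a bar chart of the most frequent words.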





8. Text feature extraction and model Generation


Once data cleaning is done, we can go ahead with model generation. The following steps need to be performed:

1. Split the data into training and test sets



We create 3 train and test set pairs using the columns 'Text', 'Tokenised_Text' and 'Lemmatised_Text'.
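The three splits could be produced along these lines, assuming scikit-learn's `train_test_split`. The toy data, the 75/25 ratio, the fixed seed and the `Sentiment` label column are all assumptions; the three text column names are the ones listed above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

texts = [
    "i loved the movies", "best days ever", "feeling happy today", "great games tonight",
    "worst flights ever", "i hated the delays", "feeling sad today", "terrible games tonight",
]
df = pd.DataFrame({
    "Text": texts,
    # Placeholders: in the notebook these columns come from steps 6 and 7.
    "Tokenised_Text": texts,
    "Lemmatised_Text": texts,
    "Sentiment": ["positive"] * 4 + ["negative"] * 4,
})

splits = {}
for column in ["Text", "Tokenised_Text", "Lemmatised_Text"]:
    X_train, X_test, y_train, y_test = train_test_split(
        df[column],
        df["Sentiment"],
        test_size=0.25,            # assumed split ratio
        random_state=42,           # same seed, so all three pairs share the same rows
        stratify=df["Sentiment"],  # keep the label balance in both parts
    )
    splits[column] = (X_train, X_test, y_train, y_test)

for column, (X_train, X_test, _, _) in splits.items():
    print(column, len(X_train), len(X_test))
```

Using the same seed for every column means the models are compared on identical rows, so any difference in accuracy comes from the preprocessing rather than from the split.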



4. MODEL SUMMARY

· A total of 11 models were tried out

· Different combinations of preprocessing were applied

· The highest accuracy was achieved by model M8, an SVM model with 87% accuracy
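The exact feature settings of M8 are not stated here, so the following is only a sketch of how an SVM text classifier can be trained and scored with scikit-learn. The TF-IDF features, the `LinearSVC` estimator and the toy data are assumptions, not the configuration that reached 87%.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training and test data (illustrative only).
train_texts = [
    "i love this phone", "what a great day", "love the great weather",
    "worst service ever", "i hate this awful traffic", "awful day worst day",
]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]
test_texts = ["love this great phone", "worst awful service"]
test_labels = ["positive", "negative"]

# Feature extraction and the classifier are chained so that the vectorizer
# is fitted on the training texts only.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```

Swapping the estimator or the vectorizer settings inside `make_pipeline` is one way to try the different model and preprocessing combinations mentioned above.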






Summary:




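· Out of the 11 different models, the SVM with the following combination achieved the highest accuracy

· SVM is not a probabilistic model: it works on a geometric interpretation of the problem, looking for the boundary that separates the two classes with the widest margin.

· How well an SVM generalizes depends on that margin rather than on the number of features, so it copes well with high-dimensional text data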


Test Predictions on the Test Dataset:

Now that we have identified the best model, we apply it to predict the sentiments for the test dataset.




The results show that:

· SVMs consistently achieve good performance on text categorization tasks

· The SVM outperformed the other models tried here

· SVMs largely remove the need for feature selection

· SVMs are more robust than the other methods compared




5. Final Prediction:

Copy the predictions to the Sentiment column in the dataset and export it as a CSV file.


The Sentiment column contains the sentiment predicted by the model.
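A sketch of this export step, assuming pandas. The toy rows and the hard-coded `predictions` list are placeholders for the real test dataset and model output, and an in-memory buffer is used instead of a file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Toy test dataset and placeholder predictions (illustrative only).
test_df = pd.DataFrame({"Text": ["love this phone", "worst service ever"]})
predictions = ["positive", "negative"]  # stands in for the model's predictions

# Copy the predictions into the Sentiment column.
test_df["Sentiment"] = predictions

# Export as CSV without the row index.
buffer = io.StringIO()
test_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To write an actual file, pass a path such as `"predictions.csv"` (a hypothetical name) to `to_csv` in place of the buffer.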











 
 
 
