Twitter Sentiment Analysis using NLP and Machine Learning
- amalpappalilramesh
- May 16, 2022
- 3 min read
1. Introduction
The main objective of this blog is to train a machine learning model and predict the sentiment (positive/negative) of a series of Twitter tweets using the most accurate model. The high-level steps are shown below:

Modules Used:
To develop this, we need to import some libraries. For this blog, I am using a Google Colab notebook, with Python as the programming language.
Hence, step 1 is to import the necessary libraries:

2. Load the Dataset:
Now let us have a look at the dataset in use, which is a collection of real Twitter tweets. The dataset is stored in a CSV file, and the following code loads it into a DataFrame in Python. The following figure gives a description of the dataset.
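Here is a minimal sketch of the loading step, assuming pandas is the DataFrame library. The file name in the comment and the three sample rows are made up for illustration, and an in-memory CSV stands in for the real file so that the snippet runs on its own. The `Text` and `Sentiment` column names follow the ones mentioned later in this post.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the rows are invented sample data.
sample_csv = io.StringIO(
    "Text,Sentiment\n"
    "I love this phone,positive\n"
    "Worst service ever,negative\n"
    "What a great day,positive\n"
)

# With a real file this would be pd.read_csv("tweets.csv")
# ("tweets.csv" is a hypothetical file name).
df = pd.read_csv(sample_csv)

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names
print(df.head())            # first few rows
```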

The dataset has the following columns:

Here is some sample data from the dataset:

Now let us have a quick look at the distribution of tweets labelled as negative and positive:
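One way to get this distribution is to count the labels, assuming pandas and a label column named `Sentiment`. The five rows below are toy data, not the real dataset.

```python
import pandas as pd

# Toy data; the real dataset is much larger.
df = pd.DataFrame({
    "Text": ["love it", "hate it", "so good", "so bad", "great stuff"],
    "Sentiment": ["positive", "negative", "positive", "negative", "positive"],
})

# Absolute number of tweets per label.
counts = df["Sentiment"].value_counts()
print(counts)

# Share of each label, useful for spotting class imbalance.
shares = df["Sentiment"].value_counts(normalize=True)
print(shares)

# In a notebook, counts.plot(kind="bar") would draw the same
# distribution as a bar chart (requires matplotlib).
```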

3. Data Preprocessing:
This section deals with preprocessing the data. A few methods are used to preprocess the data before we pass it to the NLP engines.
1. Check NaN - Check for null values in the dataset and drop them all. The following code does it:
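A minimal sketch of the null check, assuming pandas. The toy rows and the choice to drop any row with a missing value in any column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values.
df = pd.DataFrame({
    "Text": ["good day", None, "bad day", np.nan],
    "Sentiment": ["positive", "negative", None, "negative"],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value,
# then renumber the remaining rows from zero.
df_clean = df.dropna().reset_index(drop=True)
print(df_clean)
```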

2. Lowercasing - Convert all tweets to lowercase
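With pandas this step can be a single call on the text column. The column name `Text` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy data with mixed casing.
df = pd.DataFrame({"Text": ["I LOVE This Phone", "Worst Service EVER"]})

# .str.lower() applies lowercasing to every row of the column.
df["Text"] = df["Text"].str.lower()

print(df["Text"].tolist())
```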

3. Removal of repeating characters - The dataset contains many runs of repeated characters. We need to remove them all.
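A possible sketch using a regular expression. The rule used here (cut any run of three or more identical characters down to two) is an assumption; the exact rule applied to the dataset may differ. The helper name `reduce_repeats` is hypothetical.

```python
import re

# A character followed by two or more copies of itself.
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")


def reduce_repeats(text: str) -> str:
    """Cut runs of 3+ identical characters down to 2 (assumed rule)."""
    return REPEAT_PATTERN.sub(r"\1\1", text)


print(reduce_repeats("sooooo haaappy!!!!"))  # soo haappy!!
print(reduce_repeats("good"))                # good (double letters are kept)
```

Keeping two characters rather than one avoids damaging ordinary words with double letters, such as "good" or "happy".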

4. Removal of URLs/links - The tweets also contain several irrelevant links and URLs, which we won't need for the prediction. Hence we remove them too.
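One way to strip links is a regular expression that matches anything starting with `http://`, `https://` or `www.`. The pattern and the helper name `remove_urls` are assumptions for illustration and will not catch every possible link format.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")


def remove_urls(text: str) -> str:
    """Delete links, then collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()


print(remove_urls("check this https://t.co/abc123 now"))  # check this now
print(remove_urls("visit www.example.com"))               # visit
```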

5. Cleaning the Numbers
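A minimal sketch, assuming that cleaning the numbers means deleting every run of digits from the tweet. The helper name `remove_numbers` is hypothetical.

```python
import re


def remove_numbers(text: str) -> str:
    """Delete all digits, then collapse the leftover whitespace."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()


print(remove_numbers("i waited 45 minutes for 2 coffees"))
# i waited minutes for coffees
```

Note that this also strips digits inside words, so "gr8" becomes "gr". Whether that is desirable depends on the dataset.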

6. Perform tokenization - with TweetTokenizer
One of the important steps here is to tokenize the tweets. Here we will use the TweetTokenizer method, as shown below:
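A short sketch of tokenization with NLTK's `TweetTokenizer`. The options `strip_handles=True` and `reduce_len=True` and the sample tweets are assumptions for illustration; the settings used for the real dataset are not stated.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles removes @mentions; reduce_len shortens runs of
# 3+ repeated characters to exactly 3.
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

# Invented sample tweets.
tweets = [
    "@remy: This is waaaaayyyy too much for you!!!!!!",
    "loving the new update :) #happy",
]

# One list of tokens per tweet.
tokenized = [tokenizer.tokenize(tweet) for tweet in tweets]

for tokens in tokenized:
    print(tokens)
```

Unlike a plain whitespace split, this tokenizer keeps emoticons and hashtags together as single tokens, which is why it suits tweets.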

Insert the tokenized text into the Tokenized_Text column:

7. Lemmatization
After tokenization, another major tweet preprocessing step is lemmatization. The following block of code performs lemmatization:
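A possible sketch of this step, assuming NLTK's `WordNetLemmatizer` is the lemmatizer. The helper names `lemmatize_tokens` and `load_wordnet_lemmatizer` are hypothetical. The WordNet corpus has to be downloaded before first use, so the NLTK part is wrapped in a function and the usage is left as a comment.

```python
from typing import Iterable, List


def lemmatize_tokens(tokens: Iterable[str], lemmatizer, pos: str = "n") -> List[str]:
    """Return the lemma of every token.

    `lemmatizer` is any object with a lemmatize(word, pos) method,
    such as NLTK's WordNetLemmatizer.
    """
    return [lemmatizer.lemmatize(token, pos) for token in tokens]


def load_wordnet_lemmatizer():
    """Build NLTK's WordNet lemmatizer, fetching the corpus if needed."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    # Needs network access the first time; some NLTK versions
    # also require nltk.download("omw-1.4").
    nltk.download("wordnet", quiet=True)
    return WordNetLemmatizer()


# Typical use (left commented out because it needs the WordNet corpus):
# lemmatizer = load_wordnet_lemmatizer()
# lemmatize_tokens(["cats", "feet", "tweets"], lemmatizer)
```

The `pos` argument tells the lemmatizer which part of speech to assume: `"n"` treats every token as a noun, while `"v"` would treat it as a verb.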

Now let us plot the words that appear in the negative statements:

8. Text feature extraction and model generation
Once we are done with data cleaning, we can go ahead with model generation. The following steps need to be performed:
1. Split the data into training and test sets

We create 3 train and test set pairs using the columns 'Text', 'Tokenised_Text' and 'Lemmatised_Text'.
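The three pairs can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the fixed `random_state`, the stratification and the toy rows are all assumptions for illustration. The lemmatised column is just a copy of the tokenised one here, to keep the example short.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 invented tweets with alternating labels.
texts = ["I love it", "I hate it", "So good", "So bad", "Great stuff",
         "Awful stuff", "Very nice", "Very poor", "Loved this", "Hated this"]
df = pd.DataFrame({"Text": texts, "Sentiment": ["positive", "negative"] * 5})
df["Tokenised_Text"] = df["Text"].str.lower().str.split()
df["Lemmatised_Text"] = df["Tokenised_Text"]  # placeholder, no real lemmatization

# One (X_train, X_test, y_train, y_test) tuple per input column.
splits = {}
for column in ["Text", "Tokenised_Text", "Lemmatised_Text"]:
    splits[column] = train_test_split(
        df[column],
        df["Sentiment"],
        test_size=0.2,             # assumed ratio
        random_state=42,           # same seed, so all pairs share the same rows
        stratify=df["Sentiment"],  # keep the label balance in both sets
    )

for column, (X_train, X_test, y_train, y_test) in splits.items():
    print(column, len(X_train), len(X_test))
```

Using the same seed for all three splits means every model is evaluated on the same tweets, so their accuracies can be compared fairly.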
4. Model Summary:
· A total of 11 models were tried out
· Different combinations of preprocessing were applied
· The highest accuracy, 87%, was achieved by model M8, which is an SVM model

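Summary:

· Out of the 11 different models, the SVM with the following combination achieved the highest accuracy
· SVM is not a probabilistic method: it works on a geometric interpretation of the problem, looking for the hyperplane that separates the two classes with the widest margin.
· Its ability to learn is largely independent of the number of dimensions, which suits the sparse, high-dimensional features produced from text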
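To make the SVM step concrete, here is a minimal sketch with scikit-learn. This is not the exact M8 configuration, which is not spelled out here: the TF-IDF vectorizer, the `LinearSVC` classifier with default settings, and the toy tweets are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data; the real model is trained on the full dataset.
train_texts = [
    "i love this phone", "what a great day",
    "this is awesome", "so happy with the service",
    "i hate this phone", "what a terrible day",
    "this is awful", "so angry with the service",
]
train_labels = ["positive"] * 4 + ["negative"] * 4
test_texts = ["i love this day", "this service is terrible"]
test_labels = ["positive", "negative"]

# Text feature extraction (TF-IDF) followed by a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# Predict the held-out tweets and score the result.
predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print(predictions)
print(accuracy)
```

Wrapping both stages in one pipeline ensures the vectorizer is fitted on the training tweets only and then reused unchanged for prediction.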
Test Predictions on the Test Dataset:
Now that we have identified the best model, we will apply it to predict the sentiments for the test dataset.

The results show that:
· SVMs consistently achieve good performance on text categorization tasks
· SVMs outperform existing methods substantially and significantly
· SVMs eliminate the need for feature selection
· SVMs are more robust than other methods
5. Final Prediction:
Copy the predictions to the Sentiment column in the dataset and export it as a CSV file.
The Sentiment column provides the predicted sentiment by the model.
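A minimal sketch of the export, assuming pandas. The hard-coded `predictions` list stands in for the model output, and an in-memory buffer replaces the real output file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Toy test set; the labels below stand in for model.predict(...) output.
test_df = pd.DataFrame({"Text": ["nice one", "not good"]})
predictions = ["positive", "negative"]

# Copy the predictions into the Sentiment column.
test_df["Sentiment"] = predictions

# Export as CSV. With a real file this would be
# test_df.to_csv("predictions.csv", index=False)
# ("predictions.csv" is a hypothetical file name).
buffer = io.StringIO()
test_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the DataFrame's row numbers out of the exported file.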
