Machine Learning - Clustering on Used Cars Data

amalpappalilramesh
Mar 6, 2022
Updated: Mar 10, 2022

1. Domain Description

The used car industry has grown continuously over the past few years. Buyers are increasingly attracted to used cars rather than new ones because of their affordability. The domain I have chosen is the automotive domain, specifically used cars in India, and I will carry out several experiments as part of this study.

During the last decade, the production of automobiles has increased dramatically. People appear to have a natural desire to own and use cars, which are seen as symbols of status, self-affirmation, and power. (CarDekho.com, 2018)

With the increased use of cars, the used car market has also grown, as used cars are cheaper than new ones. Used car sellers and online portals are taking advantage of the rising demand by listing unrealistic prices. As a result, it becomes necessary to estimate a used car's price with machine learning algorithms and techniques in order to help customers.

2. Problem Definition

This assessment report aims to detect the features that affect the price of used cars, and experiments are performed to find an optimal algorithm for used car price prediction.

Because prices are usually determined by many distinct features and factors, accurate car price prediction requires expert knowledge. The most significant factors are usually the brand and model, age, power, and mileage. The price is also affected by features such as the number of doors, transmission type, safety, and air conditioning. In this report, I apply different methods and techniques to achieve higher precision in used car price prediction.

The main steps involved would be the following:

3. Data Set Description

The data set in scope is an extract from Kaggle (Cardekho.com). The details are:

Data Set View

Fig. 3.1-Dataset View

We can get detailed information on the dataset with the following command in the Colab notebook: CarsDF.info()

Fig 3.2 – Dataset Info

After a detailed analysis of the columns and the data values in each of them, I have segregated the dataset according to the types below:

4. Data Set Exploration – Wrangling and Cleaning

Data pre-processing is performed to clean the dataset by removing null and missing values, which reduces the number of instances. The missing values in the dataset are handled by dropping the affected instances. As the selected dataset is large, dropping a few cases doesn't affect the performance of the models.[6]

1. Drop the unwanted columns from the data frame

The first step is to drop the columns that do not add any value to the data. These were identified earlier and can now be removed with a short piece of Python code.
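A minimal sketch of this step with pandas follows. The tiny data frame and the column names `source_url` and `new_price` are hypothetical placeholders, since the columns actually dropped are listed only in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the raw data frame; the real columns differ.
CarsDF = pd.DataFrame({
    "full_name": ["Maruti Alto", "Hyundai i20"],
    "selling_price": ["1.2 Lakh", "5.5 Lakh"],
    "source_url": ["https://example.com/a", "https://example.com/b"],
    "new_price": [None, None],
})

# Columns assumed (for illustration only) to add no value to the analysis.
unwanted_columns = ["source_url", "new_price"]

# errors="ignore" skips any listed column that is not present.
CarsDF = CarsDF.drop(columns=unwanted_columns, errors="ignore")
print(CarsDF.columns.tolist())
```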

2. Get the null count details of the columns:

The next step is to get the null count percentages of the new dataset from step 1. I used some Python code, and the output is as follows:

Fig 4.1- Null value counts
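Null percentages of this kind can be computed with a few pandas calls. The sketch below uses a small made-up data frame (the column names and values are assumptions) and reports the share of missing values per column, the columns that contain NaN, and the share of rows with at least one NaN.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing values.
df = pd.DataFrame({
    "year": [2015, np.nan, 2018, 2012],
    "fuel_type": ["Petrol", "Diesel", None, "Petrol"],
    "km_driven": [40000, 52000, 61000, 30000],
})

# Percentage of missing values per column.
null_pct = df.isnull().mean() * 100

# Columns that contain at least one NaN.
cols_with_nan = df.columns[df.isnull().any()].tolist()

# Percentage of rows that contain at least one NaN.
rows_with_nan_pct = df.isnull().any(axis=1).mean() * 100

print(null_pct)
print(cols_with_nan, rows_with_nan_pct)
```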

On deeper analysis, it was also found that the percentage of NaN values is high.

Since the percentage of null rows is too high (more than 50%), we need to analyse further before taking steps to impute them. Hence, we check which columns contain NaN.

The results showed that NaN values can be found in all columns.

3. Remove the rows with NaN values

The next step is to drop all rows with NaN values.

This will reduce the data set size from 20,026 rows to 19,980 rows.

4. Manipulate the price and cost columns to show cost values

Initially, the price columns held data in mixed units of numbers and text. Our purpose is to convert them to a common format that expresses the price of the vehicle as a number. After the conversion, the selling price column contains numeric values only.
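One possible way to do such a conversion is sketched here. The input formats ('5.5 Lakh', '1.2 Cr', '45,000') are assumptions about how the raw prices look, and `price_to_number` is a hypothetical helper rather than the notebook's code.

```python
import re
import pandas as pd

# Multipliers for Indian numbering units (1 lakh = 100,000; 1 crore = 10,000,000).
UNIT_MULTIPLIERS = {"lakh": 1e5, "cr": 1e7, "crore": 1e7}

def price_to_number(text):
    """Convert strings such as '5.5 Lakh' or '1.2 Cr' to a float in rupees."""
    if pd.isna(text):
        return float("nan")
    # Remove thousands separators and footnote markers, then normalise case.
    cleaned = str(text).replace(",", "").replace("*", "").strip().lower()
    match = re.match(r"(\d+(?:\.\d+)?)\s*([a-z]*)", cleaned)
    if match is None:
        return float("nan")
    value, unit = float(match.group(1)), match.group(2)
    # Unknown or missing units leave the number unchanged.
    return value * UNIT_MULTIPLIERS.get(unit, 1.0)

prices = pd.Series(["5.5 Lakh", "1.2 Cr", "45,000", None])
numeric_prices = prices.apply(price_to_number)
print(numeric_prices)
```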

5. Removal of Duplicate Data:

The dataset contains a lot of duplicate data. Before deleting duplicate rows, the dataset has 19,852 rows.

Total number of duplicate rows: [157 rows x 14 columns]

After dropping duplicates: [19695 rows × 14 columns]
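Counting and removing duplicates can be sketched with pandas as follows. The three-row data frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Maruti Alto", "Hyundai i20", "Maruti Alto"],
    "year": [2015, 2018, 2015],
    "selling_price": [120000, 550000, 120000],
})

rows_before = len(df)

# Rows that repeat an earlier row exactly (the first occurrence is not flagged).
duplicate_rows = df[df.duplicated()]

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(rows_before, len(duplicate_rows), len(df))
```

The counts reported above are consistent with this behaviour: 19,852 rows minus 157 duplicates leaves 19,695 rows.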

6. Outlier data identification

To identify outliers, we look for values that are less than the lower bound or greater than the upper bound. We will use winsorization [4] to handle the outliers.
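The bounds used in the notebook are not shown here, so the sketch below assumes the common interquartile-range rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are capped at those bounds. The helper name `winsorize_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def winsorize_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper), lower, upper

# Made-up values with one obvious outlier at the end.
kms_driven = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 1000])
capped, lower, upper = winsorize_iqr(kms_driven)
print(lower, upper, capped.tolist())
```

Capping keeps the row in the dataset, unlike dropping it, which matters when other columns of the same row are still informative.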

7. Find Correlation between the independent features:

We can use the corr() method and the result is as follows:

Pair Plot:

Detailed correlation and heat map experiments on the features are described in section 6.0 Feature Selection.

8. Identify Skewness

Skewness plot:

To clean the data further, winsorization was performed on columns such as mileage and kms_driven (details and code are in the notebook).

Label Encoding - Encoding refers to converting labels into a numeric, machine-readable form [4]. Machine learning algorithms can then decide in a better way how those labels must be operated on. The original categorical columns have also been removed.

As a result of label encoding, a new set of columns is created that holds the encoded values of the categorical variables.
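A minimal sketch of label encoding with scikit-learn's `LabelEncoder` is given here. The column names and the `_encoded` suffix are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "transmission_type": ["Manual", "Automatic", "Manual", "Manual"],
})

categorical_columns = ["fuel_type", "transmission_type"]

for col in categorical_columns:
    encoder = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0..n_classes-1.
    df[col + "_encoded"] = encoder.fit_transform(df[col])

# Drop the original categorical columns once the encoded ones exist.
df = df.drop(columns=categorical_columns)
print(df)
```

Note that the integer codes imply an ordering (CNG < Diesel < Petrol here) that has no real meaning, which distance-based methods such as K-means will still treat as numeric distance.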

Next, multicollinearity was checked using variance_inflation_factor. max_power showed a comparatively high VIF (4.939252), so it was removed from the dataset.
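For intuition, the variance inflation factor of feature j is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes this definition directly with NumPy instead of calling statsmodels' variance_inflation_factor, so its values need not match the notebook's. The synthetic data and the names `engine` and `seats` are assumptions.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X using VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        result[j] = 1.0 / (1.0 - r_squared)
    return result

rng = np.random.default_rng(0)
engine = rng.normal(size=200)
max_power = 0.9 * engine + 0.1 * rng.normal(size=200)  # nearly collinear with engine
seats = rng.normal(size=200)                           # unrelated feature
vif_values = vif(np.column_stack([engine, max_power, seats]))
print(vif_values)
```

The two nearly collinear columns receive large VIF values, while the unrelated column stays close to 1.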

We can use this cleaned dataset for clustering and predictive modelling. This concludes the dataset cleaning steps, and the cleaned data has been exported for further use.

5. EDA – Exploratory Data Analysis

EDA 1 - Used cars which are sold through Dealers and Non-Dealers:

Fig 5.1 EDA 1

Observation - Most used cars are sold through dealers rather than individuals.

EDA 2- Used cars- Seller Type vs Selling Price

Fig 5.2 – EDA 2

Observation - Cars sold by dealers tend to have higher selling prices than cars sold by individuals. Additionally, most cars are sold through a dealer.

EDA 3 - Type of Transmission vs Selling Price

Used cars can be divided into two types based on transmission: automatic cars and manual cars.

Fig 5.3 – EDA 3

Observation - Automatic cars have a higher selling price than manual cars.

EDA 4 - Fuel Type vs Selling Price

Fig 5.4 – EDA 4

Electric cars have a higher selling price than all other types of cars. The plot shows that only a few CNG and LPG cars are sold and that the trend towards electric cars is developing slowly.

EDA 5 - Most and least number of cars sold with respect to the age of the car

To find the age of the car from the dataset, we need to add a new custom column called Age.
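A derived Age column of this kind can be sketched as below. The source column name `year` and the reference year 2022 (taken from the publication date of this post) are both assumptions.

```python
import pandas as pd

df = pd.DataFrame({"year": [2018, 2018, 2010, 2001]})

# Reference year against which the age is measured (assumed, see above).
REFERENCE_YEAR = 2022

df["Age"] = REFERENCE_YEAR - df["year"]

# Number of cars per age, most common first.
cars_per_age = df["Age"].value_counts()
print(cars_per_age)
```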

The most sold cars are 4 years old, and the least sold cars are 21 years old.

Fig 5.5 -EDA 5

Additional EDA has been performed in the notebook, but the explanation here is limited due to the word limit.

6. Experiments (three machine learning techniques) and evaluations.

6.0 Feature Selection:

Feature selection is one of the most important aspects of implementing any ML application, as it helps to find the right features that affect the target variable and to avoid false predictions. It reduces the dimensions of the dataset by removing features that show little or no dependency on the target variable, which improves the performance of the predictive model and reduces the computational cost of modelling. Features can be selected by measuring the correlation between them to study their dependency on one another.[7]

In this experiment, the correlation coefficient is used to measure the strength of the linear correlation between two variables. The correlation measure ranges between -1 and 1.

Fig 6.0 - Heat Map of Correlation of Features

The heat map above represents the correlation values between the features (columns). The values range between -1 and 1, and both the X-axis and the Y-axis list the features. The yellow boxes represent the complete dependency of a variable on itself. Orange and dark yellow represent a positive correlation between one variable and another.

Positive Correlation - If the correlation is 1, the value of one variable increases linearly with an increase in the value of the second variable.

Negative Correlation - If the correlation is -1, the value of one variable increases linearly with a decrease in the value of the second variable.

No Linear Correlation - If the correlation is 0, there is no linear correlation between the two variables [10]
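The three cases can be reproduced on a toy data frame with pandas' corr() method. The columns are constructed purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["rising"] = 2 * df["x"] + 1         # increases linearly with x
df["falling"] = 10 - 3 * df["x"]       # decreases linearly as x increases
df["symmetric"] = (df["x"] - 3) ** 2   # depends on x, but not linearly

corr_matrix = df.corr()  # Pearson correlation by default
print(corr_matrix["x"])
```

The last column is fully determined by x, yet its correlation with x is 0. A zero coefficient therefore rules out only a linear relationship, not every kind of dependency.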

6.1 Problem 1 - Clustering Analysis

Clustering is the unsupervised learning process of dividing the data into groups (clusters) based on patterns in the data. A clustering algorithm tries to find natural groups in the data on the basis of some similarity. It locates the centroid of each group of data points. To carry out effective clustering, the algorithm evaluates the distance of each point from the centroid of the cluster. [2][3]

Fig 6.1 – Clustering Overview

Clustering methods are divided into two types: hierarchical and partitional clustering. Partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering builds a cluster hierarchy that can be represented as a tree of clusters. Each cluster can be a child, a parent, or a sibling of other clusters. [1]

What is K-means Clustering?

K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. It is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.[3] (Garbade, 2018)

The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset. (Bu, 2018)

Fig 6.2 – Clustering Flow Chart

K-means Clustering Method:

If k is given, the K-means algorithm can be executed in the following steps [3]:

Fig 6.3 – K-means flow
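The steps themselves appear only in the figure, so the sketch below follows the standard K-means iteration (Lloyd's algorithm): choose initial centroids, assign each point to its nearest centroid, recompute the centroids, and repeat until they stop moving. Taking the first k points as initial centroids is a simplification that keeps the example deterministic. Library implementations such as scikit-learn use random or k-means++ seeding instead.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain K-means; returns (centroids, labels)."""
    # Step 1: use the first k points as initial centroids (simplification).
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two small, well-separated groups of made-up points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```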

I will now apply the same concepts to the dataset in scope.

We therefore proceed with the cleaned dataset obtained from the data cleaning described in section 4.

train_test_split Function

To test a model on our dataset, the model must first be trained on part of the same dataset. This can be achieved with the train_test_split function from scikit-learn's model_selection module, which splits a dataset into two subsets: one for training and one for testing the model. By default, train_test_split shuffles the data and splits it into random partitions.

1. X - the feature data to be split

2. y - the target values, split in parallel with X

3. train_size – sets the size of the training subset

4. test_size – sets the size of the testing subset
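A short usage sketch with made-up arrays is shown here. The 80/20 split and `random_state=42` are assumptions for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # one target value per sample

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

For clustering there is no target column, so the function can also be called with the feature data alone, in which case it returns only the training and testing parts of X.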

To perform K-means clustering, the first step is to normalize the data using MinMaxScaler so that the features are transformed to a similar scale.
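MinMaxScaler rescales each feature independently with x' = (x - min) / (max - min), so every column ends up in the range 0 to 1. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (for example, kilometres and seats).
features = np.array([[10000.0, 1.0],
                     [50000.0, 5.0],
                     [90000.0, 9.0]])

scaler = MinMaxScaler()   # default feature_range is (0, 1)
scaled = scaler.fit_transform(features)
print(scaled)
```

Without this step, a feature measured in tens of thousands would dominate the Euclidean distances that K-means relies on.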

Elbow Method:

The elbow method is used to determine the optimal number of clusters in K-means clustering. For each value of K, we calculate the WCSS (Within-Cluster Sum of Squares).

WCSS is the sum of squared distances between each point and the centroid of its cluster. When we plot the WCSS against the K value, the plot looks like an elbow. As the number of clusters increases, the WCSS value decreases, and the K at which the decrease slows down sharply (the elbow) is taken as the optimal number of clusters. [3]

Implementation:

The code produces the following graphical representation of the elbow:

Fig 6.4 - Elbow
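A minimal sketch of the WCSS computation behind such a plot is given below. It runs on synthetic data with three obvious groups, which is an assumption for the example, and uses scikit-learn's `inertia_` attribute, which holds the WCSS of a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight groups of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
data = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])

wcss = []
k_values = range(1, 8)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    wcss.append(model.inertia_)   # inertia_ is scikit-learn's name for WCSS

for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

Plotting `wcss` against `k_values` with any plotting library gives the elbow curve. On this synthetic data the drop flattens after K = 3.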

Now the next step is to go ahead and fit the model using K-means.

Then we append the predictions and merge them with the dataset.

Finally, we get the various clusters grouped into a new column called ClusterNo:

Fig 6.6 -Clusters
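Fitting the model and attaching the cluster labels can be sketched as follows. The six-row data frame and its columns are made up, and k = 3 follows the number of clusters reported in this section.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

cars = pd.DataFrame({
    "selling_price": [1.2e5, 1.5e5, 5.5e5, 6.0e5, 2.5e6, 2.7e6],
    "Age": [12, 11, 5, 4, 2, 1],
})

# Scale first so that price does not dominate the distance computation.
scaled = MinMaxScaler().fit_transform(cars)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
cars["ClusterNo"] = model.fit_predict(scaled)

# Per-cluster averages help to interpret what each cluster represents.
cluster_profile = cars.groupby("ClusterNo").mean()
print(cars)
print(cluster_profile)
```

The cluster numbers themselves are arbitrary labels. Only the grouping they induce carries meaning.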

Our algorithm has defined 3 different clusters based on the features. More clustering output, including the properties of each cluster, can be seen in the submitted Colab notebook.

6.1.2 Model Evaluation of Clustering: Silhouette Score

The silhouette coefficient, or silhouette score, is a metric used to measure the goodness of a clustering. Its value ranges from -1 to 1, and values close to 1 indicate compact, well-separated clusters.

The silhouette score for the clustering is 0.57, which indicates a decent clustering since the value is above 0.50. Moreover, the scores for the training data and the test data are similar. A silhouette score of 0.78 was also calculated in the cluster analysis, which supports the quality of the clusters.
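For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The score is the average of s over all points. The sketch below computes it with scikit-learn on synthetic data, so the resulting number is unrelated to the scores reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
data = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
score = silhouette_score(data, labels)
print(round(score, 2))
```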

9. References

[1] Prateek Majumder (2021) K-Means clustering with Mall Customer Segmentation Data. Analytics Vidhya Blog [blog]. Available at https://www.analyticsvidhya.com/blog/2021/05/k-means-clustering-with-mall-customer-segmentation-data-full-detailed-code-and-explanation/ [Accessed 27 Dec 2021]

[2] Basil Saji (2021) In-depth Intuition of K-Means. Analytics Vidhya Blog [blog]. Available at https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/ [Accessed 04 Jan 2022]

[3] Edureka (2020) Understanding K-means Clustering with Examples [blog]. Available at https://edureka.co/blog/k-means-clustering/ [Accessed 05 Jan 2022]

[4] Chayan Kathuria (2020) The Machine Learning Project Template [blog]. Available at https://towardsdatascience.com/a-basic-machine-learning-project-template-bf0ade0941d3 [Accessed 05 Jan 2022]

[5] Afroz Chakure (2019) Random Forest and its implementation [blog]. Available at https://medium.com/swlh/random-forest-and-its-implementation-71824ced454f/ [Accessed 09 Jan 2022]

[6] Harshita Singh (2020) Data Preprocessing [blog]. Available at https://towardsdatascience.com/data-preprocessing-e2b0bed4c7fb/ [Accessed 05 Jan 2022]

[7] Car Dekho (2021) Car Dekho.com. Available at https://www.cardekho.com/ [Accessed 05 Jan 2022]

[8] sklearn.linear_model.LinearRegression — scikit-learn 0.24.2 documentation.

[9] sklearn.model_selection.train_test_split — scikit-learn 0.24.2 documentation.