Machine Learning - Used Cars Dataset - Predicting Used Car Prices and Comparing Models
- amalpappalilramesh
- Mar 6, 2022
- 9 min read

1. Domain Description
There has been continuous growth in the used car industry over the past few years, with buyers increasingly attracted to used cars over new ones because of their affordability. The domain I have chosen is the Automotive Domain: Used Cars in India, and I will be carrying out different experiments as a part of this assessment.
During the last decade, the production of automobiles has increased dramatically. People naturally desire to own and use cars, as they are attached to symbols of status, self-affirmation and power. (CarDekho.com, 2018)
With the increased use of cars, the used car market has also grown, as used cars are cheaper than new ones. Used car sellers and online portals are taking advantage of the increased demand for used cars to list unrealistic prices. As a result, it becomes necessary to estimate a used car's price using machine learning algorithms and techniques, to help customers.
About Car Dekho
CarDekho.com is India's leading car website that helps users buy cars. The website carries rich automotive content such as expert reviews, detailed specs and prices, comparisons as well as videos and pictures of all car brands and models available in India. (CarDekho.com, 2018)
The company has tie-ups with many auto manufacturers, more than 4000 car dealers and numerous financial institutions to facilitate the purchase of vehicles. The platform also has used car classifieds wherein users can upload their cars for sale, and find used cars for buying from individuals and used car dealers.[8]
2. Problem Definition
This assessment report aims to detect the features that impact the price of used cars, and experiments are performed to investigate an optimal algorithm for used car price prediction.
Because prices are usually determined by many unique features and factors, accurate car price prediction requires expert knowledge. The most significant factors are usually the brand and model, age, power and mileage. The price of a car is also affected by features such as the number of doors, transmission type, safety, AC, etc. In this report, I will apply different methods and techniques in order to achieve higher precision in used car price prediction.
The main steps involved are the following: describing the dataset, wrangling and cleaning the data, selecting features, building the models and evaluating/comparing the results.

3. Data Set Description
The data set in scope is an extract from Kaggle (Cardekho.com). The details are:

Fig. 3.1 – Dataset View
We can get detailed information on the dataset using the following command in the Colab notebook: CarsDF.info()
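For context, a minimal loading sketch (the CSV file name is an assumption):

```python
import pandas as pd

# Load the Kaggle (CarDekho.com) extract; the file name is an assumption.
CarsDF = pd.read_csv("cardekho_used_cars.csv")

# Print column names, non-null counts and dtypes.
CarsDF.info()
```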

Fig 3.2 – Dataset Info
Upon detailed analysis of the data values in each of the columns, I have segregated the dataset into the types mentioned below:

4. Data Set Exploration – Wrangling and Cleaning
Data pre-processing is performed to balance the dataset by removing null and missing values, which reduces the number of instances. The missing values in the dataset are handled by dropping the affected instances. As the selected dataset is massive, dropping a few cases doesn't affect the performance of the models.[6]
1. Drop the unwanted columns from the data frame
The first step is to drop the columns which do not add any value to the data. These have been identified earlier and can now be removed with the Python code below.
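A hedged sketch; the column names here are placeholders for the ones identified earlier:

```python
# The column names below are placeholders/assumptions; substitute the
# columns identified as adding no value during the column analysis.
unwanted_cols = ["full_name", "new_price"]
CarsDF = CarsDF.drop(columns=unwanted_cols)
```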

2. Get the null count details of the columns:
The next step is to get the null count percentages of the new dataset from step 1. I have used the Python code sketched below, and the output is as follows.
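A minimal sketch of the null-percentage computation:

```python
# Percentage of null values per column, highest first.
null_pct = CarsDF.isnull().mean().mul(100).sort_values(ascending=False)
print(null_pct)
```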

Fig 4.1 – Null value counts
Also, on deeper analysis it was found that the percentage of NaN values is high.

Since the percentage of null rows is high (more than 50%), we need to analyse further before taking steps to impute them. Hence, we check which columns contain NaN values.
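A one-line check sketch:

```python
# List the columns that contain at least one NaN value.
nan_cols = CarsDF.columns[CarsDF.isna().any()].tolist()
print(nan_cols)
```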

The results showed that NaN values can be found in all columns:
3. Remove the rows with NaN values
The next step is to drop all the rows with NaN values, as sketched below.
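A minimal sketch:

```python
# Drop every row that still contains a NaN value.
CarsDF = CarsDF.dropna()
print(CarsDF.shape)
```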

This will reduce the dataset size from 20,026 rows to 19,980 rows.
4. Manipulate the price and cost columns to show cost values
Initially, the price columns contain data in mixed units of numbers and text. Our purpose is to convert them to a common numeric format that actually depicts the price of the vehicle. After performing the conversion, the selling price column holds plain numeric values.
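A conversion sketch; the raw string formats ("Lakh"/"Crore" suffixes) and the column name are assumptions about the data, not confirmed by the original notebook:

```python
# Assumed raw formats such as "4.5 Lakh" or "1.2 Crore"; adjust to the real data.
def to_rupees(value):
    if not isinstance(value, str):
        return value
    parts = value.replace(",", "").split()
    amount = float(parts[0])
    if len(parts) > 1 and parts[1].lower().startswith("lakh"):
        return amount * 1e5   # 1 Lakh = 100,000
    if len(parts) > 1 and parts[1].lower().startswith("crore"):
        return amount * 1e7   # 1 Crore = 10,000,000
    return amount

CarsDF["selling_price"] = CarsDF["selling_price"].apply(to_rupees)
```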
5. Removal of Duplicate Data:
It has been noted that the dataset contains a lot of duplicate data. Before deleting duplicate rows, the dataset has 19,852 rows; the check and removal are sketched below.
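A minimal sketch of this step, using pandas' duplicated() and drop_duplicates() over full rows:

```python
# Count exact duplicate rows, then drop them.
print("Duplicate rows:", CarsDF.duplicated().sum())
CarsDF = CarsDF.drop_duplicates()
print("After dropping duplicates:", CarsDF.shape)
```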

Total number of duplicate rows: [157 rows x 14 columns]

After dropping duplicates: [19695 rows × 14 columns]
6. Outlier data identification
In order to identify outliers, we consider the values that are less than the lower bound or greater than the upper bound. We will use winsorization [4] methods to handle the outlier data:
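A minimal sketch under the lower/upper-bound convention described above, using kms_driven as an example column; the 1.5×IQR bounds are an assumption:

```python
# IQR-based bounds; values outside them are clipped (winsorized).
q1, q3 = CarsDF["kms_driven"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
CarsDF["kms_driven"] = CarsDF["kms_driven"].clip(lower, upper)
```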

7. Find Correlation between the independent features:
We can use the corr() method and the result is as follows:
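For instance, a minimal sketch of the correlation computation:

```python
# Pearson correlation matrix over the numeric features.
corr = CarsDF.corr(numeric_only=True)
print(corr)
```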

Pair Plot:
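A sketch of the pair plot, assuming seaborn is available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of the numeric features.
sns.pairplot(CarsDF.select_dtypes("number"))
plt.show()
```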

Detailed correlation and heat map experiments on the features have been performed in section 6.0, Feature Selection.
8. Identify Skewness
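A one-line sketch to compute the per-column skewness:

```python
# Skewness per numeric column; values far from 0 indicate a skewed distribution.
print(CarsDF.skew(numeric_only=True))
```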

Skewness plot:

In order to clean further, winsorization was performed on various columns like mileage and kms_driven (details/code in the notebook).

Label Encoding - Encoding refers to converting the labels into a numeric form so as to make them machine-readable. [4] Machine learning algorithms can then decide in a better way how those labels should be operated on. The original categorical columns have also been removed.
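A sketch of this step with sklearn's LabelEncoder; the categorical column names are assumptions:

```python
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column into a new numeric column, then drop the
# original text columns (the column names here are assumptions).
categorical_cols = ["fuel_type", "transmission_type", "seller_type"]
for col in categorical_cols:
    CarsDF[col + "_encoded"] = LabelEncoder().fit_transform(CarsDF[col])
CarsDF = CarsDF.drop(columns=categorical_cols)
```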

As a result of label encoding, a new set of columns is created containing the encoded values for the categorical variables, as shown above.

Next, multicollinearity was checked using the variance_inflation_factor, and it was found that max_power had a high VIF (4.939252); hence it was removed from the dataset.
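A sketch of the VIF check with statsmodels' variance_inflation_factor:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for every numeric feature in the cleaned frame.
X = CarsDF.select_dtypes("number")
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))
```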

We could use this cleaned dataset for Clustering and Predictive Modeling. This concludes the dataset cleaning steps and an export of the cleaned data has been taken for further usage.
Used Car Price Prediction
The next experiment is to predict used car prices using 3 ML techniques:
1. Multiple Linear Regression
2. Random Forest Regressor
3. XGBoost Regressor
The first of these is a multiple linear regression model to predict the selling price; the cleaned dataset from section 4 is again used for the modelling. Multiple linear regression, also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable. (Hayes, 2021)

Here we will be using the LinearRegression model from sklearn.linear_model. The following code fits the dataset to the model and prints the regression coefficients and the regression variance score.
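A minimal sketch of this step; the target column name, split ratio and random_state are assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features vs. target; "selling_price" and the 80/20 split are assumptions.
X = CarsDF.drop(columns=["selling_price"])
y = CarsDF["selling_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression().fit(X_train, y_train)
print("Coefficients:", lr.coef_)
print("Train score:", lr.score(X_train, y_train))
print("Test score:", lr.score(X_test, y_test))
```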

After fitting the model, the scores obtained are:

| Score type | Score |
| --- | --- |
| Train Score | 0.5149 |
| Test Score | 0.5782 |
Model Evaluation:
We can go ahead and evaluate the model using the MAPE scores and the adjusted R2 scores. Details are in section 7 (Results).

| Score type | Score |
| --- | --- |
| R2 Score | 0.5770 |
| MAPE | 56.76 |
An Ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model. The aim of ensemble learning is to combine different classifiers into a meta-classifier that has better generalization performance than each individual classifier alone.
Types of Ensemble Learning:
1. Boosting
2. Bagging
3. Stacking
Bagging involves random sampling of small subsets of data from the dataset with replacement. It runs each model independently and then aggregates the outputs at the end without preference to any model.

Fig 6.9 – Bagging
6.2.3.1 Random Forest Regressor:
1. Supervised Learning algorithm
2. Operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
3. Combines the result of multiple predictions
In a Random Forest, we sample over the features and keep only a random subset of them to build each tree. This makes the trees a bit less correlated with each other. It also makes the decision-making process more robust to missing data.

Fig 6.10
The model has been built using the RandomForestRegressor from sklearn.ensemble.
We use the following code to fit the RandomForestRegressor:
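A minimal sketch, reusing the train/test split from the linear regression step; the hyperparameters are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a forest of 100 trees; n_estimators and random_state are assumptions.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Train score:", rf.score(X_train, y_train))
print("Test score:", rf.score(X_test, y_test))
```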


Hyperparameter tuning:
The model has been fit with 2 methods:
1. Without hyperparameter tuning
2. With hyperparameter tuning, as sketched below
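The exact tuning method used in the original notebook is not shown here; as one common choice, a hedged sketch with RandomizedSearchCV over an assumed search space:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space, not the grid used in the original notebook.
param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```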



Cross Validation:
Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. We obtain the cross-validation scores with the cross_val_score() function from scikit-learn. We set the number of folds with the cv parameter.

The scoring parameter chooses a measure of model quality to report: in this case, we chose negative mean absolute error (MAE).
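A sketch of the cross-validation step; cv=5 is an assumption:

```python
from sklearn.model_selection import cross_val_score

# 5-fold CV; scikit-learn returns negated MAE, so flip the sign.
scores = -cross_val_score(rf, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE scores:", scores)
print("Average MAE:", scores.mean())
```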
6.2.3.2 XGBoost Regressor: Boosting

Fig 6.11 - Boosting
1. A decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
2. XGBoost stands for eXtreme Gradient Boosting.
3. It is a combination of software and hardware optimization techniques designed to yield superior results using fewer computing resources in the least amount of time.
4. It optimizes the gradient boosting algorithm through parallel processing, handling of missing values and regularization to avoid overfitting/bias.

Fig 6.12 – XGBoost Overview
Each base learner updates the values of the observations in the dataset. As the name suggests, the weak learners are combined sequentially using gradient descent: each new predictor is fit to the residual errors made by the previous predictor.
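A minimal fitting sketch, reusing the earlier train/test split; the XGBRegressor hyperparameters are assumptions:

```python
from xgboost import XGBRegressor

# Fit a gradient-boosted tree ensemble; hyperparameters are assumptions.
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train)
print("Train score:", xgb.score(X_train, y_train))
print("Test score:", xgb.score(X_test, y_test))
```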
After fitting the model, the scores are obtained as follows.




Cross Validation: The same code has been run as for the RandomForestRegressor.
MAE scores obtained:

Average MAE score:

| Score type | Score |
| --- | --- |
| Average MAE | 124919 |
The dataset is fed into the selected ML algorithms. The models were imported from the different sklearn modules. Each model is trained on the training dataset using the fit method, with the train and test splits passed as the respective datasets. The selected algorithms are regression models. Each model's efficiency is calculated and compared using the model scores on both the training and testing data, and the absolute error between predicted and actual values is used as an additional estimate.
Table 7.1 presents the experimental results of the work, where the 3 selected ML algorithms, Multi Linear Regressor (MLR), Random Forest Regressor (RFR) and XGBoost Regressor, are trained and tested on the dataset. After building the models, the model score performance metrics are evaluated for both the training and testing datasets to investigate how well each model predicts the price of the used cars.
| Algorithm | Training Score (%) | Testing Score (%) | Adjusted R2 Score | MAPE (%) |
| --- | --- | --- | --- | --- |
| Multi Linear Regressor (MLR) | 51.49 | 57.82 | 0.5770 | 56.76 |
| Random Forest Regressor (RFR) | – | – | 0.8997 | – |
| XGBoost Regressor | 98.21 | 90.36 | 0.9033 | 18.58 |

Table 7.1
All 3 models were trained and tested on thousands of instances to get an accurate prediction. From the results, XGBoost performed best, with the highest model score of 98.21% on training data and 90.36% on testing data.
A forecast system's accuracy is measured by its mean absolute percentage error (MAPE). This accuracy is measured in percentages: for each period, the forecast value is subtracted from the actual value, the difference is divided by the actual value, and the absolute results are averaged.
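For reference, this is the standard MAPE formula, where $A_t$ is the actual value and $F_t$ the forecast for instance $t$:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$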

On the MAPE scores, both RFR and XGBoost provide a mean absolute percentage error of less than 20%, which means the models are quite accurate.
R2 is the squared correlation between observed and predicted values and is a measure of a model's quality. When many variables are included in the model, the adjusted R-squared provides a fairer measure of its quality. Here, the adjusted R2 score is highest for XGBoost (0.9033), followed by the RFR (0.8997). Hence, we can conclude that XGBoost is the best model.

The observations prove that the XGBoost Regressor model has the top adjusted R2 score among the 3 models on this dataset. Feature importance was calculated for this model to identify the factors that significantly impact the price of used cars. The chart below represents the feature importance of all the input features.
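A sketch of how such a chart can be produced from the fitted XGBoost model of the earlier snippets:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bar chart of XGBoost feature importances, largest at the top.
importances = pd.Series(xgb.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
```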

It is clear from the above graph that mainly 4 features out of the 10 in scope contribute the most to the prediction of the selling price. Of all the features, Transmission Type has the most significance, followed by Fuel Type.

Due to the high demand for cars in India, many people are buying used cars because of their affordability. Many websites use inappropriate algorithms for listing prices of used cars, causing huge losses, and customers are also disappointed as their value for money decreases. In this report, various experiments have been performed on the used cars dataset and a comparison matrix is derived in the results section. The factors taken into consideration were the R2 scores, MAPE scores and model scores for all the algorithms experimented with.

From the above observations, XGBoost is considered the optimal algorithm based on:
1. For this dataset, it obtained 90.36% on test data.
2. For the adjusted R2, it obtained 0.9028, the highest score among the rest.
3. For the mean absolute percentage error, it obtained 18.58%, which is less than 20% and shows that the model is the most accurate!
With all the above points taken into account, this model could be extremely useful for predicting the selling price of used cars and will help used car companies.