Predicting Car Prices

A beginner-friendly guide to using a linear regression model to predict car prices, calculate the error percentage, and improve on it using feature selection.

Alifia Ghantiwala
Oct 30, 2020
Photo by Matt Alaniz on Unsplash

The input dataset, which is available on Kaggle, contains information about used cars listed on www.cardekho.com.

A quick glance at the data gives us an idea of the columns and their data types. The data contains no null values, and the year column ranges from 1992 to 2020.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
#   Column         Non-Null Count  Dtype
--- ------         --------------  -----
0   name           4340 non-null   object
1   year           4340 non-null   int64
2   selling_price  4340 non-null   int64
3   km_driven      4340 non-null   int64
4   fuel           4340 non-null   object
5   seller_type    4340 non-null   object
6   transmission   4340 non-null   object
7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB

Exploration of the data

We start by checking how much influence each input field has on the selling price of a car.

Influence of field “seller_type”

Individual sellers account for the largest number of listings, but Trustmark dealers sell their cars at the highest prices.
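
A comparison like this can be computed with a pandas group-by. The snippet below is a minimal sketch on a handful of made-up rows, not the Kaggle data. The category labels ("Individual", "Dealer", "Trustmark Dealer") and the prices are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; the real dataset has 4340 listings.
cars = pd.DataFrame({
    "seller_type": ["Individual", "Individual", "Individual", "Dealer", "Trustmark Dealer"],
    "selling_price": [100000, 200000, 300000, 500000, 900000],
})

# Count the listings and average the selling price for each seller type.
summary = (
    cars.groupby("seller_type")["selling_price"]
    .agg(listings="count", mean_price="mean")
    .sort_values("mean_price", ascending=False)
)
print(summary)
```

The same pattern works for "owner", "transmission" and "fuel" by changing the column passed to groupby.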

Let’s visualize the effect of the field “owner” on the selling price.

The number of cars in each owner category and the price at which those cars sell follow a similar pattern in the two graphs.

The graph below helps us understand how the transmission type affects the selling price.

More manual cars than automatic cars are listed for sale, but automatic cars sell at a higher price.

The plot below shows that cars which have been driven less sell for a higher price, and that newer cars fetch a higher value when sold.

Diesel cars sell for the highest price, followed by petrol, CNG, electric and LPG.

Most of the fields we have are categorical, so we have to convert them into numeric data before we can work with them.

Each categorical field is broken down into numeric columns. For example, the field “fuel” is now divided into the fields fuel_petrol, fuel_diesel, fuel_CNG and so on, where a 1 in column fuel_petrol means the car uses petrol.
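
One common way to do this kind of conversion is one-hot encoding with pandas' get_dummies. The sketch below assumes that approach and uses a few made-up rows. Note that the generated column names inherit the capitalization of the raw values, so the exact names depend on how the categories are spelled in the data.

```python
import pandas as pd

# Made-up rows for illustration only.
cars = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "km_driven": [70000, 50000, 100000, 46000],
})

# Replace the "fuel" column with one 0/1 indicator column per fuel type.
encoded = pd.get_dummies(cars, columns=["fuel"], dtype=int)
print(encoded)
```

Passing several column names in the columns list encodes all of them in one call.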

Linear Regression Model

Before we start working on the linear regression model, we split our data into training and test sets.
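
A split like this is often done with scikit-learn's train_test_split. The following is a minimal sketch on randomly generated stand-in data. The 80/20 ratio and the random_state value are assumptions, since the split used in the original analysis is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data for illustration only (not the Kaggle dataset).
rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({
    "year": rng.integers(1992, 2021, size=n),
    "km_driven": rng.integers(1000, 200000, size=n),
    "selling_price": rng.integers(20000, 900000, size=n),
})

X = data.drop(columns="selling_price")  # input features
y = data["selling_price"]               # target to predict

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```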

Training the model and calculating the percentage of error
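
The snippet below is a minimal sketch of this step using scikit-learn's LinearRegression. It assumes that "percentage of error" means the mean absolute percentage error (MAPE) on the test set, since the exact metric is not defined here, and it runs on synthetic data generated inside the snippet rather than on the car listings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (illustration only).
rng = np.random.default_rng(0)
n = 200
age_years = rng.uniform(0, 20, size=n)
km_driven = rng.uniform(1_000, 150_000, size=n)
selling_price = (
    900_000 - 30_000 * age_years - 1.0 * km_driven
    + rng.normal(0, 10_000, size=n)
)

X = np.column_stack([age_years, km_driven])
y = selling_price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows, predict on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Assumed metric: mean absolute percentage error on the test set.
error_pct = np.mean(np.abs((y_test - predictions) / y_test)) * 100
print(f"Mean absolute percentage error: {error_pct:.1f}%")
```

MAPE is only well defined when no true price is zero, which holds for selling prices.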

As the accuracy is not very good, let us try removing some columns before feeding the data into our regression model.

We get a slightly better error percentage by removing fields like ‘fuel_LPG’, ‘fuel_Electric’ and ‘year’.

Let us try to improve the accuracy of our model further through feature selection, using the EDA we did earlier as the criterion.
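
A comparison like this can be wrapped in a small helper that trains on whatever columns it is given and reports the error. The sketch below is one way to do that. The helper name percentage_error is hypothetical, the error metric is assumed to be the mean absolute percentage error, and the data is synthetic, so the two printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def percentage_error(frame, target="selling_price"):
    """Hypothetical helper: fit a linear model and return the test-set MAPE in percent."""
    X = frame.drop(columns=target)
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    pred = LinearRegression().fit(X_train, y_train).predict(X_test)
    return float(np.mean(np.abs((y_test - pred) / y_test)) * 100)


# Synthetic stand-in for the encoded data (illustration only).
rng = np.random.default_rng(1)
n = 300
encoded = pd.DataFrame({
    "year": rng.integers(1992, 2021, size=n),
    "km_driven": rng.integers(1_000, 200_000, size=n),
    "fuel_Diesel": rng.integers(0, 2, size=n),
    "fuel_LPG": rng.integers(0, 2, size=n),
    "fuel_Electric": rng.integers(0, 2, size=n),
})
encoded["selling_price"] = (
    400_000 + 300_000 * encoded["fuel_Diesel"]
    - 0.5 * encoded["km_driven"] + rng.normal(0, 20_000, size=n)
)

# Drop the columns named in the text and compare the two errors.
reduced = encoded.drop(columns=["fuel_LPG", "fuel_Electric", "year"])
error_all = percentage_error(encoded)
error_reduced = percentage_error(reduced)
print(f"All columns: {error_all:.1f}%  Reduced columns: {error_reduced:.1f}%")
```

Using the same random_state in both runs keeps the train/test rows identical, so any difference in error comes from the columns alone.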

By using feature selection we got our error percentage down from 74% to 65%.
