Predicting Car Prices
A beginner-friendly guide to using a linear regression model to predict car prices, calculate the error percentage, and improve on it using feature selection.
The input dataset contains information about used cars listed on www.cardekho.com, which we found through a dataset available on Kaggle.
A quick glance at the data gives us an idea of the columns and their datatypes. The data contains no null values, and the year column ranges from 1992 to 2020.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   name           4340 non-null   object
 1   year           4340 non-null   int64
 2   selling_price  4340 non-null   int64
 3   km_driven      4340 non-null   int64
 4   fuel           4340 non-null   object
 5   seller_type    4340 non-null   object
 6   transmission   4340 non-null   object
 7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB
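A summary like the one above is what pandas prints from DataFrame.info(). The following is a minimal sketch of that inspection step. The three rows are made up for illustration and only mimic the column layout; with the real data you would pass the path of the downloaded Kaggle CSV to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical sample rows in the same column layout (values are made up).
csv_text = """name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
Car A,2007,60000,70000,Petrol,Individual,Manual,First Owner
Car B,2012,600000,100000,Diesel,Individual,Manual,First Owner
Car C,2017,850000,46000,Petrol,Dealer,Automatic,Second Owner
"""

# Read the inline text; a file path would be used for the real dataset.
df = pd.read_csv(io.StringIO(csv_text))

df.info()                         # column names, non-null counts, dtypes
null_counts = df.isnull().sum()   # number of nulls per column
year_range = (int(df["year"].min()), int(df["year"].max()))

print(null_counts)
print(year_range)
```

Checking isnull().sum() and the min and max of year is one simple way to confirm the "no null values" and year-range observations.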
Exploration of the data
We start by checking the influence that the different input fields have on the selling price of a car.
Influence of field “seller_type”
The number of individual sellers is the highest, but Trustmark dealers sell their cars at the highest prices.
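A comparison like this can be computed with a groupby before plotting it. The sketch below uses a handful of made-up rows, and the category labels ("Individual", "Dealer", "Trustmark Dealer") are assumed spellings; it only illustrates counting listings and averaging the price per seller type.

```python
import pandas as pd

# Toy rows, made up for illustration only.
df = pd.DataFrame({
    "seller_type": ["Individual", "Individual", "Individual",
                    "Dealer", "Trustmark Dealer"],
    "selling_price": [100000, 200000, 300000, 500000, 900000],
})

# Count the listings and average the selling price for each seller type.
summary = (
    df.groupby("seller_type")["selling_price"]
      .agg(listings="count", mean_price="mean")
      .sort_values("mean_price", ascending=False)
)
print(summary)
```

The same pattern works for the other categorical fields (owner, transmission, fuel) by changing the groupby column.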
Let’s visualize the effect of the field “owner” on the selling price.
The number of cars listed in each owner category and the selling price in each category follow a similar pattern.
The graph below helps us understand how the transmission type affects the selling price.
More manual cars are being sold than automatic ones, but automatic cars sell at a higher price.
The plot below shows that cars that have been driven less sell for a higher price, and that newer cars have a higher resale value.
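One way to back up a visual impression like this with numbers is a correlation matrix of the numeric columns. This is a sketch on made-up values, not the article's actual figures: a positive correlation between year and selling_price and a negative one between km_driven and selling_price would match the observation.

```python
import pandas as pd

# Toy values, made up so that newer and less-driven cars cost more.
df = pd.DataFrame({
    "year": [2005, 2010, 2014, 2018, 2020],
    "km_driven": [150000, 120000, 80000, 40000, 10000],
    "selling_price": [80000, 150000, 300000, 550000, 700000],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr["selling_price"])
```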
Diesel cars sell for the highest price, followed by petrol, CNG, electric and LPG.
Most of the fields we have are categorical, so we have to convert them into numeric data before we can work with them.
Each categorical field is broken down into numeric indicator columns. For example, the field “fuel” is now divided into the fields fuel_petrol, fuel_diesel, fuel_CNG and so on, where a 1 in column fuel_petrol means the car uses petrol.
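This kind of conversion is commonly called one-hot encoding, and pandas offers it through pd.get_dummies. The sketch below assumes that function was used, which the article does not state, and runs on made-up rows. Note that get_dummies builds the new column names from the category values, so the exact casing (for example fuel_Petrol) depends on how the values are spelled in the data.

```python
import pandas as pd

# Toy rows, made up for illustration only.
df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "transmission": ["Manual", "Manual", "Manual", "Automatic"],
    "km_driven": [70000, 100000, 46000, 30000],
})

# Replace each categorical column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["fuel", "transmission"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 among the fuel_* columns, marking the fuel that car uses.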
Linear Regression Model
Before building the linear regression model, we split our data into training and test sets.
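A split like this is often done with scikit-learn's train_test_split. The sketch below assumes that helper and an 80/20 split; the article does not say which ratio was used, and the arrays are random placeholders standing in for the encoded features and the selling price.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.random(100)

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when comparing error percentages between different feature sets.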
Training the model and calculating the error percentage
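The article does not say how the error percentage is defined. The sketch below assumes the mean absolute percentage error (MAPE) on the test set and uses scikit-learn's LinearRegression on synthetic data. The feature names age and km and all coefficients are made up for illustration, so the printed error says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price falls with age and kilometres driven, plus noise.
rng = np.random.default_rng(0)
age = rng.uniform(0, 20, 200)          # hypothetical feature
km = rng.uniform(0, 200000, 200)       # hypothetical feature
price = 900000 - 25000 * age - 1.5 * km + rng.normal(0, 20000, 200)

X = np.column_stack([age, km])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Fit on the training set, predict on the held-out test set.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Mean absolute percentage error (assumed definition of "error percentage").
error_pct = np.mean(np.abs((y_test - predictions) / y_test)) * 100
print(f"Error percentage: {error_pct:.1f}%")
```

MAPE is sensitive to cheap cars, because a fixed absolute miss is a large fraction of a small price; that is one possible reason for a high error percentage on a price range this wide.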
As the accuracy is not very good, let us try removing some columns before feeding the data into our regression model.
We get a slightly better error percentage by removing fields like ‘fuel_LPG’, ‘fuel_Electric’ and ‘year’.
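Removing columns from the feature set can be done with DataFrame.drop. The sketch below runs on a few made-up rows of an already encoded frame; the reduced frame would then go through the same split, fit and error calculation as before.

```python
import pandas as pd

# Toy encoded rows, made up for illustration only.
encoded = pd.DataFrame({
    "year": [2007, 2012, 2017],
    "km_driven": [70000, 100000, 46000],
    "fuel_Diesel": [0, 1, 0],
    "fuel_Electric": [0, 0, 0],
    "fuel_LPG": [0, 0, 0],
    "fuel_Petrol": [1, 0, 1],
    "selling_price": [60000, 600000, 850000],
})

# Columns named in the text, plus the target, are left out of the features.
to_drop = ["fuel_LPG", "fuel_Electric", "year"]
X_reduced = encoded.drop(columns=to_drop + ["selling_price"])
y = encoded["selling_price"]

print(X_reduced.columns.tolist())
```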
Let us try to further improve the accuracy of our model through feature selection, using criteria based on the EDA we did earlier.
By using feature selection we got our error percentage down from 74% to 65%.