After cleaning the data so that every car model was listed under its brand name (e.g. Volkswagen Golf 2007 -> VW), I plotted the number of models per brand. The bar chart shows clearly which brand has the most models (Toyota) and which has the fewest (Mercury). Most brands appear to have around 10 models, but this plot is not a good way to visualise the average.
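The cleaning step described above could look roughly like the sketch below. It is not the code used for this analysis: the alias table, the `brand_of` helper and the sample listings are all made up for illustration, and it assumes the brand is always the first word of a listing.

```python
from collections import Counter

# Hypothetical alias table: maps the first word of a listing to a brand label.
BRAND_ALIASES = {"volkswagen": "VW", "vw": "VW"}


def brand_of(listing: str) -> str:
    """Return the brand label for a listing such as 'Volkswagen Golf 2007'.

    Assumes the brand is the first word, so two-word brands would need
    their own entries or a different rule.
    """
    first_word = listing.split()[0]
    return BRAND_ALIASES.get(first_word.lower(), first_word.title())


# Made-up sample listings, only to show the shape of the result.
listings = [
    "Volkswagen Golf 2007",
    "VW Polo 2010",
    "Toyota Corolla 2009",
    "Toyota Yaris 2011",
    "Toyota Prius 2012",
    "Mercury Sable 2005",
]

# Number of listed models per brand, which is what the bar chart displays.
models_per_brand = Counter(brand_of(name) for name in listings)
print(models_per_brand.most_common())
```

Counting with `Counter` gives the per-brand totals directly, and `most_common()` orders them the same way a sorted bar chart would.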
Here I plot the density of the number of models per brand, which shows the average across brands more clearly. There is also a small peak at 32 models, which we know to be Toyota from the bar chart above. The average number of models per brand is approximately 5.
Since we are interested in the price of a car, we should plot the density of car prices to get an understanding of the typical price. Most cars are priced around £10,000; the curve then drops and levels off around £20,000, where a few more cars sit, before tailing off towards £50,000, as we would expect.
I started with a model that includes all variables. This way I can plot the residuals to detect outliers and, more importantly, determine whether the data is suitable for multiple linear regression.
Overall, the model appears to be a good fit and suitable for prediction using multiple linear regression. However, we need to deal with the outlier identified, and we can also simplify the model without reducing its accuracy. Using backward elimination or forward selection, we can remove insignificant variables and build a simpler model to use for prediction.
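To make the idea of backward elimination concrete, here is a minimal pure-Python sketch. It assumes an AIC-based stopping rule (drop the variable whose removal lowers AIC the most, stop when no removal helps), which may differ from the criterion used in this analysis. The data is synthetic and the variable names (`engine_size`, `noise_a`, `noise_b`) are invented for illustration.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss_of_fit(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def aic(rss, n, n_coef):
    """Gaussian AIC up to an additive constant that cancels in comparisons."""
    return n * math.log(rss / n) + 2 * n_coef


def backward_eliminate(data, y):
    """Drop one variable at a time for as long as doing so lowers AIC."""
    kept = list(data)
    n = len(y)
    best = aic(rss_of_fit([data[k] for k in kept], y), n, len(kept) + 1)
    while kept:
        trials = []
        for name in kept:
            rest = [k for k in kept if k != name]
            score = aic(rss_of_fit([data[k] for k in rest], y), n, len(rest) + 1)
            trials.append((score, name))
        score, name = min(trials)
        if score >= best:
            break  # no single removal improves the model any further
        kept.remove(name)
        best = score
    return kept, best


# Synthetic example: price depends on one variable, the other two are noise.
random.seed(0)
n_cars = 60
data = {
    "engine_size": [random.gauss(0, 1) for _ in range(n_cars)],
    "noise_a": [random.gauss(0, 1) for _ in range(n_cars)],
    "noise_b": [random.gauss(0, 1) for _ in range(n_cars)],
}
price = [5 + 3 * x + random.gauss(0, 0.5) for x in data["engine_size"]]

full_aic = aic(rss_of_fit(list(data.values()), price), n_cars, len(data) + 1)
kept, best_aic = backward_eliminate(data, price)
print(kept, round(full_aic, 2), round(best_aic, 2))
```

Forward selection works the same way in reverse: start from the intercept-only model and add the variable that lowers the criterion the most at each step.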
Comparing the two models, the backward model has more significant variables and a higher R-squared value, suggesting it is the better model. The F-test below shows that the F-statistic is significant, so I have evidence to reject the null hypothesis that the two models do not differ significantly. Together, this is enough evidence to say that the backward model is a better predictor than the forward model.
Now let’s compare the backward model to the original model to make sure that backward elimination hasn’t made the model worse for prediction. Performing another F-test, I found that the F-statistic was not significant, so I failed to reject the null hypothesis that the models do not differ significantly. This suggests that the variables removed by backward elimination were not contributing significantly to our prediction of car prices, which is a good sign!
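For readers who have not met this kind of F-test before, the statistic for comparing a reduced model nested inside a fuller one can be computed from the two residual sums of squares. The sketch below is illustrative only: the function name and the numbers are made up, and it assumes the reduced model's variables are a subset of the full model's (true for the backward model against the original).

```python
def partial_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F-statistic for a reduced model nested inside a full model.

    rss_* are residual sums of squares; df_* are residual degrees of
    freedom (number of observations minus number of fitted coefficients).
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual degrees of freedom")
    # Extra error incurred per variable dropped from the full model.
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    # Error variance estimated from the full model.
    denominator = rss_full / df_full
    return numerator / denominator


# Made-up numbers: dropping 2 variables raises the RSS from 100 to 120.
f_stat = partial_f_statistic(rss_reduced=120.0, df_reduced=52,
                             rss_full=100.0, df_full=50)
print(f_stat)
```

The p-value is the upper tail of an F distribution with (df_reduced - df_full, df_full) degrees of freedom. A small F-statistic, and hence a large p-value, means the dropped variables explained little, which is the "not significant" outcome described above.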
Finally, I looked at the AIC values for both the original and backward models. The backward model's AIC was 3480.483, whilst the original model's AIC was 3487.939. A lower AIC indicates a better trade-off between fit and complexity, so the backward model is in fact the better fit, and we naturally choose it. I then plotted the residuals of the backward model:
As you can see, it’s almost identical to the original model’s residuals, except that the outlier is now gone, which is great.
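As a reminder of what the AIC comparison above measures, the sketch below writes out AIC for a least-squares model with Gaussian errors and computes the gap between the two reported values. The function `gaussian_aic` is illustrative and was not used to produce the figures in this post; only the two AIC values come from the analysis.

```python
import math


def gaussian_aic(rss: float, n: int, n_params: int) -> float:
    """AIC of a least-squares fit with Gaussian errors.

    n_params counts every estimated quantity: the coefficients, the
    intercept and the error variance.
    """
    log_likelihood = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return 2 * n_params - 2 * log_likelihood


# AIC values reported in the text.
aic_original = 3487.939
aic_backward = 3480.483

# A positive difference means the backward model is preferred.
delta = aic_original - aic_backward
print(round(delta, 3))
```

Each extra parameter adds 2 to the AIC, so a variable only earns its place if it improves the log-likelihood by more than 1.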
Now that I have built and optimised the model, I can test and plot its accuracy. Below is the plot of the price predicted by my model against the actual price of the car. As you can see, many of the points lie on the diagonal red line, suggesting our model is a great predictor.
I obtained an RMSE of 1585.695 on the training data, which means a typical prediction is off by roughly ±£1,586. This is expected given that car prices are large numbers. The RMSE for the test data, which accounted for 5% of the dataset, was 1447.244; this suggests we may be slightly under-fitting the dataset for more expensive cars, but not by enough to significantly impact the model's accuracy.
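For reference, RMSE is the square root of the mean squared difference between predicted and actual prices. A minimal sketch with made-up prices (not values from the dataset):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices in pounds, purely for illustration.
actual_prices = [10000, 20000, 30000]
predicted_prices = [12000, 18000, 33000]
print(round(rmse(actual_prices, predicted_prices), 2))
```

Because the errors are squared before averaging, a few large misses on expensive cars can dominate the figure, which is worth keeping in mind when comparing the training and test values.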
I then computed a single accuracy figure for the model. For each car, I took the difference between the predicted and actual price and divided it by the actual price to get the relative error of that prediction. I then averaged the absolute values of these relative errors across all predictions and expressed the result as an accuracy (one minus the average relative error). The result was 90.9% accuracy. This is a very good result and would be greatly beneficial to car retailers trying to estimate the prices of cars based on their specifications.
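The accuracy calculation described above can be sketched as follows. This assumes that "accuracy" means one minus the mean absolute percentage error, which is my reading of the description rather than the exact code used. The prices are made up.

```python
def mean_accuracy(actual, predicted):
    """One minus the mean absolute relative error of the predictions.

    Assumes every actual price is non-zero.
    """
    relative_errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 1 - sum(relative_errors) / len(relative_errors)


# Made-up prices in pounds: each prediction is off by 10% of the actual price.
actual_prices = [10000, 20000]
predicted_prices = [9000, 22000]
accuracy = mean_accuracy(actual_prices, predicted_prices)
print(f"{accuracy:.1%}")
```

Unlike RMSE, this measure is relative to each car's price, so a £1,000 miss on a £10,000 car counts the same as a £5,000 miss on a £50,000 car.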
The effect and significance of the variables in our model:
The most significant variables were: