Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles
Part 2: Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles
The raw data is located on the EPA government site
The variables/features I am using for the models are: Engine displacement (size), number of cylinders, transmission type, number of gears, air inspired method, regenerative braking type, battery capacity Ah, drivetrain, fuel type, cylinder deactivate, and variable valve.
There are 1253 vehicles in the dataset (does not include pure electric vehicles) summarized below.
fuel_economy_combined eng_disp num_cyl transmission
Min. :11.00 Min. :1.000 Min. : 3.000 A :301
1st Qu.:19.00 1st Qu.:2.000 1st Qu.: 4.000 AM : 46
Median :23.00 Median :3.000 Median : 6.000 AMS: 87
Mean :23.32 Mean :3.063 Mean : 5.533 CVT: 50
3rd Qu.:26.00 3rd Qu.:3.600 3rd Qu.: 6.000 M :148
Max. :58.00 Max. :8.000 Max. :16.000 SA :555
SCV: 66
num_gears air_aspired_method
Min. : 1.000 Naturally Aspirated :523
1st Qu.: 6.000 Other : 5
Median : 7.000 Supercharged : 55
Mean : 7.111 Turbocharged :663
3rd Qu.: 8.000 Turbocharged+Supercharged: 7
Max. :10.000
regen_brake batt_capacity_ah
No :1194 Min. : 0.0000
Electrical Regen Brake: 57 1st Qu.: 0.0000
Hydraulic Regen Brake : 2 Median : 0.0000
Mean : 0.3618
3rd Qu.: 0.0000
Max. :20.0000
drive cyl_deactivate
2-Wheel Drive, Front :345 Y: 172
2-Wheel Drive, Rear :345 N:1081
4-Wheel Drive :174
All Wheel Drive :349
Part-time 4-Wheel Drive: 40
fuel_type
Diesel, ultra low sulfur (15 ppm, maximum): 28
Gasoline (Mid Grade Unleaded Recommended) : 16
Gasoline (Premium Unleaded Recommended) :298
Gasoline (Premium Unleaded Required) :320
Gasoline (Regular Unleaded Recommended) :591
variable_valve
N: 38
Y:1215
Starting with an untuned base model:
set.seed(123)
m_svm_untuned <- svm(formula = fuel_economy_combined ~ .,
data = test)
pred_svm_untuned <- predict(m_svm_untuned, test)
yhat <- pred_svm_untuned
y <- test$fuel_economy_combined
svm_stats_untuned <- postResample(yhat, y)
svm_stats_untuned
RMSE Rsquared MAE
2.3296249 0.8324886 1.4964907
Similar to the results for the untuned boosted model. I am going to run a grid search and tune the support vector machine.
hyper_grid <- expand.grid(
cost = 2^seq(-5,5,1),
gamma= 2^seq(-5,5,1)
)
e <- NULL
for(j in 1:nrow(hyper_grid)){
set.seed(123)
m_svm_untuned <- svm(
formula = fuel_economy_combined ~ .,
data = train,
gamma = hyper_grid$gamma[j],
cost = hyper_grid$cost[j]
)
pred_svm_untuned <-predict(
m_svm_untuned,
newdata = test
)
yhat <- pred_svm_untuned
y <- test$fuel_economy_combined
e[j] <- postResample(yhat, y)[1]
cat(j, "\n")
}
which.min(e) #minimum MSE
The best tuned support vector machine has a cost of 32 and a gamma of .25.
I am going to run this combination:
set.seed(123)
m_svm_tuned <- svm(
formula = fuel_economy_combined ~ .,
data = test,
gamma = .25,
cost = 32,
scale=TRUE
)
pred_svm_tuned <- predict(m_svm_tuned,test)
yhat<-pred_svm_tuned
y<-test$fuel_economy_combined
svm_stats<-postResample(yhat,y)
svm_stats
RMSE Rsquared MAE
0.9331948 0.9712492 0.7133039
The tuned support vector machine outperforms the gradient boosted model substantially with a MSE of .87 vs a MSE of 3.25 for the gradient boosted model and a MSE of 3.67 for the random forest.
summary(m_svm_tuned)
Call:
svm(formula = fuel_economy_combined ~ ., data = test, gamma = 0.25, cost = 32, scale = TRUE)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 32
gamma: 0.25
epsilon: 0.1
Number of Support Vectors: 232
sum(abs(res)<=1) / 314
[1] 0.8503185
The model is able to predict 85% of vehicles within 1 MPG of EPA estimate. Considering I am not rounding this is a great result.
The model also does a much better job with outliers as none of the models predicted the Hyundai Ioniq well.
tmp[which(abs(res) > svm_stats[1] * 3), ] #what cars are 3 se residuals
Division Carline fuel_economy_combined pred_svm_tuned
641 HYUNDAI MOTOR COMPANY Ioniq 55 49.01012
568 TOYOTA CAMRY XSE 26 22.53976
692 Volkswagen Arteon 4Motion 23 26.45806
984 Volkswagen Atlas 19 22.23552