Data Science, Machine Learning and Predictive Analytics: R: Gradient Boosted Machine to Predict MPG for 2019 Vehicles

Continuing on the below post, I am going to use a gradient boosted machine model to predict combined miles per gallon for all 2019 motor vehicles.

Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles

The raw data is located on the EPA government site

The variables/features I am using for the models are: Engine displacement (size), number of cylinders, transmission type, number of gears, air inspired method, regenerative braking type, battery capacity Ah, drivetrain, fuel type, cylinder deactivate, and variable valve.

There are 1253 vehicles in the dataset (does not include pure electric vehicles) summarized below.

fuel_economy_combined    eng_disp        num_cyl       transmission
 Min.   :11.00         Min.   :1.000   Min.   : 3.000   A  :301     
 1st Qu.:19.00         1st Qu.:2.000   1st Qu.: 4.000   AM : 46     
 Median :23.00         Median :3.000   Median : 6.000   AMS: 87     
 Mean   :23.32         Mean   :3.063   Mean   : 5.533   CVT: 50     
 3rd Qu.:26.00         3rd Qu.:3.600   3rd Qu.: 6.000   M  :148     
 Max.   :58.00         Max.   :8.000   Max.   :16.000   SA :555     
                                                        SCV: 66     
   num_gears                      air_aspired_method
 Min.   : 1.000   Naturally Aspirated      :523     
 1st Qu.: 6.000   Other                    :  5     
 Median : 7.000   Supercharged             : 55     
 Mean   : 7.111   Turbocharged             :663     
 3rd Qu.: 8.000   Turbocharged+Supercharged:  7     
 Max.   :10.000                                     
                                                    
                 regen_brake   batt_capacity_ah 
             No        :1194   Min.   : 0.0000  
 Electrical Regen Brake:  57   1st Qu.: 0.0000  
 Hydraulic Regen Brake :   2   Median : 0.0000  
                               Mean   : 0.3618  
                               3rd Qu.: 0.0000  
                               Max.   :20.0000  
                                                
                     drive    cyl_deactivate
 2-Wheel Drive, Front   :345  Y: 172
 2-Wheel Drive, Rear    :345  N:1081
 4-Wheel Drive          :174  
 All Wheel Drive        :349  
 Part-time 4-Wheel Drive: 40  
                              
                              
                                      fuel_type   
 Diesel, ultra low sulfur (15 ppm, maximum): 28           
 Gasoline (Mid Grade Unleaded Recommended) : 16           
 Gasoline (Premium Unleaded Recommended)   :298                 
 Gasoline (Premium Unleaded Required)      :320                 
 Gasoline (Regular Unleaded Recommended)   :591                 
                                                                
                                                                
 variable_valve
 N:  38        
 Y:1215

Starting with an untuned base model:

trees <- 1200
m_boosted_reg_untuned <- gbm(
  formula = fuel_economy_combined ~ .,
  data    = train,
  n.trees = trees,
  distribution = "gaussian"
)

> summary(m_boosted_reg_untuned)
                                  var     rel.inf
eng_disp                     eng_disp 41.26273684
batt_capacity_ah     batt_capacity_ah 24.53458898
transmission             transmission 11.33253784
drive                           drive  8.59300859
regen_brake               regen_brake  8.17877824
air_aspired_method air_aspired_method  2.11397865
num_gears                   num_gears  1.90999021
fuel_type                   fuel_type  1.65692562
num_cyl                       num_cyl  0.22260369
variable_valve         variable_valve  0.11043532
cyl_deactivate         cyl_deactivate  0.08441602
> boosted_stats_untuned
     RMSE  Rsquared       MAE 
2.4262643 0.8350367 1.7513331

The untuned GBM model performs better than the multiple linear regression model, but worse than the random forest.

I am going to tune the GBM by running a grid search:

#create hyperparameter grid
hyper_grid <- expand.grid(
  shrinkage = seq(.07, .12, .01),
  interaction.depth = 1:7,
  optimal_trees = 0,
  min_RMSE = 0
)

#grid search
for (i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = fuel_economy_combined ~ .,
    data = train_random,
    distribution = "gaussian",
    n.trees = 5000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
  )
  
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$train.error)
  hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$train.error))
  
  cat(i, "\n")
}

The hyper grid is 42 rows which is all combinations of shrinkage and interaction depths specified above.

> head(hyper_grid)
  shrinkage interaction.depth optimal_trees min_RMSE
1      0.07                 1             0        0
2      0.08                 1             0        0
3      0.09                 1             0        0
4      0.10                 1             0        0
5      0.11                 1             0        0
6      0.12                 1             0        0

After running the grid search, it is apparent that there is overfitting. This is something to be very careful about. I am going to run a 5 fold cross validation to estimate out of bag error vs MSE. After running the 5 fold CV, this is the best model that does not overfit:

> m_boosted_reg <- gbm(
  formula = fuel_economy_combined ~ .,
  data    = train,
  n.trees = trees,
  distribution = "gaussian",
  shrinkage = .09,
  cv.folds = 5,
  interaction.depth = 5
)

best.iter <- gbm.perf(m_boosted_reg, method = "cv")
pred_boosted_reg_ <- predict(m_boosted_reg,n.trees=1183, newdata = test)
mse_boosted_reg_ <- RMSE(pred = pred_boosted_reg, obs = test$fuel_economy_combined) ^2
boosted_stats<-postResample(pred_boosted_reg,test$fuel_economy_combined)

The fitted black curve above is MSE and the fitted green curve is the out of bag estimated error. 1183 is the optimal amount of iterations.

> pred_boosted_reg <- predict(m_boosted_reg,n.trees=1183, newdata = test)
> mse_boosted_reg <- RMSE(pred = pred_boosted_reg, obs = test$fuel_economy_combined) ^2
> boosted_stats<-postResample(pred_boosted_reg,test$fuel_economy_combined)
> boosted_stats
     RMSE  Rsquared       MAE 
1.8018793 0.9092727 1.3334459 
> mse_boosted_reg
3.246769

The tuned gradient boosted model performs better than the random forest with a MSE of 3.25 vs 3.67 for the random forest.

> summary(res)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-5.40000 -0.90000  0.00000  0.07643  1.10000  9.10000

50% of the predictions are within 1 MPG of the EPA Government Estimate.

The largest residuals are exotics and a hybrid which are the more unique data points in the dataset.

> tmp[which(abs(res) > boosted_stats[1] * 3), ] 
                  Division            Carline fuel_economy_combined pred_boosted_reg
642  HYUNDAI MOTOR COMPANY         Ioniq Blue                    58             48.5
482 KIA MOTORS CORPORATION           Forte FE                    35             28.7
39             Lamborghini    Aventador Coupe                    11             17.2
40             Lamborghini Aventador Roadster                    11             17.2