R XGBoost Regression

In the previous posts, I used popular machine learning algorithms to fit models that predict MPG from the cars_19 dataset.  Of those models, the support vector machine produced the lowest RMSE.  In this post I am going to use XGBoost to build a predictive model and compare its RMSE to the other models.

The raw data is located on the EPA government site.

As with the other models, the variables/features I am using are: engine displacement (size), number of cylinders, transmission type, number of gears, air aspiration method, regenerative braking type, battery capacity (Ah), drivetrain, fuel type, cylinder deactivation, and variable valve timing.  Unlike the other models, the XGBoost package does not handle factors, so I have to transform them into dummy variables.  After creating the dummy variables, I will be using 33 input variables.
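The post does not show how the dummy variables were built, so the following is only a minimal sketch of one common approach, using base R's model.matrix().  The toy data frame and its values are invented for illustration and are not taken from cars_19.  The number of columns produced depends on the encoding chosen (model.matrix() drops one reference level per factor by default), so it will not necessarily match the 33 inputs used here.

```r
# Toy data frame with two factor columns; all values are invented.
toy <- data.frame(
  mpg          = c(21, 28, 15, 30),
  eng_disp     = c(3.5, 1.8, 6.2, 1.5),
  transmission = factor(c("A", "AM", "A", "CVT")),
  drive        = factor(c("FWD", "AWD", "RWD", "FWD"))
)

# model.matrix() expands each factor into 0/1 indicator columns.
# The first column is the intercept, which is dropped here.
x <- model.matrix(mpg ~ ., data = toy)[, -1]

# Put the response in column 1 and the numeric predictors after it,
# mirroring the train[, 1] / train[, 2:34] layout used in this post.
toy_matrix <- cbind(mpg = toy$mpg, x)
colnames(toy_matrix)
```

The result is a plain numeric matrix, which is the kind of input XGBoost expects.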

str(cars_19)
'data.frame':    1253 obs. of  12 variables:
 $ fuel_economy_combined: int  21 28 21 26 28 11 15 18 17 15 ...
 $ eng_disp             : num  3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
 $ num_cyl              : int  6 4 8 4 4 16 8 8 8 8 ...
 $ transmission         : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
 $ num_gears            : int  9 6 8 7 8 7 8 8 8 7 ...
 $ air_aspired_method   : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
 $ regen_brake          : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
 $ batt_capacity_ah     : num  4.25 0 0 0 0 0 0 0 0 0 ...
 $ drive                : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
 $ fuel_type            : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
 $ cyl_deactivate       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
 $ variable_valve       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...

After getting a working model and estimating the eta and tree depth hyperparameters by trial and error, I am going to run a grid search over 64 XGBoost models (4 tree depths × 16 eta values).

#create hyperparameter grid
hyper_grid <- expand.grid(max_depth = seq(3, 6, 1), eta = seq(.2, .35, .01))  
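As a quick sanity check on the size of the grid, the same expand.grid() call can be inspected directly.  This is just a small sketch that uses base R only:

```r
hyper_grid <- expand.grid(max_depth = seq(3, 6, 1), eta = seq(.2, .35, .01))

# Number of distinct values per hyperparameter
n_depth <- length(unique(hyper_grid$max_depth))
n_eta   <- length(unique(hyper_grid$eta))

# One row per combination of max_depth and eta
n_models <- nrow(hyper_grid)
c(depths = n_depth, etas = n_eta, models = n_models)

# expand.grid() varies the first argument fastest
head(hyper_grid)
```

Each row of hyper_grid is one (max_depth, eta) pair, so looping over the rows visits every combination once.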

I fit each combination in a for loop, using 5-fold cross-validation:

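# pre-allocate vectors to hold the RMSE of each grid combination
xgb_train_rmse <- numeric(nrow(hyper_grid))
xgb_test_rmse <- numeric(nrow(hyper_grid))

for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  m_xgb_untuned <- xgb.cv(
    data = train[, 2:34],
    label = train[, 1],
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    nfold = 5,
    max_depth = hyper_grid$max_depth[j],
    eta = hyper_grid$eta[j]
  )
  
  # keep the mean train and test RMSE at the iteration chosen by early stopping
  xgb_train_rmse[j] <- m_xgb_untuned$evaluation_log$train_rmse_mean[m_xgb_untuned$best_iteration]
  xgb_test_rmse[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
  
  # print progress
  cat(j, "\n")
}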
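Once the loop has finished, the best combination can be read off the grid.  The sketch below shows one way to do that in base R.  The RMSE values in it are synthetic stand-ins generated with runif(), not the results of the grid search, so the row it selects has no bearing on the result reported in this post.

```r
hyper_grid <- expand.grid(max_depth = seq(3, 6, 1), eta = seq(.2, .35, .01))

# Stand-in values only: in the real workflow this vector is filled by the
# cross-validation loop.  These numbers are synthetic.
set.seed(1)
xgb_test_rmse <- runif(nrow(hyper_grid), min = 1.5, max = 2.5)

# Row index of the combination with the lowest cross-validated test RMSE
best <- which.min(xgb_test_rmse)
hyper_grid[best, ]

# Attach the RMSE to the grid and list the five best combinations
results <- cbind(hyper_grid, test_rmse = xgb_test_rmse)
head(results[order(results$test_rmse), ], 5)
```

Sorting the whole grid, rather than only taking the minimum, also shows how close the runner-up combinations are.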

An eta of .25 and a max tree depth of 6 produce the model with the lowest test RMSE.

I am going to run this combination below:

m1_xgb <-
  xgboost(
    data = train[, 2:34],
    label = train[, 1],
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    max_depth = 6,
    eta = .25
  )

Performance metrics for this model:

RMSE      Rsquared   MAE
1.7374    0.8998     1.231
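The post does not show how these three metrics were computed.  The column names match those returned by caret's postResample(), but that is an assumption.  As a minimal base R sketch of the usual definitions (with R-squared taken as the squared correlation between observed and predicted values, one common convention), using invented numbers rather than the cars_19 predictions:

```r
# Invented observed and predicted values, purely for illustration
obs  <- c(21, 28, 15, 30, 18)
pred <- c(22, 27, 16, 29, 20)

# Root mean squared error: penalizes large errors more heavily
rmse <- sqrt(mean((obs - pred)^2))

# R-squared as the squared correlation of observed and predicted
rsq <- cor(obs, pred)^2

# Mean absolute error: average size of the errors
mae <- mean(abs(obs - pred))

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 4)
```

With real data, obs would be the held-out MPG values and pred the model's predictions for the same rows.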

[Figure: XGBoost feature importance, showing the most explanatory features]

[Figure: first 3 trees]

[Figure: residuals]

[Figure: fit]

Comparison of RMSE:

svm = .93
XGBoost = 1.74
gradient boosting = 1.8
random forest = 1.9
neural network = 2.06
decision tree = 2.49
mlr = 2.6



R Robustreg Package Downloads

I built robustreg in 2006, when the major stat packages did not have robust regression available.  Below are graphs of weekly and cumulative downloads from just the RStudio mirror.  I would estimate total downloads at over 150,000.



The median_rcpp() function is written in C++ and, in the timing below on one million random values, is about four times faster than the base R function median() (0.011 vs. 0.044 seconds elapsed).

> r_norm <- rnorm(1000000)
> system.time(median(r_norm))
   user  system elapsed 
  0.040   0.004   0.044 
> system.time(median_rcpp(r_norm))
   user  system elapsed 
  0.011   0.000   0.011 
> median(r_norm)
[1] -0.001214243
> median_rcpp(r_norm)
[1] -0.001214243