R TensorFlow Multiple Linear Regression

In the previous three posts I used multiple linear regression, decision trees, gradient boosting, and a support vector machine to predict miles per gallon for 2019 vehicles.  The SVM produced the best model.  In this post, I am going to run TensorFlow through R and fit a multiple linear regression model on the same data to predict MPG.

Part 1: Multiple Linear Regression using R

There are 1253 vehicles in the cars_19 dataset.  I am running multiple linear regression through TensorFlow purely for demonstration; for a dataset this small, lm() in R is more efficient and more precise.

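TensorFlow fits the model with an iterative optimizer that has to converge, whereas R computes the closed-form least-squares estimates of beta.  I will be using 11 features and an intercept.  Once the 7 factors are expanded into dummy variables, the design matrix has 27 columns (4 numeric, 22 dummies, and the intercept), so the closed-form solution only involves a 27 x 27 system, which lm() solves with a QR decomposition rather than an explicit inverse.  That is not computationally expensive with today's technology.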
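
To make the contrast concrete, here is a minimal sketch in base R. It uses simulated data, not cars_19, and plain batch gradient descent as a stand-in for an iterative optimizer; the optimizer TensorFlow actually uses for this estimator is not shown in this post, and all names and the learning rate are my own choices.

```r
# Simulated data (not cars_19): intercept plus two predictors
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))
beta_true <- c(2, -1, 0.5)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.1))

# Closed form: solve the normal equations (X'X) b = X'y in one step
beta_closed <- as.vector(solve(crossprod(X), crossprod(X, y)))

# Iterative: batch gradient descent on the mean squared error
beta_gd <- rep(0, ncol(X))
lr <- 0.1  # learning rate (assumed; chosen small enough to converge here)
for (i in 1:2000) {
  grad <- as.vector(-2 / n * crossprod(X, y - X %*% beta_gd))
  beta_gd <- beta_gd - lr * grad
}

# Largest difference between the two sets of estimates
max(abs(beta_closed - beta_gd))
```

With enough iterations the two sets of estimates agree to many decimal places, but the iterative version needs a learning rate and a stopping rule, which is why lm() is the simpler choice at this scale.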

The 11 features in the dataset below consist of 7 factor variables and 4 numeric variables.

str(cars_19)
'data.frame':    1253 obs. of  12 variables:
 $ fuel_economy_combined: int  21 28 21 26 28 11 15 18 17 15 ...
 $ eng_disp             : num  3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
 $ num_cyl              : int  6 4 8 4 4 16 8 8 8 8 ...
 $ transmission         : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
 $ num_gears            : int  9 6 8 7 8 7 8 8 8 7 ...
 $ air_aspired_method   : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
 $ regen_brake          : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
 $ batt_capacity_ah     : num  4.25 0 0 0 0 0 0 0 0 0 ...
 $ drive                : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
 $ fuel_type            : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
 $ cyl_deactivate       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
 $ variable_valve       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...

The factors need to be transformed into a format TensorFlow can understand.

library(tfestimators)

cols <- feature_columns(
  column_numeric(colnames(cars_19[c(2, 3, 5, 8)])),
  column_categorical_with_identity("transmission", num_buckets = 7),
  column_categorical_with_identity("air_aspired_method", num_buckets = 5),
  column_categorical_with_identity("regen_brake", num_buckets = 3),
  column_categorical_with_identity("drive", num_buckets = 5),
  column_categorical_with_identity("fuel_type", num_buckets = 5),
  column_categorical_with_identity("cyl_deactivate", num_buckets = 2),
  column_categorical_with_identity("variable_valve", num_buckets = 2)
  )
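
One way to get factors into that shape is to pass each one as a zero-based integer code, since TensorFlow's identity categorical column expects integer ids from 0 to num_buckets - 1. The sketch below uses a made-up toy data frame, not cars_19, and is an assumption about the preprocessing; this post does not show how the factors were actually encoded.

```r
# Toy data frame standing in for cars_19 (hypothetical values)
toy <- data.frame(
  transmission   = factor(c("A", "M", "SA", "A")),
  cyl_deactivate = factor(c("N", "Y", "N", "N")),
  eng_disp       = c(2.0, 3.5, 1.8, 6.2)
)

# Find the factor columns and record how many levels each one has
is_fac <- vapply(toy, is.factor, logical(1))
num_buckets <- vapply(toy[is_fac], nlevels, integer(1))

# Replace each factor with zero-based integer codes (R codes start at 1)
toy[is_fac] <- lapply(toy[is_fac], function(f) as.integer(f) - 1L)

num_buckets
toy
```

The level counts in num_buckets are the values that would go into the num_buckets arguments above.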

Create an input function:

#input_fn for a given subset of data
cars_19_input_fn <- function(data, num_epochs = 1) {
  input_fn(
    data,
    features = colnames(cars_19[c(2:12)]),
    response = "fuel_economy_combined",
    batch_size = 64,
    num_epochs = num_epochs
  )
}

Train, evaluate, predict:

model <- linear_regressor(feature_columns = cols)

set.seed(123)
indices <- sample(1:nrow(cars_19), size = 0.75 * nrow(cars_19))
train <- cars_19[indices, ]
test <- cars_19[-indices, ]

#train model
model %>% train(cars_19_input_fn(train, num_epochs = 1000))

#evaluate model
model %>% evaluate(cars_19_input_fn(test))

#predict
yhat <- model %>% predict(cars_19_input_fn(test))
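
As a quick check on the sizes implied by the split and the input function above, here is a small sketch in base R. It relies on sample() truncating the fractional size 0.75 * 1253 = 939.75 down to 939; the variable names are mine, and how the final partial batch is handled depends on the input function, which I have not verified.

```r
n_total <- 1253
n_train <- floor(0.75 * n_total)  # rows drawn by sample()
n_test  <- n_total - n_train      # rows left over for testing
batch_size <- 64

# Approximate number of batches needed to pass over the training data once
batches_per_epoch <- ceiling(n_train / batch_size)

c(train = n_train, test = n_test, batches_per_epoch = batches_per_epoch)
```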

The results are very close to the R closed-form estimates:
postResample(yhat, y)
     RMSE  Rsquared       MAE 
2.5583185 0.7891934 1.9381757 
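
For reference, the three numbers postResample() reports can be reproduced in base R. The sketch below uses made-up vectors, not the model's predictions, and assumes caret's default definitions: root mean squared error, the squared Pearson correlation for Rsquared, and mean absolute error.

```r
# Made-up observed and predicted MPG values (for illustration only)
obs  <- c(21, 28, 21, 26, 28)
pred <- c(22, 27, 20, 26, 30)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
r2   <- cor(pred, obs)^2            # squared correlation
mae  <- mean(abs(pred - obs))       # mean absolute error

c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

Squaring the RMSE gives the MSE, which is the figure used when comparing models later on.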



R: SVM to Predict MPG for 2019 Vehicles

Continuing from the posts below, I am going to use a support vector machine (SVM) to predict combined miles per gallon for all 2019 motor vehicles.

Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles

Part 2: Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles

The raw data is located on the EPA government site.

The variables/features I am using for the models are: engine displacement (size), number of cylinders, transmission type, number of gears, air aspiration method, regenerative braking type, battery capacity Ah, drivetrain, fuel type, cylinder deactivation, and variable valve.

There are 1253 vehicles in the dataset (does not include pure electric vehicles) summarized below.
fuel_economy_combined    eng_disp        num_cyl       transmission
 Min.   :11.00         Min.   :1.000   Min.   : 3.000   A  :301     
 1st Qu.:19.00         1st Qu.:2.000   1st Qu.: 4.000   AM : 46     
 Median :23.00         Median :3.000   Median : 6.000   AMS: 87     
 Mean   :23.32         Mean   :3.063   Mean   : 5.533   CVT: 50     
 3rd Qu.:26.00         3rd Qu.:3.600   3rd Qu.: 6.000   M  :148     
 Max.   :58.00         Max.   :8.000   Max.   :16.000   SA :555     
                                                        SCV: 66     
   num_gears                      air_aspired_method
 Min.   : 1.000   Naturally Aspirated      :523     
 1st Qu.: 6.000   Other                    :  5     
 Median : 7.000   Supercharged             : 55     
 Mean   : 7.111   Turbocharged             :663     
 3rd Qu.: 8.000   Turbocharged+Supercharged:  7     
 Max.   :10.000                                     
                                                    
                 regen_brake   batt_capacity_ah 
             No        :1194   Min.   : 0.0000  
 Electrical Regen Brake:  57   1st Qu.: 0.0000  
 Hydraulic Regen Brake :   2   Median : 0.0000  
                               Mean   : 0.3618  
                               3rd Qu.: 0.0000  
                               Max.   :20.0000  
                                                
                     drive    cyl_deactivate
 2-Wheel Drive, Front   :345  Y: 172
 2-Wheel Drive, Rear    :345  N:1081
 4-Wheel Drive          :174  
 All Wheel Drive        :349  
 Part-time 4-Wheel Drive: 40  
                              
                              
                                      fuel_type   
 Diesel, ultra low sulfur (15 ppm, maximum): 28           
 Gasoline (Mid Grade Unleaded Recommended) : 16           
 Gasoline (Premium Unleaded Recommended)   :298                 
 Gasoline (Premium Unleaded Required)      :320                 
 Gasoline (Regular Unleaded Recommended)   :591                 
                                                                
                                                                
 variable_valve
 N:  38        
 Y:1215        

Starting with an untuned base model:
library(e1071)  #svm()
library(caret)  #postResample()

set.seed(123)
m_svm_untuned <- svm(formula = fuel_economy_combined ~ .,
                     data    = test)

pred_svm_untuned <- predict(m_svm_untuned, test)

yhat <- pred_svm_untuned
y <- test$fuel_economy_combined
svm_stats_untuned <- postResample(yhat, y)

svm_stats_untuned
     RMSE  Rsquared       MAE 
2.3296249 0.8324886 1.4964907 

This is similar to the results for the untuned boosted model.  I am going to run a grid search to tune the support vector machine.

hyper_grid <- expand.grid(
  cost  = 2^seq(-5, 5, 1),
  gamma = 2^seq(-5, 5, 1)
)
e <- numeric(nrow(hyper_grid))  #test RMSE for each cost/gamma pair

for(j in 1:nrow(hyper_grid)){
  set.seed(123)
  #fit on train with the j-th cost/gamma pair
  m_svm_grid <- svm(
    formula = fuel_economy_combined ~ .,
    data    = train,
    gamma   = hyper_grid$gamma[j],
    cost    = hyper_grid$cost[j]
  )

  #score on test
  pred_svm_grid <- predict(
    m_svm_grid,
    newdata = test
  )

  yhat <- pred_svm_grid
  y <- test$fuel_economy_combined
  e[j] <- postResample(yhat, y)[1]  #first element is RMSE
  cat(j, "\n")  #progress
}

which.min(e)  #row of hyper_grid with the minimum RMSE

The best tuned support vector machine has a cost of 32 and a gamma of .25.
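
To see where that combination sits in the search space, the grid can be rebuilt on its own without fitting anything. This is only a sketch of the grid layout; it relies on expand.grid() varying its first argument fastest, and the variable names are mine.

```r
# Same grid as in the search: powers of two from 2^-5 to 2^5
hyper_grid <- expand.grid(
  cost  = 2^seq(-5, 5, 1),
  gamma = 2^seq(-5, 5, 1)
)

n_combos <- nrow(hyper_grid)  # number of models the loop fits

# Row holding the reported best pair (cost = 32, gamma = 0.25)
best_row <- which(hyper_grid$cost == 32 & hyper_grid$gamma == 0.25)

# Is the selected cost on the upper edge of the grid?
cost_on_edge <- hyper_grid$cost[best_row] == max(hyper_grid$cost)

c(n_combos = n_combos, best_row = best_row, cost_on_edge = cost_on_edge)
```

A cost of 32 is the largest value in the grid, so a wider cost range may be worth trying before settling on it.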

I am going to run this combination:

set.seed(123)
m_svm_tuned <- svm(
  formula = fuel_economy_combined ~ .,
  data    = test,
  gamma = .25,
  cost = 32,
  scale=TRUE
  )  

pred_svm_tuned <- predict(m_svm_tuned, test)

yhat <- pred_svm_tuned
y <- test$fuel_economy_combined
svm_stats <- postResample(yhat, y)


svm_stats
     RMSE  Rsquared       MAE 
0.9331948 0.9712492 0.7133039 


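The tuned support vector machine substantially outperforms the gradient boosted model, with an MSE of .87 (the RMSE above squared) vs an MSE of 3.25 for the gradient boosted model and 3.67 for the random forest.  Note that the tuned model above is fit on test and scored on the same rows, so the .87 is an in-sample figure.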

summary(m_svm_tuned)

Call:
svm(formula = fuel_economy_combined ~ ., data = test, gamma = 0.25, cost = 32, scale = TRUE)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  radial 
       cost:  32 
      gamma:  0.25 
    epsilon:  0.1 


Number of Support Vectors:  232
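
The summary reports a radial kernel with gamma = 0.25 and epsilon = 0.1. As a rough sketch of what those two parameters mean, the code below assumes the standard definitions of the radial (RBF) kernel, exp(-gamma * ||u - v||^2), and of the epsilon-insensitive loss used in eps-regression. The function names are mine, and svm() works on scaled data when scale = TRUE, so these values are not on the raw MPG scale.

```r
# Radial kernel: similarity decays with squared distance, at a rate set by gamma
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Epsilon-insensitive loss: residuals smaller than epsilon cost nothing
eps_loss <- function(residual, epsilon) {
  pmax(abs(residual) - epsilon, 0)
}

rbf_kernel(c(0, 0), c(0, 0), gamma = 0.25)  # identical points
rbf_kernel(c(1, 0), c(0, 0), gamma = 0.25)  # points one unit apart
eps_loss(c(0.05, 0.3), epsilon = 0.1)       # inside and outside the tube
```

A larger gamma makes the kernel more local, and a larger epsilon widens the band of residuals that the fit ignores.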
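
res <- y - yhat  #residuals (assumed definition; not shown in the post)
sum(abs(res) <= 1) / 314  #share of the 314 test vehicles within 1 MPG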
[1] 0.8503185  

The model is able to predict 85% of vehicles within 1 MPG of the EPA estimate.  Considering I am not rounding, this is a great result.

The model also does a much better job with outliers; none of the earlier models predicted the Hyundai Ioniq well.

tmp[which(abs(res) > svm_stats[1] * 3), ] #which cars have residuals beyond 3 x RMSE
                 Division        Carline fuel_economy_combined pred_svm_tuned
641 HYUNDAI MOTOR COMPANY          Ioniq                    55       49.01012
568                TOYOTA      CAMRY XSE                    26       22.53976
692            Volkswagen Arteon 4Motion                    23       26.45806
984            Volkswagen          Atlas                    19       22.23552
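
The two residual checks above can be sketched on their own in base R. The vectors below are made up for illustration, not taken from the test set, and the names are mine; the residual is assumed to be observed minus predicted.

```r
# Made-up observed and predicted MPG values (for illustration only)
y    <- c(55, 26, 23, 19, 30, 21, 24, 33, 18, 27, 22, 25)
yhat <- c(49, 25.5, 23.4, 19.2, 29.1, 21.3, 24.2, 32.6, 18.1, 27.3, 21.8, 25.4)

res  <- y - yhat             # residuals
rmse <- sqrt(mean(res^2))    # same quantity as svm_stats[1] above

# Share of vehicles predicted within 1 MPG
share_within_1 <- mean(abs(res) <= 1)

# Vehicles whose residual exceeds 3 x RMSE
flagged <- which(abs(res) > 3 * rmse)

share_within_1
flagged
```

Using mean() on the logical vector avoids hard-coding the number of test rows.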