Graphing California Electricity Supply using ggplot2 during record temperatures 9/05/2022 - 09/09/2022
Raw data from CA ISO. Data is available in 5 minute increments for each 24 hour period.
In previous posts, I used popular machine learning algorithms to fit models to predict MPG using the cars_19 dataset, a dataset I created from publicly available data from the Environmental Protection Agency. Support vector machine was the clear winner, producing the model with the lowest RMSE. In this post I am going to use LightGBM to build a predictive model and compare the RMSE to the other models.
The raw data is located on the EPA government site.
Similar to the other models, the variables/features I am using are: engine displacement (size), number of cylinders, transmission type, number of gears, air inspired method, regenerative braking type, battery capacity Ah, drivetrain, fuel type, cylinder deactivation, and variable valve. The LightGBM package does not handle factors, so I will have to transform them into dummy variables. After creating the dummy variables, I will be using 33 input variables.
str(cars_19)
'data.frame': 1253 obs. of 12 variables:
$ fuel_economy_combined: int 21 28 21 26 28 11 15 18 17 15 ...
$ eng_disp : num 3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
$ num_cyl : int 6 4 8 4 4 16 8 8 8 8 ...
$ transmission : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
$ num_gears : int 9 6 8 7 8 7 8 8 8 7 ...
$ air_aspired_method : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
$ regen_brake : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
$ batt_capacity_ah : num 4.25 0 0 0 0 0 0 0 0 0 ...
$ drive : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
$ fuel_type : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
$ cyl_deactivate : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
$ variable_valve : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
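The data preparation is not shown in the post; here is a minimal sketch, assuming model.matrix() for the dummy coding and the same 75/25 split used in the other posts, that produces the train, test, y_train, y_test, train_lgb, and test_lgb objects used below:

library(lightgbm)
library(caret) #for RMSE() and postResample()

#dummy code the factors (the post ends up with 33 input variables)
X <- model.matrix(fuel_economy_combined ~ . - 1, data = cars_19)

set.seed(123)
indices <- sample(1:nrow(cars_19), size = 0.75 * nrow(cars_19))
train <- cbind(fuel_economy_combined = cars_19$fuel_economy_combined[indices], X[indices, ])
test <- cbind(fuel_economy_combined = cars_19$fuel_economy_combined[-indices], X[-indices, ])

y_train <- train[, 1]
y_test <- test[, 1]

#LightGBM's native dataset format; the test set is registered as a validation set
train_lgb <- lgb.Dataset(data = train[, 2:34], label = y_train)
test_lgb <- lgb.Dataset.create.valid(train_lgb, data = test[, 2:34], label = y_test)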
One of the biggest challenges with this dataset is that it is small for machine learning: the training set is 939 rows and the test set only 314 rows. In an ideal situation there would be more data, but this is real data and all of the data that is available.
After getting a working model and performing trial and error exploratory analysis to estimate the hyperparameters, I am going to run a grid search using:
max_depth
num_leaves
num_iterations
early_stopping_rounds
learning_rate
As a general rule of thumb, num_leaves = 2^(max_depth), and num_leaves and max_depth need to be tuned together to prevent overfitting. Solving for max_depth:
max_depth = round(log(num_leaves) / log(2),0)
This is just a guideline; I found that values of both hyperparameters higher than those in the final hyper_grid below caused the model to overfit.
After running a few grid searches, the final hyper_grid I am looking to optimize (minimize RMSE) is 4950 rows. This runs fairly quickly on a Mac mini with the M1 processor and 16 GB RAM, making use of the early_stopping_rounds parameter.
#grid search
#create hyperparameter grid
num_leaves = seq(20, 28, 1)
max_depth = round(log(num_leaves) / log(2), 0)
num_iterations = seq(200, 400, 50)
early_stopping_rounds = round(num_iterations * .1, 0)

hyper_grid <- expand.grid(max_depth = max_depth,
                          num_leaves = num_leaves,
                          num_iterations = num_iterations,
                          early_stopping_rounds = early_stopping_rounds,
                          learning_rate = seq(.45, .50, .005))
hyper_grid <- unique(hyper_grid)
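A quick check that the deduplicated grid has the 4,950 rows mentioned above:

nrow(hyper_grid)
[1] 4950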
Running a for loop over the grid (rmse_fit and rmse_predict collect the train and test RMSE for each row):

rmse_fit <- NULL
rmse_predict <- NULL
for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  light_gbn_tuned <- lgb.train(
    params = list(
      objective = "regression",
      metric = "l2",
      max_depth = hyper_grid$max_depth[j],
      num_leaves = hyper_grid$num_leaves[j],
      num_iterations = hyper_grid$num_iterations[j],
      early_stopping_rounds = hyper_grid$early_stopping_rounds[j],
      learning_rate = hyper_grid$learning_rate[j]
      #feature_fraction = .9
    ),
    valids = list(test = test_lgb),
    data = train_lgb
  )
  yhat_fit_tuned <- predict(light_gbn_tuned, train[, 2:34])
  yhat_predict_tuned <- predict(light_gbn_tuned, test[, 2:34])
  rmse_fit[j] <- RMSE(y_train, yhat_fit_tuned)
  rmse_predict[j] <- RMSE(y_test, yhat_predict_tuned)
  cat(j, "\n")
}
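With the loop finished, the winning combination can be read off the grid; a small sketch:

hyper_grid[which.min(rmse_predict), ] #row with the lowest test RMSE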
I am going to run the following as the final model:
set.seed(123)
light_gbn_final <- lgb.train(
  params = list(
    objective = "regression",
    metric = "l2",
    max_depth = 4,
    num_leaves = 23,
    num_iterations = 400,
    early_stopping_rounds = 40,
    learning_rate = .48
    #feature_fraction = .8
  ),
  valids = list(test = test_lgb),
  data = train_lgb
)
yhat_predict_final <- predict(light_gbn_final, test[, 2:34])
postResample(y_test, yhat_predict_final)
RMSE Rsquared MAE
1.7031942 0.9016161 1.2326575
Graph of features that are most explanatory:
r <- y_test - yhat_predict_final #test set residuals
rmse_predict_final <- RMSE(y_test, yhat_predict_final)

> sum(abs(r) <= rmse_predict_final) / length(y_test)
[1] 0.7547771
> sum(abs(r) <= 2 * rmse_predict_final) / length(y_test)
[1] 0.9522293
> summary(r)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-11.21159  -0.96398   0.06337  -0.02708   0.96796   5.77861

75% of the predictions are within one RMSE of the actual values, and 95% are within two.
Comparison of RMSE:
svm = .93
lightGBM = 1.7
XGBoost = 1.74
gradient boosting = 1.8
random forest = 1.9
neural network = 2.06
decision tree = 2.49
mlr = 2.6
In previous posts, I used popular machine learning algorithms to fit models to best predict MPG using the cars_19 dataset. Support vector machine produced the lowest RMSE. In this post I am going to use XGBoost to build a predictive model and compare the RMSE to the other models.
str(cars_19)
'data.frame': 1253 obs. of 12 variables:
$ fuel_economy_combined: int 21 28 21 26 28 11 15 18 17 15 ...
$ eng_disp : num 3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
$ num_cyl : int 6 4 8 4 4 16 8 8 8 8 ...
$ transmission : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
$ num_gears : int 9 6 8 7 8 7 8 8 8 7 ...
$ air_aspired_method : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
$ regen_brake : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
$ batt_capacity_ah : num 4.25 0 0 0 0 0 0 0 0 0 ...
$ drive : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
$ fuel_type : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
$ cyl_deactivate : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
$ variable_valve : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
After getting a working model and performing trial-and-error exploratory analysis to estimate the eta and tree depth hyperparameters, I am going to run a grid search, fitting 64 XGBoost models (4 tree depths x 16 eta values):
#create hyperparameter grid
hyper_grid <- expand.grid(max_depth = seq(3, 6, 1), eta = seq(.2, .35, .01))
Using a for loop and 5-fold CV (xgb_train_rmse and xgb_test_rmse collect the CV RMSE at the best iteration for each row of the grid):

xgb_train_rmse <- NULL
xgb_test_rmse <- NULL
for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  m_xgb_untuned <- xgb.cv(
    data = train[, 2:34],
    label = train[, 1],
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    nfold = 5,
    max_depth = hyper_grid$max_depth[j],
    eta = hyper_grid$eta[j]
  )
  xgb_train_rmse[j] <- m_xgb_untuned$evaluation_log$train_rmse_mean[m_xgb_untuned$best_iteration]
  xgb_test_rmse[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
  cat(j, "\n")
}
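A small sketch of reading off the winner, which motivates the combination run below:

hyper_grid[which.min(xgb_test_rmse), ] #eta = .25, max_depth = 6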
An eta of .25 and a max tree depth of 6 produce the model with the lowest test RMSE. I am going to run this combination below:
m1_xgb <-
xgboost(
data = train[, 2:34],
label = train[, 1],
nrounds = 1000,
objective = "reg:squarederror",
early_stopping_rounds = 3,
max_depth = 6,
eta = .25
)
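The test statistics below are reported without the evaluation code; a minimal sketch that would produce them (pred_xgb is an assumed name):

pred_xgb <- predict(m1_xgb, test[, 2:34])
postResample(pred_xgb, test[, 1])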
RMSE Rsquared MAE
1.7374 0.8998 1.231
Graph of features that are most explanatory:
[Residual and fit plots]
Comparison of RMSE:
svm = .93
XGBoost = 1.74
gradient boosting = 1.8
random forest = 1.9
neural network = 2.06
decision tree = 2.49
mlr = 2.6
Benchmarking a C++ median implementation (median_rcpp, built with Rcpp) against base R's median() on one million random normals:
> r_norm <- rnorm(1000000)
> system.time(median(r_norm))
user system elapsed
0.040 0.004 0.044
> system.time(median_rcpp(r_norm))
user system elapsed
0.011 0.000 0.011
> median(r_norm)
[1] -0.001214243
> median_rcpp(r_norm)
[1] -0.001214243
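The definition of median_rcpp is not shown; a minimal sketch, assuming an implementation built on std::nth_element (a partial selection rather than a full sort, which is where the speedup over median() comes from):

library(Rcpp)
cppFunction('
double median_rcpp(NumericVector x) {
  NumericVector y = clone(x); //work on a copy; nth_element reorders elements
  int n = y.size();
  std::nth_element(y.begin(), y.begin() + n / 2, y.end());
  if (n % 2 == 1) return y[n / 2];
  //even n: average the upper middle with the largest element of the left partition
  double hi = y[n / 2];
  double lo = *std::max_element(y.begin(), y.begin() + n / 2);
  return (hi + lo) / 2.0;
}', includes = "#include <algorithm>")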
Next, a deep neural network fit with the tfestimators interface to TensorFlow:
str(cars_19)
'data.frame': 1253 obs. of 12 variables:
$ fuel_economy_combined: int 21 28 21 26 28 11 15 18 17 15 ...
$ eng_disp : num 3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
$ num_cyl : int 6 4 8 4 4 16 8 8 8 8 ...
$ transmission : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
$ num_gears : int 9 6 8 7 8 7 8 8 8 7 ...
$ air_aspired_method : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
$ regen_brake : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
$ batt_capacity_ah : num 4.25 0 0 0 0 0 0 0 0 0 ...
$ drive : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
$ fuel_type : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
$ cyl_deactivate : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
$ variable_valve : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
cols <- feature_columns(
  column_numeric(colnames(cars_19[c(2, 3, 5, 8)])),
  column_embedding(column_categorical_with_identity("transmission", num_buckets = 7), dimension = 1),
  column_embedding(column_categorical_with_identity("air_aspired_method", num_buckets = 5), dimension = 1),
  column_embedding(column_categorical_with_identity("regen_brake", num_buckets = 3), dimension = 1),
  column_embedding(column_categorical_with_identity("drive", num_buckets = 5), dimension = 1),
  column_embedding(column_categorical_with_identity("fuel_type", num_buckets = 5), dimension = 1),
  column_embedding(column_categorical_with_identity("cyl_deactivate", num_buckets = 2), dimension = 1),
  column_embedding(column_categorical_with_identity("variable_valve", num_buckets = 2), dimension = 1)
)
#Create a deep neural network (DNN) estimator.
model <- dnn_regressor(hidden_units = c(7, 3), feature_columns = cols)
set.seed(123)
indices <- sample(1:nrow(cars_19), size = 0.75 * nrow(cars_19))
train <- cars_19[indices, ]
test <- cars_19[-indices, ]
#train model; cars_19_input_fn is the input_fn helper shown in the linear regressor section below
model %>% train(cars_19_input_fn(train, num_epochs = 1000))
#evaluate model
model %>% evaluate(cars_19_input_fn(test))
#predict
yhat <- model %>% predict(cars_19_input_fn(test))
yhat <- unlist(yhat)
y <- test$fuel_economy_combined
postResample(yhat, y)
RMSE Rsquared MAE
1.9640173 0.8700275 1.4838347
Next, a neural network fit with the neuralnet package; the continuous variables are min-max scaled and the factors dummy coded:
str(cars_19)
'data.frame': 1253 obs. of 12 variables:
$ fuel_economy_combined: int 21 28 21 26 28 11 15 18 17 15 ...
$ eng_disp : num 3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
$ num_cyl : int 6 4 8 4 4 16 8 8 8 8 ...
$ transmission : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
$ num_gears : int 9 6 8 7 8 7 8 8 8 7 ...
$ air_aspired_method : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
$ regen_brake : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
$ batt_capacity_ah : num 4.25 0 0 0 0 0 0 0 0 0 ...
$ drive : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
$ fuel_type : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
$ cyl_deactivate : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
$ variable_valve : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
maxs <- apply(cars_19[, c(1:3, 5, 8)], 2, max)
mins <- apply(cars_19[, c(1:3, 5, 8)], 2, min)
scaled <- as.data.frame(scale(cars_19[, c(1:3, 5, 8)], center = mins, scale = maxs - mins))
tmp <- data.frame(scaled, cars_19[, c(4, 6, 7, 9:12)])
n <- names(cars_19)
f <- as.formula(paste("fuel_economy_combined ~", paste(n[!n %in% "fuel_economy_combined"], collapse = " + ")))
m <- model.matrix(f, data = tmp)
m <- as.matrix(data.frame(m, tmp[, 1]))
colnames(m)[28] <- "fuel_economy_combined"
#geometric pyramid rule of thumb for sizing the hidden layers
INP_num = ncol(m) - 2 #number of input neurons: all columns except the intercept and the response
OUT_num = 1 #number of output neurons
r = (INP_num/OUT_num)^(1/3)
HID1_num = OUT_num*(r^2) #number of neurons in the first hidden layer
HID2_num = OUT_num*r #number of neurons in the second hidden layer
set.seed(123)
indices <- sample(1:nrow(cars_19), size = 0.75 * nrow(cars_19))
train <- m[indices,]
test <- m[-indices,]
n <- colnames(m)[2:28]
f <- as.formula(paste("fuel_economy_combined ~", paste(n[!n %in% "fuel_economy_combined"], collapse = " + ")))
m1_nn <- neuralnet(f,
                   data = train,
                   hidden = c(7, 3),
                   linear.output = TRUE)
pred_nn <- predict(m1_nn, test)
yhat <- pred_nn * (max(cars_19$fuel_economy_combined) - min(cars_19$fuel_economy_combined)) + min(cars_19$fuel_economy_combined)
y <- test[, 28] * (max(cars_19$fuel_economy_combined) - min(cars_19$fuel_economy_combined)) + min(cars_19$fuel_economy_combined)
postResample(yhat, y)
RMSE Rsquared MAE
2.0036294 0.8688363 1.4894264
To get a more stable estimate of the out-of-sample error, repeat the split/fit/evaluate cycle 20 times:
set.seed(123)
stats <- NULL
for (i in 1:20) {
indices <- sample(1:nrow(cars_19), size = 0.75 * nrow(cars_19))
train_tmp <- m[indices, ]
test_tmp <- m[-indices, ]
nn_tmp <- neuralnet(f,
data = train_tmp,
hidden = c(7, 3),
linear.output = TRUE)
pred_nn_tmp <- predict(nn_tmp, test_tmp)
yhat <- pred_nn_tmp * (max(cars_19$fuel_economy_combined) - min(cars_19$fuel_economy_combined)) + min(cars_19$fuel_economy_combined)
y <- test_tmp[, 28] * (max(cars_19$fuel_economy_combined) - min(cars_19$fuel_economy_combined)) + min(cars_19$fuel_economy_combined)
stats_tmp <- postResample(yhat, y)
stats <- rbind(stats, stats_tmp)
cat(i, "\n")
}
mean(stats[, 1] ^ 2) #avg mse 4.261991
mean(stats[, 1] ^ 2) ^ .5 #avg rmse 2.064459
colMeans(stats) #ignore the RMSE column: averaging RMSEs directly is not meaningful, so use sqrt(mean(MSE)) above
#RMSE Rsquared MAE
#xxx 0.880502 1.466458
Comparison of RMSE:
svm = .93
gradient boosting = 1.8
random forest = 1.9
neural network = 2.06
decision tree = 2.49
mlr = 2.6
Next, a baseline linear model fit with the tfestimators linear_regressor:
str(cars_19)
'data.frame': 1253 obs. of 12 variables:
$ fuel_economy_combined: int 21 28 21 26 28 11 15 18 17 15 ...
$ eng_disp : num 3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
$ num_cyl : int 6 4 8 4 4 16 8 8 8 8 ...
$ transmission : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
$ num_gears : int 9 6 8 7 8 7 8 8 8 7 ...
$ air_aspired_method : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
$ regen_brake : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
$ batt_capacity_ah : num 4.25 0 0 0 0 0 0 0 0 0 ...
$ drive : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
$ fuel_type : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
$ cyl_deactivate : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
$ variable_valve : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
cols <- feature_columns(
column_numeric(colnames(cars_19[c(2, 3, 5, 8)])),
column_categorical_with_identity("transmission", num_buckets = 7),
column_categorical_with_identity("air_aspired_method", num_buckets = 5),
column_categorical_with_identity("regen_brake", num_buckets = 3),
column_categorical_with_identity("drive", num_buckets = 5),
column_categorical_with_identity("fuel_type", num_buckets = 5),
column_categorical_with_identity("cyl_deactivate", num_buckets = 2),
column_categorical_with_identity("variable_valve", num_buckets = 2)
)
#input_fn for a given subset of data
cars_19_input_fn <- function(data, num_epochs = 1) {
  input_fn(
    data,
    features = colnames(cars_19[c(2:12)]),
    response = "fuel_economy_combined",
    batch_size = 64,
    num_epochs = num_epochs
  )
}
model <- linear_regressor(feature_columns = cols)
set.seed(123)
indices <- sample(1:nrow(cars_19), size = 0.75 * nrow(cars_19))
train <- cars_19[indices, ]
test <- cars_19[-indices, ]
#train model
model %>% train(cars_19_input_fn(train, num_epochs = 1000))
#evaluate model
model %>% evaluate(cars_19_input_fn(test))
#predict
yhat <- model %>% predict(cars_19_input_fn(test))
yhat <- unlist(yhat)
y <- test$fuel_economy_combined
postResample(yhat, y)
RMSE Rsquared MAE
2.5583185 0.7891934 1.9381757
summary(cars_19)
fuel_economy_combined eng_disp num_cyl transmission
Min. :11.00 Min. :1.000 Min. : 3.000 A :301
1st Qu.:19.00 1st Qu.:2.000 1st Qu.: 4.000 AM : 46
Median :23.00 Median :3.000 Median : 6.000 AMS: 87
Mean :23.32 Mean :3.063 Mean : 5.533 CVT: 50
3rd Qu.:26.00 3rd Qu.:3.600 3rd Qu.: 6.000 M :148
Max. :58.00 Max. :8.000 Max. :16.000 SA :555
SCV: 66
num_gears air_aspired_method
Min. : 1.000 Naturally Aspirated :523
1st Qu.: 6.000 Other : 5
Median : 7.000 Supercharged : 55
Mean : 7.111 Turbocharged :663
3rd Qu.: 8.000 Turbocharged+Supercharged: 7
Max. :10.000
regen_brake batt_capacity_ah
No :1194 Min. : 0.0000
Electrical Regen Brake: 57 1st Qu.: 0.0000
Hydraulic Regen Brake : 2 Median : 0.0000
Mean : 0.3618
3rd Qu.: 0.0000
Max. :20.0000
drive cyl_deactivate
2-Wheel Drive, Front :345 Y: 172
2-Wheel Drive, Rear :345 N:1081
4-Wheel Drive :174
All Wheel Drive :349
Part-time 4-Wheel Drive: 40
fuel_type
Diesel, ultra low sulfur (15 ppm, maximum): 28
Gasoline (Mid Grade Unleaded Recommended) : 16
Gasoline (Premium Unleaded Recommended) :298
Gasoline (Premium Unleaded Required) :320
Gasoline (Regular Unleaded Recommended) :591
variable_valve
N: 38
Y:1215
First, an untuned SVM fit with the e1071 package:
set.seed(123)
m_svm_untuned <- svm(formula = fuel_economy_combined ~ .,
                     data = test)
pred_svm_untuned <- predict(m_svm_untuned, test)
yhat <- pred_svm_untuned
y <- test$fuel_economy_combined
svm_stats_untuned <- postResample(yhat, y)
svm_stats_untuned
RMSE Rsquared MAE
2.3296249 0.8324886 1.4964907
hyper_grid <- expand.grid(
cost = 2^seq(-5,5,1),
gamma= 2^seq(-5,5,1)
)
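This grid is 11 x 11 = 121 cost/gamma combinations:

nrow(hyper_grid)
[1] 121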
e <- NULL
for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  m_svm_untuned <- svm(
    formula = fuel_economy_combined ~ .,
    data = train,
    gamma = hyper_grid$gamma[j],
    cost = hyper_grid$cost[j]
  )
  pred_svm_untuned <- predict(
    m_svm_untuned,
    newdata = test
  )
  yhat <- pred_svm_untuned
  y <- test$fuel_economy_combined
  e[j] <- postResample(yhat, y)[1]
  cat(j, "\n")
}
which.min(e) #index of the minimum test RMSE
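A sketch of pulling the winning parameters out of the grid (the values used in the tuned model below):

hyper_grid[which.min(e), ] #gamma = 0.25, cost = 32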
set.seed(123)
m_svm_tuned <- svm(
  formula = fuel_economy_combined ~ .,
  data = test,
  gamma = .25,
  cost = 32,
  scale = TRUE
)
pred_svm_tuned <- predict(m_svm_tuned, test)
yhat <- pred_svm_tuned
y <- test$fuel_economy_combined
svm_stats <- postResample(yhat, y)
svm_stats
RMSE Rsquared MAE
0.9331948 0.9712492 0.7133039
summary(m_svm_tuned)
Call:
svm(formula = fuel_economy_combined ~ ., data = test, gamma = 0.25, cost = 32, scale = TRUE)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 32
gamma: 0.25
epsilon: 0.1
Number of Support Vectors: 232
res <- y - yhat #test set residuals
sum(abs(res) <= 1) / 314
[1] 0.8503185

85% of the predictions are within 1 MPG of the actual values.
tmp[which(abs(res) > svm_stats[1] * 3), ] #which cars have residuals larger than 3 RMSE
Division Carline fuel_economy_combined pred_svm_tuned
641 HYUNDAI MOTOR COMPANY Ioniq 55 49.01012
568 TOYOTA CAMRY XSE 26 22.53976
692 Volkswagen Arteon 4Motion 23 26.45806
984 Volkswagen Atlas 19 22.23552
summary(cars_19)
fuel_economy_combined eng_disp num_cyl transmission
Min. :11.00 Min. :1.000 Min. : 3.000 A :301
1st Qu.:19.00 1st Qu.:2.000 1st Qu.: 4.000 AM : 46
Median :23.00 Median :3.000 Median : 6.000 AMS: 87
Mean :23.32 Mean :3.063 Mean : 5.533 CVT: 50
3rd Qu.:26.00 3rd Qu.:3.600 3rd Qu.: 6.000 M :148
Max. :58.00 Max. :8.000 Max. :16.000 SA :555
SCV: 66
num_gears air_aspired_method
Min. : 1.000 Naturally Aspirated :523
1st Qu.: 6.000 Other : 5
Median : 7.000 Supercharged : 55
Mean : 7.111 Turbocharged :663
3rd Qu.: 8.000 Turbocharged+Supercharged: 7
Max. :10.000
regen_brake batt_capacity_ah
No :1194 Min. : 0.0000
Electrical Regen Brake: 57 1st Qu.: 0.0000
Hydraulic Regen Brake : 2 Median : 0.0000
Mean : 0.3618
3rd Qu.: 0.0000
Max. :20.0000
drive cyl_deactivate
2-Wheel Drive, Front :345 Y: 172
2-Wheel Drive, Rear :345 N:1081
4-Wheel Drive :174
All Wheel Drive :349
Part-time 4-Wheel Drive: 40
fuel_type
Diesel, ultra low sulfur (15 ppm, maximum): 28
Gasoline (Mid Grade Unleaded Recommended) : 16
Gasoline (Premium Unleaded Recommended) :298
Gasoline (Premium Unleaded Required) :320
Gasoline (Regular Unleaded Recommended) :591
variable_valve
N: 38
Y:1215
First, an untuned boosted regression with 1,200 trees, fit with the gbm package:

trees <- 1200
m_boosted_reg_untuned <- gbm(
  formula = fuel_economy_combined ~ .,
  data = train,
  n.trees = trees,
  distribution = "gaussian"
)
> summary(m_boosted_reg_untuned)
var rel.inf
eng_disp eng_disp 41.26273684
batt_capacity_ah batt_capacity_ah 24.53458898
transmission transmission 11.33253784
drive drive 8.59300859
regen_brake regen_brake 8.17877824
air_aspired_method air_aspired_method 2.11397865
num_gears num_gears 1.90999021
fuel_type fuel_type 1.65692562
num_cyl num_cyl 0.22260369
variable_valve variable_valve 0.11043532
cyl_deactivate cyl_deactivate 0.08441602
#evaluate the untuned model on the test set (assumed names)
> pred_boosted_reg_untuned <- predict(m_boosted_reg_untuned, n.trees = trees, newdata = test)
> boosted_stats_untuned <- postResample(pred_boosted_reg_untuned, test$fuel_economy_combined)
> boosted_stats_untuned
RMSE Rsquared MAE
2.4262643 0.8350367 1.7513331
#create hyperparameter grid
hyper_grid <- expand.grid(
  shrinkage = seq(.07, .12, .01),
  interaction.depth = 1:7,
  optimal_trees = 0,
  min_RMSE = 0
)
#grid search
for (i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = fuel_economy_combined ~ .,
    data = train_random, #train_random: assumed to be a randomly shuffled copy of train
    distribution = "gaussian",
    n.trees = 5000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i]
  )
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$train.error)
  hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$train.error))
  cat(i, "\n")
}
The grid before the search (the loop fills in optimal_trees and min_RMSE):
> head(hyper_grid)
shrinkage interaction.depth optimal_trees min_RMSE
1 0.07 1 0 0
2 0.08 1 0 0
3 0.09 1 0 0
4 0.10 1 0 0
5 0.11 1 0 0
6 0.12 1 0 0
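Once the loop has filled in min_RMSE, the best row can be read off; a small sketch (the final model below uses shrinkage = .09 and interaction.depth = 5):

hyper_grid[which.min(hyper_grid$min_RMSE), ]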
> m_boosted_reg <- gbm(
formula = fuel_economy_combined ~ .,
data = train,
n.trees = trees,
distribution = "gaussian",
shrinkage = .09,
cv.folds = 5,
interaction.depth = 5
)
best.iter <- gbm.perf(m_boosted_reg, method = "cv")
> pred_boosted_reg <- predict(m_boosted_reg,n.trees=1183, newdata = test)
> mse_boosted_reg <- RMSE(pred = pred_boosted_reg, obs = test$fuel_economy_combined) ^2
> boosted_stats<-postResample(pred_boosted_reg,test$fuel_economy_combined)
> boosted_stats
RMSE Rsquared MAE
1.8018793 0.9092727 1.3334459
> mse_boosted_reg
[1] 3.246769
res <- test$fuel_economy_combined - pred_boosted_reg #test set residuals
> summary(res)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.40000 -0.90000 0.00000 0.07643 1.10000 9.10000
> tmp[which(abs(res) > boosted_stats[1] * 3), ]
Division Carline fuel_economy_combined pred_boosted_reg
642 HYUNDAI MOTOR COMPANY Ioniq Blue 58 48.5
482 KIA MOTORS CORPORATION Forte FE 35 28.7
39 Lamborghini Aventador Coupe 11 17.2
40 Lamborghini Aventador Roadster 11 17.2
summary(cars_19)
fuel_economy_combined eng_disp num_cyl transmission
Min. :11.00 Min. :1.000 Min. : 3.000 A :301
1st Qu.:19.00 1st Qu.:2.000 1st Qu.: 4.000 AM : 46
Median :23.00 Median :3.000 Median : 6.000 AMS: 87
Mean :23.32 Mean :3.063 Mean : 5.533 CVT: 50
3rd Qu.:26.00 3rd Qu.:3.600 3rd Qu.: 6.000 M :148
Max. :58.00 Max. :8.000 Max. :16.000 SA :555
SCV: 66
num_gears air_aspired_method
Min. : 1.000 Naturally Aspirated :523
1st Qu.: 6.000 Other : 5
Median : 7.000 Supercharged : 55
Mean : 7.111 Turbocharged :663
3rd Qu.: 8.000 Turbocharged+Supercharged: 7
Max. :10.000
regen_brake batt_capacity_ah
No :1194 Min. : 0.0000
Electrical Regen Brake: 57 1st Qu.: 0.0000
Hydraulic Regen Brake : 2 Median : 0.0000
Mean : 0.3618
3rd Qu.: 0.0000
Max. :20.0000
drive cyl_deactivate
2-Wheel Drive, Front :345 Y: 172
2-Wheel Drive, Rear :345 N:1081
4-Wheel Drive :174
All Wheel Drive :349
Part-time 4-Wheel Drive: 40
fuel_type
Diesel, ultra low sulfur (15 ppm, maximum): 28
Gasoline (Mid Grade Unleaded Recommended) : 16
Gasoline (Premium Unleaded Recommended) :298
Gasoline (Premium Unleaded Required) :320
Gasoline (Regular Unleaded Recommended) :591
variable_valve
N: 38
Y:1215
Fitting a multiple linear regression with lm():
Call:
lm(formula = fuel_economy_combined ~ eng_disp + transmission +
num_gears + air_aspired_method + regen_brake + batt_capacity_ah +
drive + fuel_type + cyl_deactivate + variable_valve, data = cars_19)
Residuals:
Min 1Q Median 3Q Max
-12.7880 -1.6012 0.1102 1.6116 17.3181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.05642 0.82585 43.660 < 2e-16 ***
eng_disp -2.79257 0.08579 -32.550 < 2e-16 ***
transmissionAM 2.74053 0.44727 6.127 1.20e-09 ***
transmissionAMS 0.73943 0.34554 2.140 0.032560 *
transmissionCVT 6.83932 0.62652 10.916 < 2e-16 ***
transmissionM 1.08359 0.31706 3.418 0.000652 ***
transmissionSA 0.63231 0.22435 2.818 0.004903 **
transmissionSCV 2.73768 0.40176 6.814 1.48e-11 ***
num_gears 0.21496 0.07389 2.909 0.003691 **
air_aspired_methodOther -2.70781 1.99491 -1.357 0.174916
air_aspired_methodSupercharged -1.62171 0.42210 -3.842 0.000128 ***
air_aspired_methodTurbocharged -1.79047 0.22084 -8.107 1.24e-15 ***
air_aspired_methodTurbocharged+Supercharged -1.68028 1.04031 -1.615 0.106532
regen_brakeElectrical Regen Brake 12.59523 0.90030 13.990 < 2e-16 ***
regen_brakeHydraulic Regen Brake 6.69040 1.94379 3.442 0.000597 ***
batt_capacity_ah -0.47689 0.11838 -4.028 5.96e-05 ***
drive2-Wheel Drive, Rear -2.54806 0.24756 -10.293 < 2e-16 ***
drive4-Wheel Drive -3.14862 0.29649 -10.620 < 2e-16 ***
driveAll Wheel Drive -3.12875 0.22300 -14.030 < 2e-16 ***
drivePart-time 4-Wheel Drive -3.94765 0.46909 -8.415 < 2e-16 ***
fuel_typeGasoline (Mid Grade Unleaded Recommended) -5.54594 0.97450 -5.691 1.58e-08 ***
fuel_typeGasoline (Premium Unleaded Recommended) -5.44412 0.70009 -7.776 1.57e-14 ***
fuel_typeGasoline (Premium Unleaded Required) -6.01955 0.70542 -8.533 < 2e-16 ***
fuel_typeGasoline (Regular Unleaded Recommended) -6.43743 0.68767 -9.361 < 2e-16 ***
cyl_deactivateY 0.52100 0.27109 1.922 0.054851 .
variable_valveY 2.00533 0.59508 3.370 0.000775 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.608 on 1227 degrees of freedom
Multiple R-squared: 0.8104, Adjusted R-squared: 0.8066
F-statistic: 209.8 on 25 and 1227 DF, p-value: < 2.2e-16
Finally, regression trees and random forests, with the rpart and randomForest packages:

#regression tree full
m_reg_tree_full <- rpart(formula = fuel_economy_combined ~ .,
                         data = train,
                         method = "anova")
#regression tree tuned
m_reg_tree_trimmed <- rpart(
  formula = fuel_economy_combined ~ .,
  data = train,
  method = "anova",
  control = list(minsplit = 10, cp = .0005)
)
#rpart.plot(m_reg_tree_full)
plotcp(m_reg_tree_full)
pred_decision_tree_full <- predict(m_reg_tree_full, newdata = test)
mse_tree_full <- RMSE(pred = pred_decision_tree_full, obs = test$fuel_economy_combined) ^2
pred_decision_tree_trimmed <- predict(m_reg_tree_trimmed, newdata = test)
mse_tree_trimmed <- RMSE(pred = pred_decision_tree_trimmed, obs = test$fuel_economy_combined) ^2
plotcp(m_reg_tree_trimmed)
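The decision tree entry (2.49) in the RMSE comparisons presumably comes from an evaluation of the trimmed tree like:

postResample(pred_decision_tree_trimmed, test$fuel_economy_combined)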
#random forest
m_random_forest_full <- randomForest(formula = fuel_economy_combined ~ ., data = train)
predict_random_forest_full <- predict(m_random_forest_full, newdata = test)
mse_random_forest_full <- RMSE(pred = predict_random_forest_full, obs = test$fuel_economy_combined) ^ 2
which.min(m_random_forest_full$mse)
#random forest tuned
m_random_forest <- randomForest(formula = fuel_economy_combined ~ ., data = train, ntree = 250)
plot(m_random_forest)
predict_random_forest <- predict(m_random_forest, newdata = test)
mse_random_forest <- RMSE(pred = predict_random_forest, obs = test$fuel_economy_combined) ^ 2
plot(test$fuel_economy_combined - predict_random_forest, ylab = "residuals", main = "Random Forest")
varImpPlot(m_random_forest)
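For the comparison with the other models, the tuned forest's test RMSE is the square root of the MSE computed above:

sqrt(mse_random_forest) #approximately 1.9, the random forest entry in the RMSE comparisons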