R: Improving Regression Speed with Rcpp and RcppArmadillo

In this post I demonstrate how to speed up multiple linear regression in R by comparing three methods.

The standard built-in R function for regression, lm(), is the slowest of the three. lm.fit() is the bare-bones fitting routine that lm() calls internally; it is substantially faster than lm() but still slow. The fastest of the three is a C++ implementation called from R through Rcpp and RcppArmadillo, the Rcpp interface to the Armadillo C++ linear algebra library.

A 1253 x 26 design matrix (X) is built from the cars_19 dataset, and a simulation is run to compare the three methods.

The cars_19 dataset from previous posts:

str(cars_19)
'data.frame':    1253 obs. of  12 variables:
 $ fuel_economy_combined: int  21 28 21 26 28 11 15 18 17 15 ...
 $ eng_disp             : num  3.5 1.8 4 2 2 8 6.2 6.2 6.2 6.2 ...
 $ num_cyl              : int  6 4 8 4 4 16 8 8 8 8 ...
 $ transmission         : Factor w/ 7 levels "A","AM","AMS",..: 3 2 6 3 6 3 6 6 6 5 ...
 $ num_gears            : int  9 6 8 7 8 7 8 8 8 7 ...
 $ air_aspired_method   : Factor w/ 5 levels "Naturally Aspirated",..: 4 4 4 4 4 4 3 1 3 3 ...
 $ regen_brake          : Factor w/ 3 levels "","Electrical Regen Brake",..: 2 1 1 1 1 1 1 1 1 1 ...
 $ batt_capacity_ah     : num  4.25 0 0 0 0 0 0 0 0 0 ...
 $ drive                : Factor w/ 5 levels "2-Wheel Drive, Front",..: 4 2 2 4 2 4 2 2 2 2 ...
 $ fuel_type            : Factor w/ 5 levels "Diesel, ultra low sulfur (15 ppm, maximum)",..: 4 3 3 5 3 4 4 4 4 4 ...
 $ cyl_deactivate       : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 2 2 1 ...
 $ variable_valve       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...


Each of the three functions, lm(), lm.fit(), and lm_rcpp(), is run 5000 times and the average system time is measured.

The code for the C++ implementation of multiple linear regression using Rcpp and RcppArmadillo is below:

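// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// Returns the OLS coefficient vector for design matrix X and response y.
// [[Rcpp::export]]
arma::vec lm_rcpp(const arma::mat& X, const arma::vec& y)
{
  // Closed-form solution of the normal equations: (X'X)^{-1} X'y
  arma::vec b_hat = (X.t() * X).i() * X.t() * y;
  return b_hat;
}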
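
For reference, the single line of linear algebra in lm_rcpp() is the closed-form ordinary least squares estimator. The derivation below is a standard sketch, not something taken from the original code: it writes the model as y = Xβ + ε, with X the n x p design matrix (1253 x 26 here), and assumes X has full column rank so that X'X can be inverted.

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 \\
X^{\top} X \hat{\beta} &= X^{\top} y \\
\hat{\beta} &= (X^{\top} X)^{-1} X^{\top} y
\end{aligned}
```

The second line is the set of normal equations obtained by setting the gradient of the residual sum of squares to zero; the third line is what `(X.t() * X).i() * X.t() * y` computes.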


 

Multiple linear regression using Rcpp and RcppArmadillo is multiple times faster than the standard R functions lm() and lm.fit()!
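
To make explicit what lm_rcpp() computes, here is a minimal, dependency-free C++ sketch of the same estimator. It is not the benchmarked code and it does not use Armadillo: the `Matrix` alias and the function `ols_normal_equations()` are hypothetical names introduced for illustration. The sketch forms X'X and X'y by hand and solves the normal equations by Gaussian elimination with partial pivoting rather than by an explicit inverse.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical alias for illustration: row-major matrix, n rows x p columns.
// All rows are assumed to have the same length.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not part of Rcpp or Armadillo): ordinary least squares
// via the normal equations (X'X) b = X'y.
std::vector<double> ols_normal_equations(const Matrix& X,
                                         const std::vector<double>& y) {
    const std::size_t n = X.size();
    if (n == 0 || y.size() != n) {
        throw std::invalid_argument("X and y need the same nonzero row count");
    }
    const std::size_t p = X[0].size();

    // Build the augmented system [X'X | X'y], a p x (p + 1) array.
    Matrix A(p, std::vector<double>(p + 1, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            for (std::size_t k = 0; k < p; ++k) A[j][k] += X[i][j] * X[i][k];
            A[j][p] += X[i][j] * y[i];
        }
    }

    // Forward elimination with partial pivoting.
    for (std::size_t col = 0; col < p; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < p; ++r) {
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        }
        if (std::fabs(A[pivot][col]) < 1e-12) {
            throw std::runtime_error("X'X is singular or nearly singular");
        }
        std::swap(A[col], A[pivot]);
        for (std::size_t r = col + 1; r < p; ++r) {
            const double f = A[r][col] / A[col][col];
            for (std::size_t k = col; k <= p; ++k) A[r][k] -= f * A[col][k];
        }
    }

    // Back substitution on the upper-triangular system.
    std::vector<double> b(p, 0.0);
    for (std::size_t i = p; i-- > 0;) {
        double s = A[i][p];
        for (std::size_t k = i + 1; k < p; ++k) s -= A[i][k] * b[k];
        b[i] = s / A[i][i];
    }
    return b;
}
```

Called with an intercept column of ones plus x = 0, 1, 2, 3 and y = 1, 3, 5, 7, the function recovers an intercept of about 1 and a slope of about 2. In practice a tuned library such as Armadillo will be much faster than this loop-based sketch; its only purpose is to show the arithmetic behind the one-line C++ function.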
