R: Birthday Problem

An interesting and classic probability question is the birthday problem.

The birthday problem asks how many individuals must be in one location for there to be a 50% probability that at least two of them share a birthday.

To solve, compute the complement. The probability that no two of n people share a birthday is

P(no match) = (365/365) × (364/365) × ... × ((365 − n + 1)/365)

so the probability that at least two share a birthday is 1 − P(no match).

With just 23 people in one location, there is a 50.7% probability that at least one pair shares a birthday.
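The calculation can be verified with a few lines of R:

```r
# P(at least two of n share a birthday) = 1 - P(no shared birthday)
birthday_prob <- function(n) {
    1 - prod((365 - 0:(n - 1)) / 365)
}

birthday_prob(22)  # just under 0.5
birthday_prob(23)  # ~0.507
```

At n = 22 the probability is still below one half; adding the 23rd person pushes it past 50%.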

R: K-Means Clustering MLB Data

k-means clustering is a useful unsupervised data mining tool that partitions n observations into k groups, allowing a practitioner to segment a dataset.

I play in a fantasy baseball league, and using five offensive variables (R, AVG, HR, RBI, SB), I am going to use k-means clustering to:

1) Determine how many coherent groups there are in Major League Baseball. For example,
is there a power and high-average group? Is there a low-power, high-average, and speed group?

2) Assign players to these groups to determine which players are similar or can act as replacements. I am not using this algorithm to predict how players will perform in 2017.

For a data source I am going to use all MLB offensive players in 2016 who had at least 400 plate appearances, from http://www.baseball-reference.com. This dataset has n = 256 players.

A sample of the data is shown below.

Step 1
How many k groups should I use?

The within-groups sum of squares plot below suggests k = 7 groups is ideal. k = 9 is too many groups for n = 256, and the silhouette plot for k = 9 is poor.
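The within-groups sum of squares plot can be produced along these lines; this is a minimal sketch using simulated stats in place of the scraped Baseball-Reference data (the `mlb` matrix and its columns are placeholders):

```r
set.seed(1)
# Simulated stand-in for the scraped data: 256 players, five scaled
# offensive variables (R, AVG, HR, RBI, SB)
mlb <- scale(matrix(rnorm(256 * 5), ncol = 5,
                    dimnames = list(NULL, c("R", "AVG", "HR", "RBI", "SB"))))

# Within-groups sum of squares for k = 1..15; look for the "elbow"
wss <- sapply(1:15, function(k) kmeans(mlb, centers = k, nstart = 25)$tot.withinss)

plot(1:15, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-groups sum of squares")
```

The `nstart = 25` argument restarts each k-means fit from 25 random initializations and keeps the best, which makes the curve stable.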

Step 2
Is k=7 groups a good solution?

Let's look at a silhouette plot to assess the fit of each cluster and of the overall k = 7 solution.

The average silhouette width of .64 indicates a reasonable structure has been found. Cluster 4, the speed group, has a low silhouette width of .37. I am OK with this, as it is the smallest group and speed is the hardest offensive tool to find in MLB.
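A silhouette plot like the one described can be generated with the cluster package; again a sketch with simulated data standing in for the real stats:

```r
library(cluster)  # silhouette()

set.seed(1)
# Placeholder for the real scaled dataset of 256 players x 5 variables
mlb <- scale(matrix(rnorm(256 * 5), ncol = 5))
fit <- kmeans(mlb, centers = 7, nstart = 25)

sil <- silhouette(fit$cluster, dist(mlb))
plot(sil)                  # per-cluster silhouette widths
mean(sil[, "sil_width"])   # average silhouette width
```

Silhouette widths range from −1 to 1; values near 1 mean an observation sits well inside its cluster, values near 0 mean it lies between clusters.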

Step 3
Calculate group means for k=7 groups

Players classified in cluster 3 are the elite players in MLB. Based on 2016 stats, 31 players make up cluster 3. On average they have the highest AVG, R, RBI, and HR, and the second-highest SB.
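The group means can be computed with aggregate(); a minimal sketch with simulated stats in place of the real dataset (the data frame and its columns are placeholders):

```r
set.seed(1)
# Simulated stand-in for the 2016 stats (column names as in the post)
mlb <- data.frame(R = rnorm(256), AVG = rnorm(256), HR = rnorm(256),
                  RBI = rnorm(256), SB = rnorm(256))
fit <- kmeans(scale(mlb), centers = 7, nstart = 25)

# Mean of each offensive variable within each assigned cluster
group_means <- aggregate(mlb, by = list(cluster = fit$cluster), FUN = mean)
group_means
table(fit$cluster)  # number of players per cluster
```

Note the clustering is run on the scaled variables, while the group means are reported on the original scale so they are readable as stats.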

Adding Hampel Psi Function to robustreg

I am currently working on adding the Hampel psi function to robustreg.

Pure R implementation:

psiHampel <- function(r, a, b, c) {
    psi <- numeric(length(r))
    for (i in seq_along(r)) {
        if (abs(r[i]) <= a) {
            psi[i] <- r[i]
        } else if (abs(r[i]) <= b) {
            psi[i] <- a * sign(r[i])
        } else if (abs(r[i]) <= c) {
            psi[i] <- (a * (c - abs(r[i]))/(c - b)) * sign(r[i])
        } else {
            psi[i] <- 0
        }
    }
    psi
}

Rcpp/C++ implementation:

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector psiHampel_rcpp(NumericVector r, double a, double b, double c)
{
    int n = r.size();
    NumericVector y = clone(r);

    for (int i = 0; i < n; i++) {
        double absr = std::abs(r[i]);
        double sgn = (r[i] > 0) ? 1.0 : -1.0;
        if (absr <= a) {
            y[i] = r[i];
        } else if (absr <= b) {
            y[i] = a * sgn;
        } else if (absr <= c) {
            y[i] = a * (c - absr) / (c - b) * sgn;
        } else {
            y[i] = 0;
        }
    }
    return y;
}

Below is a ggplot2 graph of values a=2, b=4, c=8 where x ranges from -10 to 10:

z <- seq(-10, 10, by = 0.01)
all.equal(psiHampel(z, 2, 4, 8), psiHampel_rcpp(z, 2, 4, 8))
[1] TRUE

Both functions return the same result, although the C++ implementation is much faster.

R: Baseball Pitching Wins Above Replacement and Salary

Are elite baseball pitchers worth their salaries? To investigate this I fitted linear and robust linear models using Baseball-Reference.com's definition of wins above replacement (WAR). The universe of data is all pitchers in 2015 who either started at least 20 games or, as relievers, pitched at least 70 outs, which is 298 players.

The distribution of WAR is below:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -1.790  0.2125  0.9500  1.2190  1.9050  9.2800

Standard deviation: 1.61
MAD: 1.245

Distribution of Pitching WAR

Fitting a linear model to the data:

We can see that there is a linear relationship between salary and WAR, as salary is a significant predictor. The correlation between the two variables is ~.28, but the MSE is 2.4, which indicates the model isn't the best fit. The linear model suggests that each additional $11,883,599 in player salary buys one win above replacement.
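The fit described above can be sketched as follows, using simulated salary and WAR values in place of the actual 2015 dataset (all numbers here are illustrative):

```r
set.seed(1)
# Simulated stand-in for the 2015 data: 298 pitchers, salary in dollars,
# WAR loosely tied to salary
salary <- runif(298, 5e5, 3e7)
war <- 0.5 + salary / 1.2e7 + rnorm(298, sd = 1.5)

fit <- lm(war ~ salary)
summary(fit)$coefficients

1 / coef(fit)["salary"]   # dollars of salary per additional win
mean(residuals(fit)^2)    # MSE
cor(salary, war)          # correlation between salary and WAR
```

The reciprocal of the salary coefficient is how the "dollars per win" figure in the post is obtained from the fitted slope.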

Investigating a robust model:

The robust model is a better fit, as its MSE is significantly lower at 1.47. The correlation between salary and WAR is slightly lower at .277. The advantage of a robust model is that it minimizes the influence of outliers by assigning them lower weights. The robust model suggests that each additional $14,065,063 in player salary buys one win above replacement.

Below is a table of 2015 WAR, 2015 salary, and the upper prediction interval of WAR based on 2015 salary:

The top ten pitchers in 2015 all had better seasons than the upper end of the prediction interval based on their salaries.

R Text Mining: The Wells Report

Word Cloud Wells Report

Story: Patriots Deflated Footballs
Document to be text mined: Wells Report

How to Create Word Cloud using R:

Packages needed:
tm: Text Mining Package
RColorBrewer: ColorBrewer Palettes
wordcloud: Word Clouds

1) Convert PDF to text file using pdftotext

2) Clean the document and remove numbers, punctuation, symbols, and stop words.
 library(tm) #text mining   
 source<-DirSource("~/Text") #save text file(s) here   
 a<-Corpus(source, readerControl=list(reader=readPlain))   
 a <- tm_map(a, content_transformer(removeNumbers))   
 a <- tm_map(a, content_transformer(removePunctuation))   
 a <- tm_map(a, content_transformer(tolower))   
 a<- tm_map(a, stripWhitespace)   
 a[[1]] <- removeWords(a[[1]], stopwords("en"))   
 a<- tm_map(a, stripWhitespace)   
3) Examine the corpus and replace words if necessary. Since the Wells report was written by two parties, a law firm and Exponent, some of the terms were inconsistent. This is how I changed "psig" to "psi", which were used interchangeably in the document:
 a[[1]]<- gsub( "psig" , "psi" , a[[1]])   
 a<- tm_map(a, PlainTextDocument) #necessary after word replacement  
4) Create term document matrix and dataframe of keywords:
 tdm<- TermDocumentMatrix(a, control = list(minWordLength = 3))   
5) Create and format word cloud:
 library(RColorBrewer) #colors wordcloud   
 tdm.m <- as.matrix(tdm)   
 tdm.v <- sort(rowSums(tdm.m),decreasing=TRUE)   
 tdm.d <- data.frame(word = names(tdm.v),freq=tdm.v)   
 pal2 <- brewer.pal(8,"Dark2")   
 png("wells_report.png", width=8, height=8, units='in', res=400)   
 wordcloud(tdm.d$word, tdm.d$freq, scale=c(8,.2), min.freq=5, max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)   
 dev.off() #close the png device   

R Regression: Comparing Speed Using lm() lm.fit() and RCPP

Two common complaints about R are speed and memory usage. Below I compare three methods of performing multiple linear regression.

The built-in R function is lm(), and it is the slowest. A bare-bones R implementation is lm.fit(), which is substantially faster but still slow. The fastest method is to use Rcpp and RcppArmadillo, which wraps the Armadillo C++ linear algebra library.

Using a 31,400 × 4 design matrix, a simulation is run to compare the three methods:

A simulation of 1000 multiple linear regressions using the R function lm() provides the below average system time:

> mean(s_lm)
[1] 0.067614

A simulation of 1000 multiple linear regressions using the R function lm.fit() provides the below average system time:

> mean(s_lmfit)
[1] 0.006888

This is an improvement of nearly 10 times over lm().

A simulation of 1000 multiple linear regressions using a C++ implementation written with Rcpp and RcppArmadillo (code below):
 // [[Rcpp::depends(RcppArmadillo)]]  
 #include <RcppArmadillo.h>  
 using namespace Rcpp;  
 using namespace arma;  
 // [[Rcpp::export]]  
 arma::vec lm_rcpp(arma::mat X, arma::vec y) {  
     // Ordinary least squares via the normal equations  
     arma::vec b_hat = (X.t() * X).i() * X.t() * y;  
     return b_hat;  
 }  

> mean(s_rcpp)
[1] 0.002169

The Rcpp code is 30 times faster than the basic R lm() implementation!
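The R side of the timing comparison can be reproduced along these lines; s_lm, s_lmfit, and s_rcpp above are the vectors of recorded times, and this sketch uses fewer repetitions than the post's 1000:

```r
set.seed(1)
n <- 31400
X <- cbind(1, matrix(rnorm(n * 3), ncol = 3))   # 31400 x 4 design matrix
y <- drop(X %*% c(1, 2, 3, 4) + rnorm(n))

reps <- 20  # the post averages over 1000 runs
t_lm  <- system.time(for (i in 1:reps) lm(y ~ X[, -1]))["elapsed"]
t_fit <- system.time(for (i in 1:reps) lm.fit(X, y))["elapsed"]
c(t_lm, t_fit) / reps        # average seconds per fit

lm.fit(X, y)$coefficients    # same estimates as lm(), far less overhead
```

lm() pays for formula parsing, model-frame construction, and a rich result object on every call; lm.fit() skips all of that and just solves the least-squares problem.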

Robust Regression Package for R

I wrote this package in 2006, when the major statistical software companies did not have a robust regression package available. It has been downloaded over 100k times. Using iteratively reweighted least squares (IRLS), the function calculates the optimal weights to perform M-estimator or bounded-influence regression. It returns robust beta estimates and prints a robust ANOVA table using either the Huber or bisquare function.
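The IRLS idea with Huber weights can be sketched in a few lines of plain R. This is an illustration of the algorithm, not the package's actual code, and all names are illustrative:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[1:5] <- y[1:5] + 15           # contaminate with outliers

X <- cbind(1, x)
b <- solve(t(X) %*% X, t(X) %*% y)   # OLS starting values
k <- 1.345                            # Huber tuning constant

for (iter in 1:50) {
    r <- drop(y - X %*% b)
    s <- mad(r)                       # robust scale estimate
    w <- pmin(1, k * s / abs(r))      # Huber weights: downweight big residuals
    b_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))  # weighted LS step
    if (max(abs(b_new - b)) < 1e-8) break
    b <- b_new
}
drop(b)   # robust intercept and slope, close to the true (2, 3)
```

Each iteration solves a weighted least-squares problem where observations with large residuals get weights below 1, so the outliers pull the fit far less than they would under OLS.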

Recent changes make the structure of the arguments similar to glm() or lm() and speed has dramatically increased.

Download Robust Regression Package for R

*Updated 8/25/2015 version  robustreg_0.1-9
*R functions median(), mad() replaced with faster Rcpp equivalents
*R functions Huber and bisquare replaced with faster Rcpp equivalents

Three Card Poker Game for R

A few years ago I created a comprehensive simulation of the card game three card poker for a client in Las Vegas.  I recently converted the simulation into a simple command line game for R.

Three Card Poker Game for R (Linux)
Three Card Poker Game for R (Windows)

UPDATED 7/16/2015

Regression / Time Series BP Stock Model

The date of the oil spill in the Gulf of Mexico was April 20, 2010. British Petroleum stock declined 54.6% from the close of 4/20/2010 to the close of 6/25/2010, which is 48 trading days. Volume went from 3.8 million shares on April 20th to 93 million on June 25th, with a max of 240 million on June 9th. I am going to fit a simple regression and time series model to predict BP stock price based solely on price and volume.

Below is an ANOVA table of a regression with change in price as the dependent variable and change in volume as the independent variable. The negative sign on change in volume indicates that as volume increased, BP stock declined over this time period; the coefficient is statistically significant.


The residuals are autocorrelated, as this is a time series. I am going to fit an AR(2) model to the residuals of the regression. An AR(2) turned out to be the best fit after examining various ARIMA models and GARCH / ARCH models.
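The two-stage fit can be sketched as follows, using a simulated series in place of the actual BP price and volume data:

```r
set.seed(1)
# 48 simulated trading days of changes; the real model used BP's
# price change and volume change
n <- 48
dvol <- rnorm(n)
dprice <- -0.5 * dvol + arima.sim(list(ar = c(0.5, -0.3)), n)

fit <- lm(dprice ~ dvol)                           # stage 1: regression on volume change
ar2 <- arima(residuals(fit), order = c(2, 0, 0))   # stage 2: AR(2) on the residuals
ar2$coef
```

Fitting the AR(2) to the regression residuals captures the serial dependence that the ordinary regression ignores, which is what makes the combined model usable for short-horizon prediction.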

To examine the fit of the model, an estimator of variance is needed, built from the actual stock prices and the fitted values. The mean squared error (MSE) is calculated below:
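The calculation is simply the mean squared difference between actual and fitted prices; a sketch (the function name and inputs are illustrative):

```r
# Mean squared error between actual prices and model fitted values
mse <- function(actual, fitted) mean((actual - fitted)^2)

mse(c(30, 29, 31), c(29, 30, 30))  # example: returns 1
```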

For this model, MSE = 2.8.

For a model using only price changes and volume, this is a fairly accurate fit and predictor, although there are a few trouble points. Accuracy could be increased by adding variables to the model.

Disclaimer:  Please note this is a demonstration and for academic use only. 

Benefits and Uses of Statistical Research

Identify Risk or Opportunity
Statistical research and data mining models can be used to identify specific risks or opportunities for a company. Credit card companies use data mining models to identify possible fraudulent transactions or the probability that a consumer will default on a loan or miss a payment. Statistical research can also be used to identify high-quality consumers who minimize borrowing risk and maximize earnings.

Market Segmentation
Statistical research can be used to identify high-quality consumers who are profitable to retain and low-quality consumers who are not. High-quality consumers at risk of canceling a service may respond with higher probability to one marketing strategy than to an alternative. Statistical research and predictive modeling techniques can be used to maximize the retention of quality consumers.

Cross Selling
Companies today collect vast amounts of data about their customers, including demographics and purchasing habits and behavior. Statistical research and data mining models can use this data to identify current consumers who are likely to purchase additional products.

Identify New Markets
Statistical research can be used to identify new markets and opportunities. A properly designed survey and sampling techniques can be used to identify consumers that are likely to purchase a brand new product and whether it would be profitable for a company to bring it to market.

Minimize Variability of a Process
Many times a company may be more concerned about the variability of a response around its mean rather than the actual mean response. Statistical research can be used to ensure homogeneity of product rather than products that are manufactured with different tolerance levels.  As an example, a company manufacturing semiconductors that need to fit into another company's motherboard would want to minimize variability in dimensions and thickness to ensure the products fit and are compatible rather than focus on the mean of product dimensions. 

Efficacy of a Process or Product using Design of Experiments
Proper experimental design including randomization, replication, and blocking (if necessary) can determine if a drug, diet, exercise program, etc. is effective versus another.  Choosing the correct design before the experiment and appropriate factors and interactions to investigate is critical.  Types of designs include completely randomized designs (CRD), CRD with blocking, split plot designs, full and partial factorial designs, etc.