Data Science, Machine Learning and Predictive Analytics: 2015

Are elite baseball pitchers worth their salaries? To investigate, I fitted linear and robust linear models using the Baseball-Reference.com definition of wins above replacement (WAR). The data set covers all pitchers in 2015 who started at least 20 games or, as relievers, recorded at least 70 outs, which comes to 298 players.

The distribution of WAR is summarized below:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -1.79  0.2125    0.95   1.219   1.905    9.28

Standard deviation: 1.61
Median absolute deviation (MAD): 1.245
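
These are the kinds of figures that R's `summary()`, `sd()`, and `mad()` report. A minimal sketch, assuming the WAR values sit in a numeric vector named `war` (a hypothetical name; the values below are made up for illustration and are not the 2015 data):

```r
# Hypothetical WAR values for illustration only (not the 2015 data set)
war <- c(-0.5, 0.2, 0.9, 1.1, 1.9, 3.4, 7.6)

summary(war)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
sd(war)       # standard deviation, pulled upward by extreme seasons
mad(war)      # median absolute deviation, scaled by 1.4826 by default
```

A MAD that is smaller than the standard deviation, as in the figures above, is what one would expect when a few very high WAR seasons stretch the right tail.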

Fitting a linear model to the data:

The salary coefficient is statistically significant, so there is a linear relationship between salary and WAR. The relationship is weak, though: the correlation between the two variables is about 0.28, which means salary explains roughly 8% of the variance in WAR, and the MSE is 2.4. The linear model suggests that each additional $11,883,599 in player salary buys one win above a replacement player.
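
The fit described above could be reproduced along these lines. This is a sketch on simulated stand-in data, not the 2015 data set; the data frame `pitchers` and its columns `salary` and `war` are assumed names. It also assumes the dollars-per-win figure is the reciprocal of the salary coefficient, which for $11,883,599 per win corresponds to a slope of roughly 8.4e-8 WAR per dollar.

```r
# Simulated stand-in data; `pitchers`, `salary`, and `war` are assumed names
set.seed(1)
pitchers <- data.frame(salary = runif(298, 5e5, 3e7))
pitchers$war <- 0.5 + pitchers$salary / 1.2e7 + rnorm(298, sd = 1.5)

fit <- lm(war ~ salary, data = pitchers)
summary(fit)$coefficients             # slope estimate and its p-value

mse <- mean(residuals(fit)^2)         # in-sample mean squared error
r <- cor(pitchers$salary, pitchers$war)

# The slope is WAR per dollar, so its reciprocal is dollars per win
dollars_per_win <- 1 / coef(fit)[["salary"]]
dollars_per_win
```

In a one-predictor model the squared correlation equals R-squared, so `r^2` gives the share of WAR variance that salary accounts for.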

Investigating a robust model:

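The robust model reports a substantially lower MSE of 1.47. Because ordinary least squares minimizes the unweighted in-sample squared error by construction, this figure presumably reflects the down-weighted residuals rather than a like-for-like comparison with the linear model. The correlation between salary and WAR is slightly lower at 0.277. The advantage of a robust model is that it limits the influence of outliers by assigning them lower weights. The robust model suggests that each additional $14,065,063 in player salary buys one win above a replacement player.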
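
The post does not say which robust estimator was used. As one possibility, the sketch below assumes `rlm()` from the MASS package (Huber M-estimation by default) and runs on simulated stand-in data with a few artificial outliers. All object and column names are assumptions, and salary is expressed in millions of dollars to keep the fit well conditioned.

```r
library(MASS)  # provides rlm(); choosing it here is an assumption

# Simulated stand-in data with five artificially inflated seasons
set.seed(1)
pitchers <- data.frame(salary_m = runif(298, 0.5, 30))  # salary in $ millions
pitchers$war <- 0.5 + pitchers$salary_m / 12 + rnorm(298, sd = 1.5)
pitchers$war[1:5] <- pitchers$war[1:5] + 8

ols <- lm(war ~ salary_m, data = pitchers)
rob <- rlm(war ~ salary_m, data = pitchers)  # Huber psi by default

# Observations with large residuals receive weights below 1
head(sort(rob$w), 5)

# Unweighted in-sample MSE: OLS is never worse on this measure
c(ols = mean(residuals(ols)^2), robust = mean(residuals(rob)^2))

# Dollars per win implied by each slope (slope is WAR per $1 million)
c(ols = 1e6 / coef(ols)[["salary_m"]], robust = 1e6 / coef(rob)[["salary_m"]])
```

Comparing the two dollars-per-win figures shows how much the outlying seasons pull the ordinary least squares slope.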

Below is a table of 2015 WAR, 2015 salary, and the upper bound of the prediction interval for WAR based on 2015 salary:

The top ten pitchers in 2015 all posted a WAR above the upper bound of the prediction interval implied by their salary.
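
An upper prediction bound of this kind can be obtained from `predict()` with `interval = "prediction"`. The sketch below uses simulated stand-in data and assumes a 95% interval from the ordinary linear model; the post states neither the level nor which of the two models produced the interval, and all names are hypothetical.

```r
# Simulated stand-in data; names are assumptions, not the original objects
set.seed(1)
pitchers <- data.frame(salary = runif(298, 5e5, 3e7))
pitchers$war <- 0.5 + pitchers$salary / 1.2e7 + rnorm(298, sd = 1.5)

fit <- lm(war ~ salary, data = pitchers)

# Columns of the result: fit (point prediction), lwr, upr
pred <- predict(fit, newdata = pitchers, interval = "prediction", level = 0.95)
pitchers$war_upper <- pred[, "upr"]

# Pitchers whose actual WAR beat the upper bound, best seasons first
beat_interval <- pitchers[pitchers$war > pitchers$war_upper, ]
head(beat_interval[order(-beat_interval$war), ], 10)
```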

Word Cloud Wells Report

Story: Patriots Deflated Footballs
Document to be text mined: Wells Report

How to Create Word Cloud using R:

Packages needed:
tm: Text Mining Package
RColorBrewer: ColorBrewer Palettes
wordcloud: Word Clouds

1) Convert PDF to text file using pdftotext
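
`pdftotext` is a command-line tool, but it can be called from R as well. A minimal sketch, assuming the tool is installed and on the PATH, and using hypothetical file locations. The PDF is kept outside `~/Text` so that only the text file ends up in the folder the corpus is read from.

```r
pdf_path <- path.expand("~/wells_report.pdf")       # hypothetical location of the PDF
txt_path <- path.expand("~/Text/wells_report.txt")  # output goes to the corpus folder

# Only run the conversion if the tool and the file are actually present
if (nzchar(Sys.which("pdftotext")) && file.exists(pdf_path)) {
  dir.create(dirname(txt_path), showWarnings = FALSE)
  # -layout asks pdftotext to preserve the physical layout of the page text
  system2("pdftotext", c("-layout", shQuote(pdf_path), shQuote(txt_path)))
}
```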

2) Clean the document and remove numbers, punctuation, symbols, and stop words.

 library(tm) #text mining
 src <- DirSource("~/Text") #save text file(s) here; renamed so it does not mask base::source
 a <- VCorpus(src, readerControl = list(reader = readPlain))
 a <- tm_map(a, removeNumbers)
 a <- tm_map(a, removePunctuation)
 a <- tm_map(a, content_transformer(tolower)) #tolower is a base R function, so it needs wrapping
 a <- tm_map(a, removeWords, stopwords("en")) #applies to every document, not only the first
 a <- tm_map(a, stripWhitespace)
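
To see what the cleaning steps do without touching the real report, the same transformations can be run on a one-sentence corpus built in memory. This is a sketch that assumes the tm package is installed; the sample sentence is invented.

```r
library(tm)

# Invented sample sentence, used only to exercise the cleaning steps
sample_text <- "The footballs measured 11.5 psig; the Patriots' balls were below 12.5 psig."
docs <- VCorpus(VectorSource(sample_text))

docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("en"))
docs <- tm_map(docs, stripWhitespace)

cleaned <- content(docs[[1]])
cleaned
```

Printing `cleaned` after each step is a quick way to check that nothing meaningful is being stripped.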

3) Examine the corpus and replace words if necessary. Since the Wells Report was written by two parties, a law firm and Exponent, some of the terms were inconsistent. This is how I changed "psig" to "psi", which were used interchangeably in the document:

 a <- tm_map(a, content_transformer(gsub), pattern = "psig", replacement = "psi") #documents stay PlainTextDocument objects, so no conversion is necessary afterwards

4) Create term document matrix and dataframe of keywords:

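 tdm <- TermDocumentMatrix(a, control = list(wordLengths = c(3, Inf))) #keep terms of 3+ characters
 count <- rowSums(as.matrix(tdm)) #total frequency of each term
 k <- data.frame(keywords = names(count), count = count)
 k <- k[order(-k$count), ] #most frequent terms first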
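
Step 4 amounts to summing each term's row of the term-document matrix. The sketch below shows this on a two-document toy corpus (invented text, assuming the tm package is installed), along with tm's `findFreqTerms()` as a shortcut for listing terms above a frequency threshold.

```r
library(tm)

# Invented two-document corpus for illustration
texts <- c("psi psi footballs pressure", "footballs pressure gauge psi")
toy <- VCorpus(VectorSource(texts))
toy_tdm <- TermDocumentMatrix(toy)

# Rows are terms, columns are documents; row sums give total counts
freq <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)
freq                                   # "psi" comes first with a count of 3

findFreqTerms(toy_tdm, lowfreq = 2)    # terms appearing at least twice overall
```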

5) Create and format word cloud:

 library(RColorBrewer) #colors wordcloud
 library(wordcloud)
 tdm.m <- as.matrix(tdm)
 tdm.v <- sort(rowSums(tdm.m), decreasing = TRUE)
 tdm.d <- data.frame(word = names(tdm.v), freq = tdm.v)
 table(tdm.d$freq)
 pal2 <- brewer.pal(8, "Dark2")
 png("wells_report.png", width = 8, height = 8, units = "in", res = 400)
 wordcloud(tdm.d$word, tdm.d$freq, scale = c(8, .2), min.freq = 5, max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal2)
 dev.off()
