Word Cloud Wells Report
Story: Patriots Deflated Footballs
Document to be text mined: Wells Report
How to Create a Word Cloud Using R:
Packages needed:
tm: Text Mining Package
RColorBrewer: ColorBrewer Palettes
wordcloud: Word Clouds
1) Convert PDF to text file using pdftotext
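The conversion can also be run from inside R. This is only a sketch: it assumes the pdftotext command-line tool is installed and on the PATH, and the file names and the helper pdftotext_args are hypothetical placeholders, not part of the original workflow.

```r
# Hypothetical paths; adjust to wherever the PDF and the text folder live.
pdf_file <- path.expand("~/wells_report.pdf")
txt_file <- path.expand("~/Text/wells_report.txt")

# Build the command-line arguments: pdftotext <input.pdf> <output.txt>
# shQuote() protects paths that contain spaces.
pdftotext_args <- function(pdf, txt) {
  c(shQuote(pdf), shQuote(txt))
}

# Only run the conversion if the tool is installed and the PDF exists.
if (nzchar(Sys.which("pdftotext")) && file.exists(pdf_file)) {
  dir.create(dirname(txt_file), showWarnings = FALSE, recursive = TRUE)
  status <- system2("pdftotext", pdftotext_args(pdf_file, txt_file))
  if (status != 0) warning("pdftotext exited with status ", status)
}
```

Writing the output into ~/Text means the corpus code in the next step picks it up without further changes.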
2) Clean the document and remove numbers, punctuation, symbols, and stop words.
library(tm) #text mining
src <- DirSource("~/Text") #save text file(s) here
a <- Corpus(src, readerControl = list(reader = readPlain))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, content_transformer(tolower))
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, removeWords, stopwords("en")) #applies to every document, not just the first
a <- tm_map(a, stripWhitespace) #collapse the gaps left by removed words
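To see what each cleaning step does to the text, here is a rough base-R imitation applied to a made-up sentence. The function clean_text and its short stop word list are illustrative assumptions; tm's own functions differ in detail (for example, stopwords("en") is a much longer list, and stripWhitespace does not trim the ends).

```r
# Rough base-R imitation of the tm cleaning steps (illustration only).
clean_text <- function(x, stop_words = c("the", "a", "of", "and", "were")) {
  x <- gsub("[[:digit:]]+", "", x)   # like removeNumbers
  x <- gsub("[[:punct:]]+", "", x)   # like removePunctuation
  x <- tolower(x)                    # lower-case everything
  # like removeWords: drop whole words that appear in the stop word list
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:space:]]+", " ", x)  # like stripWhitespace
  trimws(x)                          # extra trim, for display only
}

clean_text("The 11 footballs were measured at 12.5 psi.")
# → "footballs measured at psi"
```

Removing stop words leaves double spaces behind, which is why whitespace is stripped a second time after that step.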
3) Examine the corpus and replace words if necessary. Because the Wells Report was written by two parties, a law firm and Exponent, some of the terms were inconsistent. This is how I changed "psig" to "psi", which were used interchangeably in the document:
a <- tm_map(a, content_transformer(function(x) gsub("psig", "psi", x))) #content_transformer keeps the corpus structure intact after word replacement
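One thing to watch with gsub is that a plain pattern also matches inside longer words. A possible refinement, shown here as a sketch with a hypothetical helper normalize_units, is to anchor the pattern on word boundaries so only the standalone term is replaced.

```r
# Replace the standalone word "psig" with "psi"; \\b marks a word boundary,
# so longer words that merely contain "psig" are left alone.
normalize_units <- function(x) {
  gsub("\\bpsig\\b", "psi", x, perl = TRUE)
}

normalize_units("measured 12.5 psig and 11 psi")
# → "measured 12.5 psi and 11 psi"
```

The same function could be passed to tm_map through content_transformer to apply it across the whole corpus.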
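4) Create a term-document matrix and a data frame of keywords:
tdm <- TermDocumentMatrix(a, control = list(wordLengths = c(3, Inf))) #keep words of 3+ characters
keywords <- Terms(tdm)
count <- rowSums(as.matrix(tdm)) #total occurrences of each term
k <- data.frame(keywords, count)
k <- k[order(-k$count), ] #most frequent first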
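For intuition about what the keyword table contains, the same kind of count can be sketched in base R on a tiny made-up string. The helper count_terms is hypothetical and much cruder than TermDocumentMatrix; it only splits on spaces and drops words shorter than three characters.

```r
# Count word frequencies in a single string (illustration only).
count_terms <- function(text, min_length = 3) {
  words <- strsplit(text, " +")[[1]]          # split on runs of spaces
  words <- words[nchar(words) >= min_length]  # drop very short words
  counts <- table(words)
  k <- data.frame(keywords = names(counts),
                  count = as.vector(counts),
                  stringsAsFactors = FALSE)
  k[order(-k$count), ]                        # most frequent first
}

k_demo <- count_terms("psi footballs psi gauge footballs psi")
k_demo$keywords
# → "psi" "footballs" "gauge"
```

Sorting by descending count puts the terms that will appear largest in the word cloud at the top of the table.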
5) Create and format word cloud:
library(RColorBrewer) #colors wordcloud
library(wordcloud)
tdm.m <- as.matrix(tdm)
tdm.v <- sort(rowSums(tdm.m),decreasing=TRUE)
tdm.d <- data.frame(word = names(tdm.v),freq=tdm.v)
table(tdm.d$freq)
pal2 <- brewer.pal(8,"Dark2")
png("wells_report.png", width=8,height=8, units='in', res=400)
wordcloud(tdm.d$word,tdm.d$freq, scale=c(8,.2),min.freq=5, max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()