Regression / Time Series BP Stock Model

The date of the oil spill in the Gulf of Mexico was April 20, 2010. British Petroleum stock declined 54.6% from the close on 4/20/2010 to the close on 6/25/2010, a span of 48 trading days. Volume went from 3.8 million shares on April 20th to 93 million on June 25th, with a maximum of 240 million on June 9th. I am going to fit a simple regression and time series model to predict the BP stock price based solely on price and volume.

Below is an ANOVA table for a regression with the change in price as the dependent variable and the change in volume as the independent variable. The negative sign on the change-in-volume coefficient indicates that as volume increased, BP's stock price declined over this period, and the effect is statistically significant.

[ANOVA table: regression of daily change in price on daily change in volume]
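A minimal sketch of how this regression could be fit in R, assuming a hypothetical data frame bp holding the 48 trading days of closing prices and volume (the names bp, close, and volume are illustrative, not from the original analysis):

    # Day-over-day changes in price and volume from a hypothetical data frame 'bp'.
    d_price  <- diff(bp$close)            # change in closing price
    d_volume <- diff(bp$volume) / 1e6     # change in volume, in millions of shares

    fit <- lm(d_price ~ d_volume)         # simple linear regression
    summary(fit)                          # coefficient signs and significance
    anova(fit)                            # ANOVA table like the one above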
The residuals are autocorrelated since this is a time series. I am going to fit an AR(2) model to the residuals of the regression. An AR(2) turned out to be the best fit after examining various ARIMA and GARCH/ARCH models.
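Continuing the sketch above, the AR(2) can be fit to the regression residuals with base R's arima(); the object fit carries over from the previous snippet, and the alternative orders shown in the comments are illustrative rather than the full set actually examined.

    # Fit an AR(2) to the regression residuals; order = c(2, 0, 0) means two
    # autoregressive terms, no differencing, and no moving-average terms.
    res <- residuals(fit)
    ar2 <- arima(res, order = c(2, 0, 0))
    ar2

    # Competing specifications can be compared on AIC, for example:
    # AIC(arima(res, order = c(1, 0, 0)))
    # AIC(arima(res, order = c(2, 0, 1)))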

To examine the fit of the model, an estimate of the error variance needs to be created using the actual stock prices and the fitted values. The mean squared error (MSE) is the average of the squared differences between the actual and fitted values: MSE = (1/n) Σ (actual − fitted)².

For this model, MSE = 2.8.
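Continuing the illustrative sketch, the MSE can be computed directly from the actual and fitted values in R (the 2.8 figure above comes from the original analysis, not from this code):

    # Mean squared error between actual and fitted values.
    mse <- function(actual, fitted) mean((actual - fitted)^2)

    # Fitted values of the combined model: the regression fit plus the AR(2)'s
    # one-step-ahead fit to the residuals (residuals(ar2) are the AR(2) innovations).
    fitted_combined <- fitted(fit) + (res - residuals(ar2))
    mse(d_price, fitted_combined)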


Considering the model looks only at price changes and volume, this is a fairly accurate fit and predictor, although there are a few trouble points. Accuracy could be increased by adding more variables to the model.

Disclaimer:  Please note this is a demonstration and for academic use only. 

Benefits and Uses of Statistical Research

Identify Risk or Opportunity
Statistical research and data mining models can be used to identify specific risks or opportunities for a company. Credit card companies use data mining models to identify possibly fraudulent transactions or the probability that a consumer will default on a loan or miss a payment. Statistical research can also be used to identify high quality consumers who will minimize borrowing risk and maximize earnings.

Market Segmentation
Statistical research can be used to identify high quality consumers who are profitable to retain and low quality consumers who are not. High quality consumers who are at risk of canceling a service may respond with higher probability to one marketing strategy than to an alternative. Statistical research and predictive modeling techniques can be used to maximize the retention of quality consumers.

Cross Selling
Companies today collect vast amounts of data about their customers, including demographics, purchasing habits, and behavior. For organizations that maintain and collect such elaborate data, statistical research and data mining models can be used to identify current customers who are likely to purchase additional products.

Identify New Markets
Statistical research can be used to identify new markets and opportunities. A properly designed survey and appropriate sampling techniques can identify consumers who are likely to purchase a brand new product and determine whether it would be profitable for a company to bring it to market.

Minimize Variability of a Process
Many times a company may be more concerned about the variability of a response around its mean than about the mean response itself. Statistical research can be used to ensure homogeneity of product rather than products manufactured to different tolerance levels. As an example, a company manufacturing semiconductors that need to fit into another company's motherboard would want to minimize variability in dimensions and thickness to ensure the products fit and are compatible, rather than focusing only on the mean of the product dimensions.

Efficacy of a Process or Product using Design of Experiments
Proper experimental design, including randomization, replication, and blocking (if necessary), can determine whether one drug, diet, exercise program, etc. is more effective than another. Choosing the correct design before the experiment, along with the appropriate factors and interactions to investigate, is critical. Types of designs include completely randomized designs (CRD), CRD with blocking, split plot designs, full and partial factorial designs, etc.
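As a small, self-contained illustration in R of the kind of analysis these designs lead to, here is a randomized design with blocking analyzed with aov() on simulated data (the treatment labels and effect sizes are invented purely for demonstration):

    # Simulate a blocked design: 3 treatments observed once in each of 4 blocks.
    set.seed(1)
    dat <- expand.grid(treatment = factor(c("A", "B", "C")),
                       block     = factor(1:4))
    dat$response <- rnorm(nrow(dat), mean = 10) +
                    ifelse(dat$treatment == "B", 1.5, 0)   # built-in effect for treatment B

    # ANOVA with blocking: is the treatment effect significant after
    # accounting for block-to-block variation?
    fit_doe <- aov(response ~ treatment + block, data = dat)
    summary(fit_doe)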

Best Statistical Software Package

The best statistical software package is the tool that works best for the user, based on what needs to be done, its cost, and how well it fits with existing software. As with any tool, each package has its own advantages and disadvantages. The major packages I use are R, SPSS, and SAS.

R
R is a free, open source statistical software language that is best used for writing custom statistical programs. It has the steepest learning curve, but once you learn the language it is the most customizable and powerful of the major packages. R has a very large community of statisticians who have written custom add-on packages.

As an example of the power and flexibility of R, I wrote a custom package that performs robust regression using iteratively reweighted least squares (IRLS).
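That package is not reproduced here, but a minimal sketch of the IRLS idea in base R, using Huber weights, might look like the following (the tuning constant and convergence rule are illustrative, not those of the actual package):

    # Iteratively reweighted least squares for robust regression (Huber weights).
    # Illustrative sketch only, not the actual package code.
    irls_huber <- function(x, y, k = 1.345, tol = 1e-8, max_iter = 50) {
      X <- cbind(1, x)                        # design matrix with intercept
      beta <- solve(t(X) %*% X, t(X) %*% y)   # start from ordinary least squares
      for (i in seq_len(max_iter)) {
        r <- as.vector(y - X %*% beta)
        s <- mad(r)                           # robust scale (median absolute deviation)
        w <- ifelse(abs(r / s) <= k, 1, k / abs(r / s))          # Huber weights
        beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))    # weighted least squares step
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
      }
      as.vector(beta)
    }

    # Example: a straight line with one large outlier.
    # x <- 1:20; y <- 2 + 0.5 * x + rnorm(20); y[20] <- y[20] + 15
    # irls_huber(x, y)    # compare with coef(lm(y ~ x))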

The best way to run R is with the Emacs text editor together with ESS (Emacs Speaks Statistics).

Advantages
Free
Fully customizable programming language
Powerful graphics
Interacts well with databases such as MySQL
Can interact with C++ to speed up processes
Runs on Windows, Linux, and Macintosh operating systems

Disadvantages
Steepest learning curve

SPSS
SPSS is a GUI-based statistical software package that is popular in the social sciences. Because it is GUI based, it is good for running quick analyses. SPSS was also recently acquired by IBM, which should help the software in the long run.

Advantages
GUI-based program
Quick descriptive statistics capability
Most popular package in the social sciences
Good for cluster analysis
Runs on Windows, Linux, and Macintosh operating systems

Disadvantages
Requires annual license
Limited statistical procedures vs R or SAS
Some procedures require purchase of add-on modules

PSPP is a free, open source alternative to SPSS that can read and manipulate SPSS data files and perform most of SPSS's procedures.

SAS
SAS is the software used in the statistical analysis of clinical pharmaceutical trials submitted to the FDA. SAS has been in existence since 1976 and was originally developed for use on mainframe computers. As personal computers became popular and faster, SAS was developed for the PC. SAS is very popular for experimental design and ANOVA.

SAS is also a programming language and allows users to write macros, which are custom data steps and procedures.

Advantages
Standard package of pharmaceutical industry
Programming language is flexible although not as flexible/powerful as R
Good for experimental design and ANOVA

Disadvantages
Poor graphics capabilities although they have improved
Requires annual license
Limited statistical procedures vs R although more procedures than SPSS
Some procedures require purchase of add-on components

What is Statistical Consulting?

I am frequently asked, "What is statistical consulting?" Statistical consulting is actually a very broad term, as is the science of statistics. Statistics is defined as the science of making effective use of numerical data. A statistical consultant applies the science of statistics to solve real world problems while also providing qualitative consultation and technical expertise.

The process of statistical consulting varies depending on the research topic at hand. A client needs to determine the research topic to explore. An immediate issue to consider is the collection of data. Does a survey need to be created? How is the data going to be collected? If consumers are going to be surveyed, a properly constructed survey and appropriate randomization techniques are imperative. How many consumers must be surveyed to obtain the desired level of power? An example could be a consumer products company looking to increase sales of a line of toothpaste. The company wants to create a marketing campaign aimed at the groups most likely to purchase the product.

Once consumers are properly surveyed, the data needs to be analyzed. A good consultant needs to be able to work with large datasets, as data storage and management are an extremely important part of the process. After determining the data methods, the consultant needs to be able to use a statistical software package to manipulate the data, make inferences, and draw conclusions. It is important to use the package most appropriate for the organization, as one package may be a better fit based on many factors. After drawing conclusions, the consultant needs to be able to present the results to the company and implement them in its technology process.

Although this process is very quantitative, there is an art component as well, which comes from being able to understand the goal of the client and successfully relate the science behind the results. Cost is always an issue with a study, as is determining its power. It may not be cost effective to survey thousands of consumers, and in situations like this the consultant needs to determine the best solution that maximizes the accuracy of the study.
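To give one concrete flavor of the sample-size question above, R's built-in power calculations can translate a desired power level into a required number of respondents; the effect sizes below are invented purely for illustration.

    # How many consumers per group are needed to detect, with 80% power, an
    # increase in purchase intent from 20% to 25%? (illustrative numbers)
    power.prop.test(p1 = 0.20, p2 = 0.25, power = 0.80, sig.level = 0.05)

    # The analogous calculation for a difference in mean ratings:
    power.t.test(delta = 0.3, sd = 1, power = 0.80, sig.level = 0.05)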

Another example is a utility company that needs to know how much natural gas it will need on a daily basis to heat homes in the wintertime. If the company has too much supply, it needs to be stored, which is an added cost; too little supply and a shortage is possible. An optimal amount needs to be determined based on factors such as temperature, wind speed, and average household usage.

I've worked on many different types of projects in many different industries but specialize in building predictive models for decision making processes. A predictive model is a mathematical or statistical formula or process that takes a variety of inputs and allows an end user to make predictions about an event, product, or situation under certain assumptions. Predictive modeling allows a company to maximize or minimize a process and identify both opportunity and risk. Often, it is more important to identify risk than opportunity.