May 02, 2018
When we think of analytics, we sometimes think of a time series graph — a glimpse of data over a given period. But that data needs some statistical rigor when we plan to use it in an advanced analytics model.
The time series you see in an analytics solution is great for getting an overall sense of what is happening online, such as the volume of referral search traffic or which sources drive the most site conversions. But a time series used in a statistical model has to meet a stricter standard: its behavior must be consistent enough over time to support a robust forecast model, for example one that predicts influences on sales.
For this post, I will examine how to review a time series for an advanced model using R programming, though you could follow the same concepts in another language such as Python. I will first explain the general process, then explain the corresponding steps in R so you can understand what to expect.
To start, an analyst will need to consider the data type being used. Most time series data will be numerical, in short, measured values recorded at successive points in time.
The dataset can be evaluated for seasonality with a decomposition, after which you can inspect the data with the seasonality removed. A decomposition separates the seasonal pattern and the random error from the main trend.
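As a minimal sketch of that decomposition, the snippet below uses base R's decompose() on a made-up monthly series. The data, the variable names, and the choice of an additive model are all assumptions for illustration, not figures from a real campaign.

```r
# Synthetic monthly series (illustrative data only):
# a linear trend plus a repeating 12-month pattern plus random noise.
set.seed(42)
n_months <- 48
trend    <- seq(100, 147, length.out = n_months)
seasonal <- rep(c(-8, -6, -2, 0, 3, 6, 9, 7, 2, -1, -4, -6),
                times = n_months / 12)
noise    <- rnorm(n_months, sd = 1)
sales    <- ts(trend + seasonal + noise, start = c(2014, 1), frequency = 12)

# Additive classical decomposition: observed = trend + seasonal + random
parts <- decompose(sales, type = "additive")

# Subtract the seasonal component to inspect the data without seasonality
seasonally_adjusted <- sales - parts$seasonal

round(parts$figure, 1)   # the 12 estimated monthly seasonal effects
```

Calling plot(parts) draws the observed, trend, seasonal, and random panels in one chart, which is usually the quickest way to review the result.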
The next step is examining the stationarity of the time series data. Stationarity is a condition in which the mean, the variance, and the autocorrelation of the data stay constant over time. A stationary series has no trend or seasonality, so time itself is excluded as an influence on the data.
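One informal way to see what stationarity means is to compare summary statistics across two halves of a series. The sketch below assumes a simulated random walk with drift, and half_summary() is a hypothetical helper written for this example. Taking first differences with diff() is shown as one common remedy, not the only one.

```r
set.seed(1)
# Illustrative data: a random walk with drift (non-stationary by construction)
walk <- ts(cumsum(rnorm(200, mean = 0.5)))

# Hypothetical helper: mean and variance of the first and second half
half_summary <- function(x) {
  x      <- as.numeric(x)
  mid    <- length(x) %/% 2
  first  <- x[1:mid]
  second <- x[(mid + 1):length(x)]
  c(mean_first = mean(first), mean_second = mean(second),
    var_first  = var(first),  var_second  = var(second))
}

half_summary(walk)        # the two means differ because of the drift
half_summary(diff(walk))  # first differences: the halves look alike
```

This is only an eyeball check. A formal test such as ADF is still needed before relying on the series in a model.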
The statistical test for this condition is the Augmented Dickey-Fuller (ADF) test, which checks whether a time series has a unit root, a property that indicates the series is non-stationary.
The third step is to examine autocorrelation in the data. Autocorrelation is the correlation between a series and lagged copies of itself; a well-specified model should leave no correlation between its residual error terms. In essence, this step addresses how well the time series data can be incorporated into a model without introducing error.
To check the autocorrelation, you examine the autoregressive (AR) and moving average (MA) behavior in the data. Examining these allows the analyst to apply an Autoregressive Integrated Moving Average (ARIMA) model. This ultimately drives the decisions for using the time series data in a forecast, a regression, or an advanced machine learning model. Stationarity and autocorrelation determine the appropriate method to handle the trend or seasonality within a forecast model.
When the autocorrelation review suggests more than one candidate model, you need a way to select the most accurate one. To do this we use the metric Mean Absolute Percentage Error (MAPE), which averages the absolute errors expressed as a percentage of the actual values. The candidate with the lowest MAPE when compared to the original dataset is the one to use for a model.
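To make the MAPE arithmetic concrete, here is a small worked example in base R. The mape() helper and the three-point data sets are hypothetical, chosen so the percentages are easy to verify by hand.

```r
# Hypothetical helper: mean absolute percentage error, in percent.
# Assumes no actual value is zero (MAPE is undefined in that case).
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

actual  <- c(100, 200, 400)
model_a <- c(110, 190, 400)   # absolute errors: 10%, 5%, 0%
model_b <- c(120, 220, 360)   # absolute errors: 20%, 10%, 10%

mape(actual, model_a)   # (10 + 5 + 0) / 3   = 5
mape(actual, model_b)   # (20 + 10 + 10) / 3 = 13.33...
```

Under this metric, model_a would be preferred because its average percentage error is lower.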
Now let’s look at each step as R programming code. Explaining each step in full can get a bit complex.
First, the data needs to be in a particular type of object, a numeric vector. You can import data from an API or a file and create a vector. R has a dedicated function for time series called ts(), which converts that vector into a time series object (class ts) that holds the values along with their time index. Parameters such as start, end, and frequency add detail about the time series data.
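A minimal sketch of that step, assuming 24 made-up monthly click totals that begin in January 2016. The numbers and the variable names are placeholders.

```r
# Illustrative values: 24 monthly click totals (made-up numbers)
clicks <- c(120, 132, 141, 135, 150, 163, 170, 168, 155, 149, 138, 160,
            128, 140, 152, 147, 161, 175, 182, 179, 166, 158, 149, 171)

# start = c(year, period); frequency = observations per year
clicks_ts <- ts(clicks, start = c(2016, 1), frequency = 12)

class(clicks_ts)       # "ts"
frequency(clicks_ts)   # 12
end(clicks_ts)         # 2017 12

# window() extracts a sub-period, here the first quarter of 2017
window(clicks_ts, start = c(2017, 1), end = c(2017, 3))
```

For daily or weekly data the frequency value would change, and the right choice depends on the seasonal cycle you expect in the data.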
The next steps involve packages. A package in R is a bundle of specialized functions, often statistical ones; installed packages are stored in a library, and the library() function loads one into your session.
Many R packages combine different functions, and some depend on other packages to handle data visualization or to provide sample data. So a single statistical analysis can draw on several packages.
To research a package, go to the Comprehensive R Archive Network (CRAN). This Wikipedia-like site lists packages in alphabetical order. Click on a package to reveal its details and a link to download. Each package has a reference manual, a PDF document that explains each function in the package. You can review the document to understand which functions a package provides.
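Before running the later steps, it can help to check which of the packages named in this post are already installed. The sketch below only reports what is missing; the install line is left commented out because it downloads from CRAN.

```r
# Packages named in this post
needed <- c("tseries", "forecast", "MLmetrics")

# requireNamespace() returns TRUE when a package is already installed
installed     <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!installed]

if (length(not_installed) > 0) {
  message("Not installed yet: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)   # uncomment to download from CRAN
}

# Once installed, load a package for the session with library(), e.g.
# library(tseries)
```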
For conducting the ADF test, users can apply the tseries package, invoking the function adf.test(x) where x is a time series object. It tests the null hypothesis that the time series has a unit root.
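Here is a hedged sketch of the ADF step. In the tseries package the function is named adf.test(). The simulated random walk is an assumption chosen because it has a unit root by construction, and the code skips the test if tseries is not installed.

```r
set.seed(123)
# Illustrative data: a random walk (has a unit root by construction)
random_walk <- ts(cumsum(rnorm(200)))

if (requireNamespace("tseries", quietly = TRUE)) {
  # Null hypothesis: the series has a unit root (is non-stationary)
  level_test <- tseries::adf.test(random_walk)
  diff_test  <- tseries::adf.test(diff(random_walk))

  # A small p-value (commonly below 0.05) is evidence against a unit root
  print(level_test$p.value)
  print(diff_test$p.value)
} else {
  message("Install the tseries package to run adf.test()")
}
```

adf.test() may warn that the true p-value is smaller than the one printed. That is expected when the evidence against a unit root is strong, as it often is for a differenced series.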
For the autocorrelation, the functions acf(x) and pacf(x) can be used, where x is the time series being examined. Both ship with base R, and the forecast package adds its own versions, Acf() and Pacf(). The acf graph is typically read to judge the MA order and the pacf graph to judge the AR order. The forecast package also contains a straightforward application of ARIMA. RStudio or Visual Studio can display the resulting acf and pacf graphs for the user to review.
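The sketch below runs acf() and pacf() from base R's stats package on a simulated series and then fits an ARIMA model with the base arima() function. The AR(1) data and the chosen order are assumptions for illustration; with real data the order comes from reading the graphs.

```r
set.seed(7)
# Illustrative data: simulated AR(1) process with coefficient 0.7
x <- arima.sim(model = list(ar = 0.7), n = 500)

# plot = FALSE returns the values instead of drawing the graph
acf_vals  <- acf(x, lag.max = 10, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 10, plot = FALSE)

round(acf_vals$acf[2:4], 2)    # lags 1-3 (index 1 is lag 0)
round(pacf_vals$acf[1:3], 2)   # lags 1-3 (pacf starts at lag 1)

# For an AR(1) series the acf tends to decay gradually while the pacf
# tends to drop off after lag 1, which points to ARIMA(1, 0, 0)
fit <- arima(x, order = c(1, 0, 0))
coef(fit)["ar1"]               # estimate of the simulated 0.7 coefficient
```

Dropping plot = FALSE draws the familiar bar charts with confidence bands. The forecast package also offers auto.arima(), which searches over candidate orders automatically.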
Finally, the MLmetrics package can then be used to apply the MAPE test, passing each candidate model's predicted values and the original data set as parameters.
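As a rough sketch of that comparison, the code below fits two candidate ARIMA models and scores their in-sample fitted values with MLmetrics::MAPE(y_pred, y_true). The simulated data and both model orders are assumptions, and the scoring is skipped if MLmetrics is not installed.

```r
set.seed(99)
# Illustrative data: positive-valued AR(1) series around a level of 100
y <- 100 + arima.sim(model = list(ar = 0.6), n = 200, sd = 2)

# Two candidate models (orders chosen for illustration only)
fit_a <- arima(y, order = c(1, 0, 0))
fit_b <- arima(y, order = c(0, 0, 1))

# In-sample fitted values = observed values minus residuals
pred_a <- as.numeric(y - residuals(fit_a))
pred_b <- as.numeric(y - residuals(fit_b))
actual <- as.numeric(y)

if (requireNamespace("MLmetrics", quietly = TRUE)) {
  # MAPE() here returns a proportion (0.02 means 2%), not a percentage
  scores <- c(a = MLmetrics::MAPE(pred_a, actual),
              b = MLmetrics::MAPE(pred_b, actual))
  print(scores)
  print(names(which.min(scores)))   # candidate with the lowest MAPE
}
```

Scoring on fitted values keeps the example short. Holding back the last few periods and scoring forecasts against them is a stricter check of the same idea.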
After these steps, the time series has been statistically reviewed, with the results available for a regression analysis or a marketing mix model (I’ll explore these models more fully in an upcoming post). These steps are really helpful if the time series data is digital marketing related, such as the daily clicks on an ad, daily conversion rate, or daily reach of a post.
Overall, time series data is not very complex, but understanding the statistics applied to it can start your team on a journey to appreciating advanced models such as regressions, marketing mix models, and deep learning models. That appreciation can keep your business in the race for better technology from data.