SPURIOUS CORRELATION AND ITS DETECTION

Outliers and mean shifts are commonly encountered in time series analysis. The presence of extraordinary events misleads conventional time series analysts, resulting in erroneous conclusions. The impact of these events is often overlooked for lack of a simple yet effective means of incorporating these isolated events. Autobox performs this task and is thus an "effective means".

We first illustrate how unknown events cause model identification to go awry. Standard identification of ARIMA models uses the sample ACF as one of the two vehicles for model identification. The ACF is computed as the ratio of the autocovariance to the variance. An outlier distorts both of these statistics and in effect dampens the ACF, because it inflates the denominator more than the numerator.

ACF = AUTOCOVARIANCE/VARIANCE
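The dampening effect is easy to demonstrate. The sketch below (simulated data, not from Autobox) computes the lag-1 ACF of a strongly autocorrelated AR(1) series before and after a single large outlier is injected; the outlier inflates the variance in the denominator far more than the autocovariance in the numerator, pulling the ratio toward zero.

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation: lag-k autocovariance divided by the variance."""
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    autocov = np.sum(xd[:-lag] * xd[lag:]) / len(x)
    variance = np.sum(xd * xd) / len(x)
    return autocov / variance

rng = np.random.default_rng(0)
n, phi = 200, 0.8
e = rng.normal(size=n)
y = np.empty(n)
y[0] = e[0]
for t in range(1, n):
    y[t] = phi * y[t - 1] + e[t]   # a strongly autocorrelated AR(1) series

clean = acf(y, 1)

# One large outlier inflates the variance (denominator) far more than the
# lag-1 autocovariance (numerator), dampening the estimated ACF.
y_out = y.copy()
y_out[100] += 20.0
contaminated = acf(y_out, 1)
print(clean, contaminated)
```

An analyst looking only at the contaminated ACF would see far weaker autocorrelation than the process actually has, and would under-identify the model.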

In causal models, where the cross-correlation function (CCF) is the tool for identifying the form of the relationship, we have a similar problem. Outliers in either of the two series inflate the variance terms in the CCF's denominator. The net effect is to conclude that the CCF is flat, and therefore that no information from this potential cause variable is useful. It is necessary to check that no isolated event, or sequence of events, has inflated these measures, leading to an "Alice in Wonderland" conclusion where everything appears rosy and "no correlation" is incorrectly concluded. The effect of outliers is identical in both ARIMA and causal models: they inflate the denominator to a larger degree than the numerator, driving the ratio toward zero. If not handled properly, these outliers will lead to an under-identified model.
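The same flattening can be shown for the CCF. In this sketch (simulated series, hypothetical names) the output y responds strongly to the input x at lag 1; a single wild value in y inflates its variance in the denominator and the estimated lag-1 cross-correlation collapses.

```python
import numpy as np

def ccf(x, y, lag):
    """Sample cross-correlation between x at time t and y at time t+lag."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    n = len(x)
    cross_cov = np.sum(xd[: n - lag] * yd[lag:]) / n
    return cross_cov / np.sqrt(np.mean(xd ** 2) * np.mean(yd ** 2))

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)                         # the candidate cause series
y = np.empty(n)
y[0] = rng.normal()
y[1:] = 2.0 * x[:-1] + rng.normal(size=n - 1)  # y responds to x with lag 1

r_clean = ccf(x, y, 1)

# A single wild value in y inflates its variance (in the denominator) far
# more than the lag-1 cross-covariance, flattening the estimated CCF.
y_bad = y.copy()
y_bad[150] += 100.0
r_bad = ccf(x, y_bad, 1)
print(r_clean, r_bad)
```

A flat-looking CCF of this kind is exactly the "no correlation" trap described above: the causal link is real but the statistic no longer shows it.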

Regression estimates are based upon minimizing the vertical sum of squares, i.e. the errors in predicting the Y variable. If unusual values, either in Y or X, are present then the usual estimators may be biased. Non-parametric, meaning distribution-free, methods have been developed to estimate parameters that are minimally affected by unusual values. The modern time series analysis approach yields estimated parameters that are also "robust", but can be modelled parametrically.

A plot of the data.


The two unusual values (the first and last observations) distort the correlation and cloud the true association. The estimated correlation is 0.079.
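A toy version of this effect, with hypothetical numbers rather than the article's actual data: ten observations with a clear linear association, then the first and last values corrupted. The ordinary correlation collapses toward zero even though the underlying relationship is strong.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the data: a clear linear association.
x = np.arange(1.0, 11.0)
y = 2.0 + 0.8 * x + rng.normal(scale=0.3, size=10)
r_clean = np.corrcoef(x, y)[0, 1]

# Corrupt the first and last observations, as in the example above.
y_bad = y.copy()
y_bad[0] += 7.0
y_bad[-1] -= 7.0
r_bad = np.corrcoef(x, y_bad)[0, 1]
print(r_clean, r_bad)   # the corrupted correlation collapses toward zero
```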

The effect of the outliers is to distort the estimated parameters.

Outlier detection, or Intervention Detection, is functionally equivalent to non-parametric regression when we restrict Intervention Detection to pulses. In many cases the pulses are systematic, i.e. occurring every s periods or grouped together in time, so that collectively they violate the Gaussian assumptions. Essentially, a sequence of outliers may individually be insignificant, yet collectively they may indicate non-randomness. Intervention Detection sorts out the nature of the outliers. A consecutive sequence of outliers of approximately the same magnitude and direction is collectively referred to as a step or level shift.
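The pulse-versus-level-shift distinction can be sketched as follows. This is a minimal illustration using robust z-scores on residuals, not Autobox's actual intervention search, which is far more elaborate (it also considers seasonal pulses and local time trends): isolated flagged points are pulses, while a run of consecutive, same-signed flagged points is collectively a step.

```python
import numpy as np

def detect_outliers(resid, z=3.0):
    """Flag residuals more than z robust standard deviations from the median."""
    resid = np.asarray(resid, dtype=float)
    med = np.median(resid)
    mad = 1.4826 * np.median(np.abs(resid - med))  # robust sigma estimate
    return np.flatnonzero(np.abs(resid - med) > z * mad)

def classify(idx, resid):
    """Group flagged points: a run of consecutive, same-signed outliers is
    collectively a step (level shift); an isolated point is a pulse."""
    runs = []
    for i in idx:
        if runs and i == runs[-1][-1] + 1 and np.sign(resid[i]) == np.sign(resid[runs[-1][0]]):
            runs[-1].append(int(i))
        else:
            runs.append([int(i)])
    return [("level shift" if len(r) > 1 else "pulse", r) for r in runs]

rng = np.random.default_rng(2)
e = rng.normal(size=60)        # residuals from some fitted model
e[10] += 9.0                   # an isolated pulse
e[30:34] += 7.0                # four same-sign outliers: really one step
print(classify(detect_outliers(e), e))
```

Note how the four adjacent outliers, none remarkable on its own once grouped, are reported as a single level shift rather than four unrelated pulses.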

In this case two outliers are found, the first at time period 1 and the second at time period 10.

The final equation shows the estimates of the two pulses at time periods 1 and 10. By identifying these interventions we obtain a more accurate representation of the underlying relationship between Consumption (Y) and Income (X).
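The structure of that final equation can be sketched with 0/1 pulse dummies. The series below are hypothetical stand-ins for Income (X) and Consumption (Y), not the article's figures; the point is that adding a dummy regressor for each detected pulse recovers the true slope that the naive fit distorts.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical Income (X) and Consumption (Y) series with a true slope of 0.9.
n = 20
x = 100.0 + np.arange(n) + rng.normal(scale=0.5, size=n)
y = 5.0 + 0.9 * x + rng.normal(scale=0.5, size=n)
y[0] += 10.0    # pulse at time period 1
y[9] -= 10.0    # pulse at time period 10

naive_slope = np.polyfit(x, y, 1)[0]       # contaminated by the pulses

# Add a 0/1 pulse dummy for each detected intervention.
t = np.arange(n)
X = np.column_stack([np.ones(n), x,
                     (t == 0).astype(float), (t == 9).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("naive slope:", naive_slope)
print("adjusted slope:", beta[1], "pulse estimates:", beta[2], beta[3])
```

The dummy coefficients estimate the magnitude and direction of each pulse, while the slope on X is freed from their influence.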

The data.
