QUESTION:

 

What is the effect of autocorrelation on Statistical Process Control charts?

 

ANSWER:

 

Statistical methods can be very useful in summarizing information and very powerful in testing hypotheses. As in all things, there can be drawbacks. Before applying a statistical test one has to validate the assumptions under which the test is valid; it is often possible to remedy a violation and then proceed safely. One of the most frequently violated assumptions is that the observations are independent. Unfortunately, the real world operates in ignorance of many statistical assumptions, which can lead to problems in analysis. The good news is that these problems may be easily overcome, so long as they are recognized and dealt with correctly.

The Value of Information

The assumption of independence of observations implies that the most recent data point contains no more information or value than any other data point, including the first one that was measured. In practice this means that the most recent reading has no special effect on estimating the next reading or measurement; in short, it provides no information about the next reading. If this assumption, i.e. independence of all readings, holds, then the overall mean or average is the best estimate of future values. If however there is some serial or autoprojective (autocorrelation) structure, the best estimate of the next reading will depend on the recently observed values. The exact form of this prediction is based on the observed correlation structure. Stock market prices are an example of an autocorrelated data set. Weather patterns move slowly, thus daily temperatures have been found to be reasonably described by an AR(2) model, which implies that today's temperature is a weighted average of the last two days' temperatures. Try it and see if it doesn't work!
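A small simulation sketch makes the point. Here we assume an AR(1) process with coefficient .7 (simpler than the AR(2) mentioned for temperatures, but the same idea): the one-step-ahead forecast built from the most recent reading beats the overall mean.

```python
import random

random.seed(42)

# Simulate an AR(1) series: y(t) = 10 + .7*(y(t-1) - 10) + a(t).
y = [10.0]
for _ in range(499):
    y.append(10.0 + 0.7 * (y[-1] - 10.0) + random.gauss(0, 1))

mean = sum(y) / len(y)

# One-step-ahead squared errors: the overall mean vs. the AR(1) rule
# that leans on the most recent reading (using the known true parameters).
errs_mean = [(y[t] - mean) ** 2 for t in range(1, len(y))]
errs_ar = [(y[t] - (10.0 + 0.7 * (y[t - 1] - 10.0))) ** 2
           for t in range(1, len(y))]

mse_mean = sum(errs_mean) / len(errs_mean)
mse_ar = sum(errs_ar) / len(errs_ar)
print(mse_mean, mse_ar)
```

The mean-squared error of the AR forecast comes out near 1 (the innovation variance), while the overall mean's error is near the much larger series variance.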

A Hypothetical

The problem seemed straightforward and simple to Joe.

His boss Jill called him into her office and said, "We

took 25 measurements at a fixed interval of time on line #1.

We then made a change in the process and took 25 more

measurements at the next 25 periods. We have two columns of

numbers, fifty readings in all. I want you to test whether there

is a significant difference in these two means."

 

BEFORE AFTER

 

1 9.3514475830 9.8966583638

2 9.5681955479 9.9944607413

. 9.5466960924 10.3626602774

. 9.3969162112 10.3679648003

. 9.7971314258 10.3122677744

. 9.5415274115 10.3294613471

. 9.7555390371 10.5880309702

. 10.0000733326 10.4541222131

. 9.6649460022 10.1278926708

. 9.8526793092 10.2218148549

. 10.0348204485 10.0686794649

. 9.7034438999 10.0273011663

. 9.4997914890 10.3240292077

. 10.0091622391 9.8474173752

. 10.3309466494 9.8520087711

. 10.2246080180 9.7593820424

. 10.0623380360 9.9021653502

. 10.2635470167 9.8952197365

. 9.9144741061 9.9344370445

. 9.8057220172 10.1769312079

. 9.6633186971 10.0315961092

. 9.8584971928 10.3399870500

. 9.9664655952 10.3744613997

. 10.1030273833 10.2208646771

25 10.0938308150 10.5819767510

 

 

 

SERIES STATISTICS IN TERMS OF THE DIFFERENCED AND TRANSFORMED DATA

 

BEFORE AFTER

 

Differencing Orders Applied to Y NONE NONE

Power Transformation (Lambda) on Y 1.00 1.00

Number of Effective Observations n 25 25

Mean of the Series Y 9.840366 10.15967

Variance of the Series Y .7024076E-01 .5511425E-01

Standard Deviation of the Series Y .2650297 .2347642

Standard Error of the Mean s/sqrt(n) .5300595E-01 .4695285E-01

Mean divided by its Standard Error Mean/[s/sqrt(n)] 185.6465 216.3803

 

DIFFERENCE IN TWO MEANS 9.84 - 10.16 = -.3193

POOLED STANDARD DEVIATION sqrt[(24*.070241 + 24*.055114)/48] = .25036

STANDARD ERROR OF DIFFERENCE .25036*sqrt(1/25 + 1/25) = .070811

T VALUE DIFF divided by its Standard Error -.3193/.070811 = -4.509
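The arithmetic of the pooled test can be checked directly from the summary statistics quoted above; a sketch (the rounded inputs reproduce the t value to three decimals):

```python
import math

# Summary statistics quoted in the text (n = 25 in each group)
n1 = n2 = 25
mean_before, mean_after = 9.840366, 10.15967
var_before, var_after = 0.0702408, 0.0551142

diff = mean_before - mean_after
# Pooled variance weights each group's variance by its degrees of freedom
pooled_var = ((n1 - 1) * var_before + (n2 - 1) * var_after) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)
se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
t = diff / se_diff
print(round(pooled_sd, 4), round(se_diff, 4), round(t, 3))
```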

 

Joe went back to his office and pulled his favorite stat

book written by one of his school professors. As he re-read

the material on testing the hypothesis of the difference

between two means he recalled that if the test for normality

was valid within each group then he could use a parametric

approach. Furthermore if he could accept the hypothesis of

constant variance (similar dispersion within the two groups)

he could use the standard test of the equivalence of two

means. He also noted that if the variances (dispersions)

were different he could use the same test but with an

adjusted degrees of freedom. Continuing his readings he

found that if these readings were correlated he would be

better off using the paired-t test approach. The paired

t-test reflected the presence of a correlative structure

between each pair of readings. For example:

 

TIRE WEAR

          POS1      POS2

CAR 1     Y(1,1)    Y(1,2)

CAR 2     Y(2,1)    Y(2,2)

..        ..        ..

CAR 25    Y(25,1)   Y(25,2)

 

Joe had 50 readings but they weren't independent, as they were taken chronologically. But they weren't correlated in the way the textbook pointed out, i.e. Y1 and Y26, Y2 and Y27. Rather, there might be a relationship between Y1 and Y2, Y2 and Y3, etc.

 

The text was fastidious about checking normality (otherwise you would have to use a non-parametric test like Wilcoxon's), rigorous about separating the paired t-test from the standard one, and even seemingly preoccupied with testing constancy of variance before proceeding, but nothing existed whatsoever about his particular problem. Thus assured that he could proceed, because no textbook mentioned what to do with autocorrelated data, Joe whipped out his favorite piece of PC software and went ahead with his test.

 

 

Sometime that night, in a recurring dream, Joe heard a faint whisper: "the residuals/errors have to be N.I.I.D., where D means distributed." Joe said to himself, "I verified the N, meaning normality, and I verified one I, meaning identically, by verifying constant variance, but what is that other I in N.I.I.D.?"

 

The other I that was bothering Joe means independent.

If you have data that is chronological then the degree of

dependence between successive readings is critical as it may

cause you to conclude incorrectly.
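A quick screen for this kind of dependence is the sample lag-1 autocorrelation; a sketch, using an illustrative slowly drifting series (the 2/sqrt(n) band is the usual rough 95% limit for truly independent data):

```python
import math

def lag1_autocorr(y):
    """Sample lag-1 autocorrelation r(1) of a series."""
    n = len(y)
    m = sum(y) / n
    num = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in y)
    return num / den

# A slowly drifting series: successive readings resemble each other,
# so r(1) is large and falls well outside the white-noise warning band.
drift = [9.3, 9.4, 9.5, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0, 10.1,
         10.1, 10.2, 10.2, 10.3, 10.3, 10.2, 10.1, 10.0, 9.9, 9.8]
r1 = lag1_autocorr(drift)
print(r1, 2 / math.sqrt(len(drift)))
```

When |r(1)| exceeds the band, the independence assumption behind the standard two-sample t-test is doubtful.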

 

This example of textbooks totally ignoring time series

data reflects an error of omission that is slow to cure. The

textbook authors cover themselves by stating the assumptions.

Unwashed readers misuse the textbook solution because they

were not told in clear language that the approach did not

apply to their problem. Caveat emptor or caveat Joe

statistician. We attempt here to expand upon simple but

incorrect textbook solutions to cover possible time series

problem sets.

 

This replay studies the impact of non-random error terms on time series data. We created four time series, each of length 50, via simulation. Simulation starts with an error process which is used as the input. This random input is then combined with a known deterministic model to create a realization of a series.

 

For example the series WN was based on the following:

 

We generated 300 normally distributed uncorrelated and independent

samples A(T) and then using the model

 

Y(T) - 10.0000 = A(T)

 

or

 

Y(T) = 10.0000 + A(T)

 

created the Y(T).

 

 

 

A second series AR was based on the following:

 

We generated 300 normally distributed uncorrelated and

independent samples A(T) and then using the model:

 

[Y(T) - 10.0000] [(1- .7B)]**+1 = A(T)

 

created the Y(T).

 

Specifically

 

[Y(T) - 10.0000] [(1- .7B)]**+1 = A(T)

[Y(T)][(1- .7B)]**+1 - [10.0000] [(1- .7B)]**+1 = A(T)

Y(T) - .7*Y(T-1) - 10.0000*(1- .7) = A(T)

Y(T) - .7*Y(T-1) - 3. = A(T)

 

Restating

 

Y(T) = .7*Y(T-1) + 3. + A(T)

 

thus

 

Y(1) = .7*Y(0) + 3. + A(1)

 

and since Y(0) is unknown we will use as a starting condition the

value 0.

 

Y(1) = 0. + 3. + A(1)

 

and then

 

Y(2) = .7*Y(1) + 3. + A(2)

etc.

 

After completion of these 300 we discard the first 250, thereby eliminating the initial-conditions bias of setting Y(0)=0. In this way we generate a series of numbers which is autocorrelated.
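The generation scheme just described can be sketched in a few lines of Python (illustrative; any seed will do):

```python
import random

random.seed(1)

# Generate the AR series as described: y(t) = .7*y(t-1) + 3 + a(t),
# starting at y(0) = 0, generating 300 values, then discarding the
# first 250 to wash out the arbitrary starting condition.
y = []
prev = 0.0
for _ in range(300):
    prev = 0.7 * prev + 3.0 + random.gauss(0, 1)
    y.append(prev)
y = y[250:]   # keep the last 50; the mean should be near 3/(1-.7) = 10

print(sum(y) / len(y))
```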

 

A third series WNL was based on the following:

 

Y(T) - 9.7500 = A(T) + .50 * LINTAT26(T)

 

where LINTAT26(T) = 0 for T = 1, 2, ..., 25
                  = 1 for T = 26, 27, ..., 50

or

 

Y(T) = 9.7500 + A(T) + .50 * LINTAT26(T)

 

A fourth series ARL was based on the following:

 

[Y(T) - 10.0000] [(1- .7B)]**+1 = A(T) + .25 * LINTAT26(T)
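For readers who want to replicate the flavor of the WNL and ARL constructions, here is a small sketch. The innovation standard deviation of .26 is our assumption (the text never states it; it puts the series dispersion near the .30 quoted in the summary statistics), and for simplicity we start the AR deviation at zero rather than discarding a 250-point burn-in as the text does.

```python
import random

random.seed(9)

def lintat26(t):
    """Step dummy: 0 for t = 1..25, 1 for t = 26..50."""
    return 1.0 if t >= 26 else 0.0

# WNL: white noise around 9.75 plus a level shift of .50 at t = 26.
wnl = [9.75 + random.gauss(0, 0.26) + 0.50 * lintat26(t)
       for t in range(1, 51)]

# ARL: a step of size .25 passed, along with the noise, through an
# AR(1) filter with coefficient .7.
arl, dev = [], 0.0
for t in range(1, 51):
    dev = 0.7 * dev + random.gauss(0, 0.26) + 0.25 * lintat26(t)
    arl.append(10.0 + dev)

print(sum(wnl[:25]) / 25, sum(wnl[25:]) / 25)
print(sum(arl[:25]) / 25, sum(arl[25:]) / 25)
```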

 

Let us look at some descriptive statistics of these four series:

First the two statistics summarizing central tendency and dispersion.

 

 

NAME    MEAN     N    STANDARD DEVIATION

WN     10.00    50    1.00
AR     10.00    50    1.00
WNL    10.00    50     .30
ARL    10.00    50     .30

 

 

 

Now a statistic which measures the internal relationship within a

series:

A U T O C O R R E L A T I O N

F U N C T I O N

 

 

LAG WN WNL AR ARL

 

 

1 -.127 .695 .571 .632

2 -.137 .664 .384 .472

3 .240 .681 .345 .422

4 -.084 .548 .140 .228

5 .017 .528 .082 .163

6 -.080 .432 -.017 .061

7 .038 .380 -.017 .045

8 -.016 .352 -.041 .023

 

 

 

Thus we can readily see that the level shift has

contaminated the ACF so that the identification of the

underlying model is more difficult than simply looking for

decays and cutoffs. Now a second statistic which measures

the internal relationship:

 

P A R T I A L

A U T O C O R R E L A T I O N

F U N C T I O N

 

LAG WN WNL AR ARL

 

 

1 -.127 .695 .571 .632

2 -.156 .349 .086 .121

3 .209 .305 .144 .141

4 -.052 -.097 -.186 -.187

5 .067 .017 .027 .030

6 -.153 -.152 -.132 -.118

7 .064 -.008 .102 .105

8 -.072 .007 -.074 -.040
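The sample ACF used in these tables can be computed directly; a sketch for a pure AR(1) with coefficient .7, whose theoretical ACF decays as .7**k:

```python
import random

random.seed(3)

def acf(y, nlags):
    """Sample autocorrelation function r(1)..r(nlags)."""
    n = len(y)
    m = sum(y) / n
    den = sum((v - m) ** 2 for v in y)
    return [sum((y[t] - m) * (y[t - k] - m) for t in range(k, n)) / den
            for k in range(1, nlags + 1)]

# Simulate 300 AR(1) values and keep the last 50, as in the text.
# A level shift superimposed on such a series inflates and flattens
# these estimates, which is what the WNL and ARL columns show.
y, prev = [], 0.0
for _ in range(300):
    prev = 0.7 * prev + random.gauss(0, 1)
    y.append(prev)
r = acf(y[250:], 3)
print([round(v, 3) for v in r])
```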

 

 

The replay illustrates bringing in these four series and

then plotting them. The next step is to test the hypothesis

of the difference between two means:

 

HO: MEAN OF THE FIRST 25 = MEAN OF THE SECOND 25
HA: MEAN OF THE FIRST 25 <> MEAN OF THE SECOND 25

 

A basic statistical test presented in a first course in

Business Statistics is whether or not two means are

statistically significantly different from one another. The

student learns to compute:

 

1. the two means and the two standard deviations

2. test the equivalence of the standard deviations,
   and if equal compute the pooled standard deviation sp

 

and then given the equivalence of the two standard deviations

 

3. test the statistical significance of the observed
   difference between the two means using

Standard Error of the Mean Diff = sp * sqrt[(1/n1) + (1/n2)]

 

This is exactly equal to the linear model test for a

regression where the model is:

 

Y(T) = CONSTANT + A(T) + W0 * LINTAT26(T)

 

note that A(T) must be white noise (normal/Gaussian), and consequently each of the A's must be independent and identically distributed. Let us look at the results of the test of significance for the dummy variable LINTAT26.
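The claimed equivalence is easy to verify numerically: with a 0/1 dummy regressor, the least-squares slope equals the difference of the group means, and its t-ratio equals the pooled two-sample t. A sketch with simulated data (the group means and shift are arbitrary):

```python
import math
import random

random.seed(5)

# Two groups of 25 i.i.d. readings; x is the 0/1 step dummy.
before = [10.0 + random.gauss(0, 1) for _ in range(25)]
after = [10.5 + random.gauss(0, 1) for _ in range(25)]
y = before + after
x = [0.0] * 25 + [1.0] * 25
n = len(y)

# OLS slope, intercept, and slope t-ratio
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((v - xbar) ** 2 for v in x)
slope = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
intercept = ybar - slope * xbar
rss = sum((y[i] - intercept - slope * x[i]) ** 2 for i in range(n))
s2 = rss / (n - 2)
t_slope = slope / math.sqrt(s2 / sxx)

# Classical pooled two-sample t on the same data
m1, m2 = sum(before) / 25, sum(after) / 25
v1 = sum((v - m1) ** 2 for v in before) / 24
v2 = sum((v - m2) ** 2 for v in after) / 24
sp2 = (24 * v1 + 24 * v2) / 48
t_pooled = (m2 - m1) / math.sqrt(sp2 * (1 / 25 + 1 / 25))

print(slope, m2 - m1, t_slope, t_pooled)
```

The residual sum of squares of the dummy regression is exactly the within-group sum of squares, which is why the two t-ratios coincide.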

 

 

In this section we are assuming that the true state of

nature is an error process that is N.I.I.D. All of the

subsequent tests and tables make that assumption. We know

of course what reality is by virtue of the simulation.

 

WN

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.852248 .195795 50.32

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .2955076 .276896 1.067

 

Y(T) = 9.8522

+ X 1(T) [(+ .2955)]

+ A(T)

 

 

WNL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.738417 .280222E-01 347.5

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .5237786 .396294E-01 13.22

 

Y(T) = 9.7384

+ X 1(T) [(+ .5238)]

+ A(T)

 

 

AR

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.614634 .182440 52.70

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .7707375 .258009 2.987

 

Y(T) = 9.6146

+ X 1(T) [(+ .7707)]

+ A(T)

 

 

ARL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.840366 .500710E-01 196.5

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .3193062 .708110E-01 4.509

 

Y(T) = 9.8404

+ X 1(T) [(+ .3193)]

+ A(T)

 

 

 

Notice that the regression coefficient (.3193062) is the

difference between the two means (second minus first or after

minus before ) and the t-ratio (4.509) is identical to the

test of the hypothesis between the two means.

 

S U M M A R Y

 

 

T WN WNL AR ARL

 

 

1 1.0677 13.22 2.987 4.509

 

 

 

TRUE STATE OF NATURE: NO SIGNIFICANT MOVEMENT

 

 

T WN AR

 

 

1 1.0677 2.987

 

 

TRUE STATE OF NATURE: SIGNIFICANT MOVEMENT

 

 

T WNL ARL

 

 

1 13.22 4.509

 

 

 

Thus the effect of autocorrelated data, in the absence of a significant level shift, is to cause one to reject the null hypothesis more frequently than is warranted: autocorrelated data leads to false positives regarding movement of the mean when no movement exists. If, however, the mean has significantly moved, the presence of autocorrelated data masks the movement in the mean.
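The false-positive half of this conclusion can be demonstrated by Monte Carlo; a sketch, assuming AR(1) errors with coefficient .7 and no shift, versus independent errors (the critical value 2.01 approximates the two-sided 5% point on 48 degrees of freedom):

```python
import math
import random

random.seed(11)

def pooled_t(y):
    """Pooled two-sample t comparing the first and last 25 of 50 readings."""
    a, b = y[:25], y[25:]
    m1, m2 = sum(a) / 25, sum(b) / 25
    v1 = sum((v - m1) ** 2 for v in a) / 24
    v2 = sum((v - m2) ** 2 for v in b) / 24
    sp2 = (24 * v1 + 24 * v2) / 48
    return (m2 - m1) / math.sqrt(sp2 * (1 / 25 + 1 / 25))

def series(phi):
    """50 readings from an AR(phi) process with no mean shift (burn-in discarded)."""
    prev, out = 0.0, []
    for _ in range(300):
        prev = phi * prev + random.gauss(0, 1)
        out.append(prev)
    return out[250:]

reps = 2000
crit = 2.01   # approx. two-sided 5% critical value, 48 degrees of freedom
reject_iid = sum(abs(pooled_t(series(0.0))) > crit for _ in range(reps)) / reps
reject_ar = sum(abs(pooled_t(series(0.7))) > crit for _ in range(reps)) / reps
print(reject_iid, reject_ar)
```

The independent-errors rejection rate lands near the nominal 5%, while the AR(1) rate is several times larger, despite there being no shift in either case.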

 

In this section we are assuming that the true state of nature is an error process with AR(1) structure:

[Y(t) ] [(1- .7B)]**+1 = A(t)

 

note that this model of the errors was found by

empirically studying the first 25 observations of ARL.

 

WN

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 11.48754 1.39198 8.253

2 Autoregressive-Factor # 1 1 -.1590011 .140084 -1.135

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .2344056 .232329 1.009

 

Y(T) = 9.9116

+ X 1(T) [(+ .2344)]

+ A(T) [(1+ .1590B)]**-1

 

 

WNL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 11.29325 1.36428 8.278

2 Autoregressive-Factor # 1 1 -.1586483 .140084 -1.133

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .5150275 .332608E-01 15.48

 

Y(T) = 9.7469

+ X 1(T) [(+ .5150)]

+ A(T) [(1+ .1586B)]**-1

 

 

AR

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 4.307894 1.15223 3.739

2 Autoregressive-Factor # 1 1 .5686867 .119324 4.766

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .3103636 .454120 .6834

 

Y(T) = 9.9879

+ X 1(T) [(+ .3104)]

+ A(T) [(1- .5687B)]**-1

 

 

ARL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 4.323042 1.17370 3.683

2 Autoregressive-Factor # 1 1 .5651501 .119329 4.736

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .1946314 .123809 1.572

 

Y(T) = 9.9415

+ X 1(T) [(+ .1946)]

+ A(T) [(1- .5652B)]**-1

 

 

S U M M A R Y

 

 

T WN WNL AR ARL

 

 

1 1.009 15.48 .6834 1.572

 

 

 

TRUE STATE OF NATURE: NO SIGNIFICANT MOVEMENT

 

 

T WN AR

 

 

1 1.009 .6834

 

 

TRUE STATE OF NATURE: SIGNIFICANT MOVEMENT

 

 

T WNL ARL

 

 

1 15.48 1.572

 

 

 

G R A N D S U M M A R Y

 

 

TRUE ERROR STRUCTURE       INDEPENDENT ERRORS    CORRELATED ERRORS
                           ------------------    -----------------

TRUE MEAN                  NO SHIFT    SHIFT     NO SHIFT    SHIFT
                           --------    -----     --------    -----
ASSUME INDEPENDENT           1.068     13.22       2.987     4.509
USE OBSERVED STRUCTURE       1.009     15.48       .6834     1.572

 

Recall that in the case of ARL we commented that, when the model is not re-estimated after each reading, the AR structure could cause a shift in the mean to go undetected, due to the responsiveness of the one-period-out forecasts to each new reading. The replay estimates the AR(1) model for the first 25 readings of ARL and then computes updated forecasts which illustrate the point. We illustrate this last point by using Autobox to estimate the model for ARL using the first 25 observations

 

Y(T) = 9.8950 + A(T) [(1- .5244B)]**-1

 

 

and then forecasting the 26th point. We then

incorporate the actual value for the 26th point and forecast

the 27th point.

 

OUT OF SAMPLE FORECAST VALUES ARL

 

TIME PERIOD

26 27 28 29 30 31 32 33 34 35

A C T U A L S

9.90 9.99 10.4 10.4 10.3 10.3 10.6 10.5 10.1 10.2

 

FORECAST
ORIGIN
25    10.0
26    9.90
27    9.95
28    10.1
29    10.1
30    10.1
31    10.1
32    10.3
33    10.2
34    10.0

 

 

Notice how the forecasts adapt to the level change and thus mask the effect of the level shift at time period 26. This flaw was pointed out by Thomas P. Ryan and is easily remedied by remodeling after each and every new observation has been recorded.
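The adaptive forecasts in the table can be reproduced from the quoted AR(1) model; a sketch (the value 10.0938 is the 25th BEFORE reading, used as the last in-sample observation):

```python
# Fixed AR(1) model fitted to the first 25 readings, as quoted above:
# Y(T) = 9.8950 + A(T)/(1 - .5244B), so the one-step forecast is
# F(T) = 9.8950 + .5244*(Y(T-1) - 9.8950).
mu, phi = 9.8950, 0.5244

# Actuals for periods 26..35 from the out-of-sample table.
actuals = {26: 9.90, 27: 9.99, 28: 10.4, 29: 10.4, 30: 10.3,
           31: 10.3, 32: 10.6, 33: 10.5, 34: 10.1, 35: 10.2}

prev = 10.0938          # the 25th BEFORE reading, the last in-sample value
forecasts = {}
for t in range(26, 36):
    forecasts[t] = mu + phi * (prev - mu)
    prev = actuals[t]   # re-anchor the forecast on each new observation

print({t: round(f, 2) for t, f in forecasts.items()})
```

Because each forecast is re-anchored on the latest observation, the forecasts climb with the shifted level instead of flagging it, which is exactly the masking effect described.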