QUESTION:

 

What is the effect of autocorrelation on Statistical Process Control charts?

 

ANSWER:

 

Statistical methods can be very useful in summarizing information and very powerful in testing hypotheses. As in all things, there can be drawbacks. Before applying a statistical test one has to validate the assumptions under which the test is valid; it is often possible to remedy a violation and then proceed safely. One of the most frequently violated assumptions is that the observations are independent. Unfortunately, the real world operates in ignorance of many statistical assumptions, which can lead to problems in analysis. The good news is that these problems may be easily overcome, so long as they are recognized and dealt with correctly.

The Value of Information

The assumption of independence of observations implies that the most recent data point contains no more information or value than any other data point, including the first one that was measured. In practice this means that the most recent reading has no special effect on estimating the next reading or measurement; in short, it provides no information about the next reading. If this assumption, i.e. independence of all readings, holds, then the overall mean or average is the best estimate of future values. If however there is some serial or autoprojective (autocorrelation) structure, the best estimate of the next reading will depend on the recently observed values. The exact form of this prediction is based on the observed correlation structure. Stock market prices are an example of an autocorrelated data set. Weather patterns move slowly, thus daily temperatures have been found to be reasonably described by an AR(2) model, which implies that today's temperature is a weighted average of the last two days' temperatures. Try it and see if it doesn't work!
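A small simulation sketch makes the point. Here we assume an AR(1) process with coefficient .7 (simpler than the AR(2) mentioned for temperatures, but the same idea): the one-step-ahead forecast built from the most recent reading beats the overall mean.

```python
import random

random.seed(42)

# Simulate an AR(1) series: y(t) = 10 + .7*(y(t-1) - 10) + a(t).
y = [10.0]
for _ in range(499):
    y.append(10.0 + 0.7 * (y[-1] - 10.0) + random.gauss(0, 1))

mean = sum(y) / len(y)

# One-step-ahead squared errors: the overall mean vs. the AR(1) rule
# that leans on the most recent reading (using the known true parameters).
errs_mean = [(y[t] - mean) ** 2 for t in range(1, len(y))]
errs_ar = [(y[t] - (10.0 + 0.7 * (y[t - 1] - 10.0))) ** 2
           for t in range(1, len(y))]

mse_mean = sum(errs_mean) / len(errs_mean)
mse_ar = sum(errs_ar) / len(errs_ar)
print(mse_mean, mse_ar)
```

The mean-squared error of the AR forecast comes out near 1 (the innovation variance), while the overall mean's error is near the much larger series variance.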

A Hypothetical

The problem seemed straightforward and simple to Joe.

His boss Jill called him into her office and said, "We

took 25 measurements at a fixed interval of time on line #1.

We then made a change in the process and took 25 more

measurements at the next 25 periods. We have two columns of

numbers, fifty readings in all. I want you to test whether there

is a significant difference in these two means."

 

BEFORE AFTER

 

1 9.3514475830 9.8966583638

2 9.5681955479 9.9944607413

. 9.5466960924 10.3626602774

. 9.3969162112 10.3679648003

. 9.7971314258 10.3122677744

. 9.5415274115 10.3294613471

. 9.7555390371 10.5880309702

. 10.0000733326 10.4541222131

. 9.6649460022 10.1278926708

. 9.8526793092 10.2218148549

. 10.0348204485 10.0686794649

. 9.7034438999 10.0273011663

. 9.4997914890 10.3240292077

. 10.0091622391 9.8474173752

. 10.3309466494 9.8520087711

. 10.2246080180 9.7593820424

. 10.0623380360 9.9021653502

. 10.2635470167 9.8952197365

. 9.9144741061 9.9344370445

. 9.8057220172 10.1769312079

. 9.6633186971 10.0315961092

. 9.8584971928 10.3399870500

. 9.9664655952 10.3744613997

. 10.1030273833 10.2208646771

25 10.0938308150 10.5819767510

 

 

 

SERIES STATISTICS IN TERMS OF THE DIFFERENCED AND TRANSFORMED DATA

 

BEFORE AFTER

 

Differencing Orders Applied to Y NONE NONE

Power Transformation (Lambda) on Y 1.00 1.00

Number of Effective Observations n 25 25

Mean of the Series Y 9.840366 10.15967

Variance of the Series Y .7024076E-01 .5511425E-01

Standard Deviation of the Series Y .2650297 .2347642

Standard Error of the Mean s/sqrt(n) .5300595E-01 .4695285E-01

Mean divided by its Standard Error Mean/[s/sqrt(n)] 185.6465 216.3803

 

DIFFERENCE IN TWO MEANS 9.84 - 10.16 = -.3193

POOLED STANDARD DEVIATION sqrt[(24*.070241 + 24*.055114)/48] = .25036

STANDARD ERROR OF DIFFERENCE .25036*sqrt(1/25 + 1/25) = .070811

T VALUE DIFF divided by its Standard Error -.3193/.070811 = -4.509
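The arithmetic of the pooled test can be checked directly from the summary statistics quoted above; a sketch (the rounded inputs reproduce the t value to three decimals):

```python
import math

# Summary statistics quoted in the text (n = 25 in each group)
n1 = n2 = 25
mean_before, mean_after = 9.840366, 10.15967
var_before, var_after = 0.0702408, 0.0551142

diff = mean_before - mean_after
# Pooled variance weights each group's variance by its degrees of freedom
pooled_var = ((n1 - 1) * var_before + (n2 - 1) * var_after) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)
se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
t = diff / se_diff
print(round(pooled_sd, 4), round(se_diff, 4), round(t, 3))
```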

 

Joe went back to his office and pulled his favorite stat

book written by one of his school professors. As he re-read

the material on testing the hypothesis of the difference

between two means he recalled that if the test for normality

was valid within each group then he could use a parametric

approach. Furthermore if he could accept the hypothesis of

constant variance (similar dispersion within the two groups)

he could use the standard test of the equivalence of two

means. He also noted that if the variances (dispersions)

were different he could use the same test but with an

adjusted degrees of freedom. Continuing his readings he

found that if these readings were correlated he would be

better off using the paired-t test approach. The paired

t-test reflected the presence of a correlative structure

between each pair of readings. For example:

 

TIRE WEAR

          POS1      POS2

CAR 1     Y(1,1)    Y(1,2)

CAR 2     Y(2,1)    Y(2,2)

..        ..        ..

CAR 25    Y(25,1)   Y(25,2)

 

Joe had 50 readings but they weren't independent, as they were taken chronologically. But they weren't correlated in the way the textbook pointed out, i.e. Y1 and Y26, Y2 and Y27. Rather, there might be a relationship between Y1 and Y2, Y2 and Y3, etc.

 

The text was fastidious about checking normality (otherwise you would have to use a non-parametric test like Wilcoxon's), rigorous about separating the paired t-test from the standard one, and even seemingly preoccupied with testing constancy of variance before proceeding, but nothing existed whatsoever about his particular problem. Thus assured that he could proceed, because no textbook mentioned what to do with autocorrelated data, Joe whipped out his favorite piece of PC software and went ahead with his test.

 

 

Sometime that night, in a recurring dream, Joe heard a faint whisper: "the residuals/errors have to be N.I.I.D., where D means distributed." Joe said to himself, "I verified the N, meaning normality, and I verified one I, meaning identically, by verifying constant variance, but what is that other I in N.I.I.D.?"

 

The other I that was bothering Joe means independent.

If you have data that is chronological then the degree of

dependence between successive readings is critical as it may

cause you to conclude incorrectly.
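A quick screen for this kind of dependence is the sample lag-1 autocorrelation; a sketch, using an illustrative slowly drifting series (the 2/sqrt(n) band is the usual rough 95% limit for truly independent data):

```python
import math

def lag1_autocorr(y):
    """Sample lag-1 autocorrelation r(1) of a series."""
    n = len(y)
    m = sum(y) / n
    num = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in y)
    return num / den

# A slowly drifting series: successive readings resemble each other,
# so r(1) is large and falls well outside the white-noise warning band.
drift = [9.3, 9.4, 9.5, 9.5, 9.6, 9.7, 9.8, 9.9, 10.0, 10.1,
         10.1, 10.2, 10.2, 10.3, 10.3, 10.2, 10.1, 10.0, 9.9, 9.8]
r1 = lag1_autocorr(drift)
print(r1, 2 / math.sqrt(len(drift)))
```

When |r(1)| exceeds the band, the independence assumption behind the standard two-sample t-test is doubtful.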

 

This example of textbooks totally ignoring time series

data reflects an error of omission that is slow to cure. The

textbook authors cover themselves by stating the assumptions.

Unwashed readers misuse the textbook solution because they

were not told in clear language that the approach did not

apply to their problem. Caveat emptor or caveat Joe

statistician. We attempt here to expand upon simple but

incorrect textbook solutions to cover possible time series

problem sets.

 

This replay studies the impact of non-random error terms on time series data. We created four time series, each of length 50, via simulation. Simulation starts with an error process which is used as the input. This random input is then combined with a known deterministic model to create a realization of a series.

 

For example the series WN was based on the following:

 

We generated 300 normally distributed uncorrelated and independent

samples A(T) and then using the model

 

Y(T) - 10.0000 = A(T)

 

or

 

Y(T) = 10.0000 + A(T)

 

created the Y(T).

 

 

 

A second series AR was based on the following:

 

We generated 300 normally distributed uncorrelated and

independent samples A(T) and then using the model:

 

[Y(T) - 10.0000] [(1- .7B)]**+1 = A(T)

 

created the Y(T).

 

Specifically

 

[Y(T) - 10.0000] [(1- .7B)]**+1 = A(T)

[Y(T)][(1- .7B)]**+1 - [10.0000] [(1- .7B)]**+1 = A(T)

Y(T) - .7*Y(T-1) - 10.0000*(1- .7) = A(T)

Y(T) - .7*Y(T-1) - 3. = A(T)

 

Restating

 

Y(T) = .7*Y(T-1) + 3. + A(T)

 

thus

 

Y(1) = .7*Y(0) + 3. + A(1)

 

and since Y(0) is unknown we will use as a starting condition the

value 0.

 

Y(1) = 0. + 3. + A(1)

 

and then

 

Y(2) = .7*Y(1) + 3. + A(2)

etc.

 

After completion of these 300 we discard the first 250, thereby eliminating the initial-conditions bias of setting Y(0)=0. In this way we generate a series of numbers which is autocorrelated.
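The generation scheme just described can be sketched in a few lines of Python (illustrative; any seed will do):

```python
import random

random.seed(1)

# Generate the AR series as described: y(t) = .7*y(t-1) + 3 + a(t),
# starting at y(0) = 0, generating 300 values, then discarding the
# first 250 to wash out the arbitrary starting condition.
y = []
prev = 0.0
for _ in range(300):
    prev = 0.7 * prev + 3.0 + random.gauss(0, 1)
    y.append(prev)
y = y[250:]   # keep the last 50; the mean should be near 3/(1-.7) = 10

print(sum(y) / len(y))
```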

 

A third series WNL was based on the following:

 

Y(T) - 9.7500 = A(T) + .50 * LINTAT26(T)

 

where LINTAT26(T) = 0 for T = 1, 2, ..., 25
                  = 1 for T = 26, 27, ..., 50

or

 

Y(T) = 9.7500 + A(T) + .50 * LINTAT26(T)

 

A fourth series ARL was based on the following:

 

[Y(T) - 10.0000] [(1- .7B)]**+1 = A(T) + .25 * LINTAT26(T)
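For readers who want to replicate the flavor of the WNL and ARL constructions, here is a small sketch. The innovation standard deviation of .26 is our assumption (the text never states it; it puts the series dispersion near the .30 quoted in the summary statistics), and for simplicity we start the AR deviation at zero rather than discarding a 250-point burn-in as the text does.

```python
import random

random.seed(9)

def lintat26(t):
    """Step dummy: 0 for t = 1..25, 1 for t = 26..50."""
    return 1.0 if t >= 26 else 0.0

# WNL: white noise around 9.75 plus a level shift of .50 at t = 26.
wnl = [9.75 + random.gauss(0, 0.26) + 0.50 * lintat26(t)
       for t in range(1, 51)]

# ARL: a step of size .25 passed, along with the noise, through an
# AR(1) filter with coefficient .7.
arl, dev = [], 0.0
for t in range(1, 51):
    dev = 0.7 * dev + random.gauss(0, 0.26) + 0.25 * lintat26(t)
    arl.append(10.0 + dev)

print(sum(wnl[:25]) / 25, sum(wnl[25:]) / 25)
print(sum(arl[:25]) / 25, sum(arl[25:]) / 25)
```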

 

Let us look at some descriptive statistics of these four series:

First the two statistics summarizing central tendency and dispersion.

 

 

NAME    MEAN     N    STANDARD DEVIATION

WN     10.00    50    1.00
AR     10.00    50    1.00
WNL    10.00    50     .30
ARL    10.00    50     .30

 

 

 

Now a statistic which measures the internal relationship within a

series:

A U T O C O R R E L A T I O N

F U N C T I O N

 

 

LAG WN WNL AR ARL

 

 

1 -.127 .695 .571 .632

2 -.137 .664 .384 .472

3 .240 .681 .345 .422

4 -.084 .548 .140 .228

5 .017 .528 .082 .163

6 -.080 .432 -.017 .061

7 .038 .380 -.017 .045

8 -.016 .352 -.041 .023

 

 

 

Thus we can readily see that the level shift has

contaminated the ACF so that the identification of the

underlying model is more difficult than simply looking for

decays and cutoffs. Now a second statistic which measures

the internal relationship:

 

P A R T I A L

A U T O C O R R E L A T I O N

F U N C T I O N

 

LAG WN WNL AR ARL

 

 

1 -.127 .695 .571 .632

2 -.156 .349 .086 .121

3 .209 .305 .144 .141

4 -.052 -.097 -.186 -.187

5 .067 .017 .027 .030

6 -.153 -.152 -.132 -.118

7 .064 -.008 .102 .105

8 -.072 .007 -.074 -.040
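The sample ACF used in these tables can be computed directly; a sketch for a pure AR(1) with coefficient .7, whose theoretical ACF decays as .7**k:

```python
import random

random.seed(3)

def acf(y, nlags):
    """Sample autocorrelation function r(1)..r(nlags)."""
    n = len(y)
    m = sum(y) / n
    den = sum((v - m) ** 2 for v in y)
    return [sum((y[t] - m) * (y[t - k] - m) for t in range(k, n)) / den
            for k in range(1, nlags + 1)]

# Simulate 300 AR(1) values and keep the last 50, as in the text.
# A level shift superimposed on such a series inflates and flattens
# these estimates, which is what the WNL and ARL columns show.
y, prev = [], 0.0
for _ in range(300):
    prev = 0.7 * prev + random.gauss(0, 1)
    y.append(prev)
r = acf(y[250:], 3)
print([round(v, 3) for v in r])
```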

 

 

The replay illustrates bringing in these four series and

then plotting them. The next step is to test the hypothesis

of the difference between two means:

 

HO: MEAN OF THE FIRST 25 = MEAN OF THE SECOND 25
HA: MEAN OF THE FIRST 25 <> MEAN OF THE SECOND 25

 

A basic statistical test presented in a first course in

Business Statistics is whether or not two means are

statistically significantly different from one another. The

student learns to compute:

 

1. the two means and the two standard deviations

2. test the equivalence of the standard deviations,
   and if equal compute the pooled standard deviation sp

 

and then given the equivalence of the two standard deviations

 

3. test the statistical significance of the observed
   difference between the two means using

Standard Error of the Mean Diff = sp * sqrt[(1/n1) + (1/n2)]

 

This is exactly equal to the linear model test for a

regression where the model is:

 

Y(T) = CONSTANT + A(T) + W0 * LINTAT26(T)

 

note that A(T) must be white noise (normal/Gaussian), and consequently each of the A's must be independent and identically distributed. Let us look at the results of the test of significance for the dummy variable LINTAT26.
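The claimed equivalence is easy to verify numerically: with a 0/1 dummy regressor, the least-squares slope equals the difference of the group means, and its t-ratio equals the pooled two-sample t. A sketch with simulated data (the group means and shift are arbitrary):

```python
import math
import random

random.seed(5)

# Two groups of 25 i.i.d. readings; x is the 0/1 step dummy.
before = [10.0 + random.gauss(0, 1) for _ in range(25)]
after = [10.5 + random.gauss(0, 1) for _ in range(25)]
y = before + after
x = [0.0] * 25 + [1.0] * 25
n = len(y)

# OLS slope, intercept, and slope t-ratio
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((v - xbar) ** 2 for v in x)
slope = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sxx
intercept = ybar - slope * xbar
rss = sum((y[i] - intercept - slope * x[i]) ** 2 for i in range(n))
s2 = rss / (n - 2)
t_slope = slope / math.sqrt(s2 / sxx)

# Classical pooled two-sample t on the same data
m1, m2 = sum(before) / 25, sum(after) / 25
v1 = sum((v - m1) ** 2 for v in before) / 24
v2 = sum((v - m2) ** 2 for v in after) / 24
sp2 = (24 * v1 + 24 * v2) / 48
t_pooled = (m2 - m1) / math.sqrt(sp2 * (1 / 25 + 1 / 25))

print(slope, m2 - m1, t_slope, t_pooled)
```

The residual sum of squares of the dummy regression is exactly the within-group sum of squares, which is why the two t-ratios coincide.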

 

 

In this section we are assuming that the true state of

nature is an error process that is N.I.I.D. All of the

subsequent tests and tables make that assumption. We know

of course what reality is by virtue of the simulation.

 

WN

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.852248 .195795 50.32

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .2955076 .276896 1.067

 

Y(T) = 9.8522

+ X 1(T) [(+ .2955)]

+ A(T)

 

 

WNL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.738417 .280222E-01 347.5

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .5237786 .396294E-01 13.22

 

Y(T) = 9.7384

+ X 1(T) [(+ .5238)]

+ A(T)

 

 

AR

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.614634 .182440 52.70

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .7707375 .258009 2.987

 

Y(T) = 9.6146

+ X 1(T) [(+ .7707)]

+ A(T)

 

 

ARL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 9.840366 .500710E-01 196.5

 

INPUT SERIES X1

 

Lambda Value 1.000000

2 Omega (input) -Factor # 1 0 .3193062 .708110E-01 4.509

 

Y(T) = 9.8404

+ X 1(T) [(+ .3193)]

+ A(T)

 

 

 

Notice that the regression coefficient (.3193062) is the

difference between the two means (second minus first or after

minus before ) and the t-ratio (4.509) is identical to the

test of the hypothesis between the two means.

 

S U M M A R Y

 

 

T WN WNL AR ARL

 

 

1 1.0677 13.22 2.987 4.509

 

 

 

TRUE STATE OF NATURE: NO SIGNIFICANT MOVEMENT

 

 

T WN AR

 

 

1 1.0677 2.987

 

 

TRUE STATE OF NATURE: SIGNIFICANT MOVEMENT

 

 

T WNL ARL

 

 

1 13.22 4.509

 

 

 

Thus the effect of autocorrelated data, in the absence of a significant level shift, is to cause one to reject the null hypothesis more frequently than is warranted: autocorrelated data leads to false positives regarding movement of the mean when no movement exists. If, however, the mean has significantly moved, the presence of autocorrelated data masks the movement in the mean.
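The false-positive half of this conclusion can be demonstrated by Monte Carlo; a sketch, assuming AR(1) errors with coefficient .7 and no shift, versus independent errors (the critical value 2.01 approximates the two-sided 5% point on 48 degrees of freedom):

```python
import math
import random

random.seed(11)

def pooled_t(y):
    """Pooled two-sample t comparing the first and last 25 of 50 readings."""
    a, b = y[:25], y[25:]
    m1, m2 = sum(a) / 25, sum(b) / 25
    v1 = sum((v - m1) ** 2 for v in a) / 24
    v2 = sum((v - m2) ** 2 for v in b) / 24
    sp2 = (24 * v1 + 24 * v2) / 48
    return (m2 - m1) / math.sqrt(sp2 * (1 / 25 + 1 / 25))

def series(phi):
    """50 readings from an AR(phi) process with no mean shift (burn-in discarded)."""
    prev, out = 0.0, []
    for _ in range(300):
        prev = phi * prev + random.gauss(0, 1)
        out.append(prev)
    return out[250:]

reps = 2000
crit = 2.01   # approx. two-sided 5% critical value, 48 degrees of freedom
reject_iid = sum(abs(pooled_t(series(0.0))) > crit for _ in range(reps)) / reps
reject_ar = sum(abs(pooled_t(series(0.7))) > crit for _ in range(reps)) / reps
print(reject_iid, reject_ar)
```

The independent-errors rejection rate lands near the nominal 5%, while the AR(1) rate is several times larger, despite there being no shift in either case.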

 

In this section we are assuming that the true state of nature is an error process with AR(1) structure:

[Y(t) ] [(1- .7B)]**+1 = A(t)

 

note that this model of the errors was found by

empirically studying the first 25 observations of ARL.

 

WN

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 11.48754 1.39198 8.253

2 Autoregressive-Factor # 1 1 -.1590011 .140084 -1.135

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .2344056 .232329 1.009

 

Y(T) = 9.9116

+ X 1(T) [(+ .2344)]

+ A(T) [(1+ .1590B)]**-1

 

 

WNL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 11.29325 1.36428 8.278

2 Autoregressive-Factor # 1 1 -.1586483 .140084 -1.133

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .5150275 .332608E-01 15.48

 

Y(T) = 9.7469

+ X 1(T) [(+ .5150)]

+ A(T) [(1+ .1586B)]**-1

 

 

AR

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 4.307894 1.15223 3.739

2 Autoregressive-Factor # 1 1 .5686867 .119324 4.766

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .3103636 .454120 .6834

 

Y(T) = 9.9879

+ X 1(T) [(+ .3104)]

+ A(T) [(1- .5687B)]**-1

 

 

ARL

 

THE ESTIMATED MODEL PARAMETERS

 

MODEL COMPONENT LAG COEFFICIENT STANDARD T-RATIO

# (BOP) ERROR

 

Lambda Value 1.000000

1 [ (B)/ (B)]Y(T)=CONSTANT 4.323042 1.17370 3.683

2 Autoregressive-Factor # 1 1 .5651501 .119329 4.736

 

INPUT SERIES X1

 

Lambda Value 1.000000

3 Omega (input) -Factor # 2 0 .1946314 .123809 1.572

 

Y(T) = 9.9415

+ X 1(T) [(+ .1946)]

+ A(T) [(1- .5652B)]**-1

 

 

S U M M A R Y

 

 

T WN WNL AR ARL

 

 

1 1.009 15.48 .6834 1.572

 

 

 

TRUE STATE OF NATURE: NO SIGNIFICANT MOVEMENT

 

 

T WN AR

 

 

1 1.009 .6834

 

 

TRUE STATE OF NATURE: SIGNIFICANT MOVEMENT

 

 

T WNL ARL

 

 

1 15.48 1.572

 

 

 

G R A N D S U M M A R Y

 

 

TRUE ERROR STRUCTURE       INDEPENDENT ERRORS    CORRELATED ERRORS
                           ------------------    -----------------

TRUE MEAN                  NO SHIFT    SHIFT     NO SHIFT    SHIFT
                           --------    -----     --------    -----
ASSUME INDEPENDENT           1.068     13.22       2.987     4.509
USE OBSERVED STRUCTURE       1.009     15.48       .6834     1.572

 

Recall that in the case of ARL we commented that, when the model is not re-estimated after each reading, the AR structure could cause a shift in the mean to go undetected, due to the responsiveness of the one-period-out forecasts to each new reading. The replay estimates the AR(1) model for the first 25 readings of ARL and then computes updated forecasts which illustrate the point. We illustrate this last point by using Autobox to estimate the model for ARL using the first 25 observations

 

Y(T) = 9.8950 + A(T) [(1- .5244B)]**-1

 

 

and then forecasting the 26th point. We then

incorporate the actual value for the 26th point and forecast

the 27th point.

 

OUT OF SAMPLE FORECAST VALUES ARL

 

TIME PERIOD

26 27 28 29 30 31 32 33 34 35

A C T U A L S

9.90 9.99 10.4 10.4 10.3 10.3 10.6 10.5 10.1 10.2

 

FORECAST
ORIGIN
25    10.0
26    9.90
27    9.95
28    10.1
29    10.1
30    10.1
31    10.1
32    10.3
33    10.2
34    10.0

 

 

Notice how the forecasts adapt to the level change and thus mask the effect of the level shift at time period 26. This flaw was pointed out by Thomas P. Ryan and is easily remedied by remodeling after each and every new observation has been recorded.
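The adaptive forecasts in the table can be reproduced from the quoted AR(1) model; a sketch (the value 10.0938 is the 25th BEFORE reading, used as the last in-sample observation):

```python
# Fixed AR(1) model fitted to the first 25 readings, as quoted above:
# Y(T) = 9.8950 + A(T)/(1 - .5244B), so the one-step forecast is
# F(T) = 9.8950 + .5244*(Y(T-1) - 9.8950).
mu, phi = 9.8950, 0.5244

# Actuals for periods 26..35 from the out-of-sample table.
actuals = {26: 9.90, 27: 9.99, 28: 10.4, 29: 10.4, 30: 10.3,
           31: 10.3, 32: 10.6, 33: 10.5, 34: 10.1, 35: 10.2}

prev = 10.0938          # the 25th BEFORE reading, the last in-sample value
forecasts = {}
for t in range(26, 36):
    forecasts[t] = mu + phi * (prev - mu)
    prev = actuals[t]   # re-anchor the forecast on each new observation

print({t: round(f, 2) for t, f in forecasts.items()})
```

Because each forecast is re-anchored on the latest observation, the forecasts climb with the shifted level instead of flagging it, which is exactly the masking effect described.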