QUESTION:

 
Most of your competitors market products that do outlier detection and some sort of adjustment. What is the big deal with
 
AUTOBOX in this regard?
 

ANSWER:

 
The answer to this question hinges on how we compute the standard deviation for the purpose of determining statistical significance. Consider that the standard deviation of the series could be used to standardize the series to mean 0 and variance 1. The standard deviation as computed by our competitors is based upon the sum of squares of the values around the MEAN. The formula actually requires the sum of squares around the EXPECTED VALUE, and the MEAN is the expected value if and only if the series is UNCORRELATED, i.e. independent of previous values. AUTOBOX computes the standard deviation correctly and thus correctly assesses unusual values. Consider the series 1,9,1,9,1,9,9,9,1,9. The usual, and in this case incorrect, computation of the standard deviation would cause a failure to detect the anomaly. Replacing the MEAN with the EXPECTED VALUE [Y(t-2)] results in a standard deviation that is smaller (a lot smaller!), and thus the anomaly in the 7th value is immediately identified.
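To make the arithmetic concrete, here is a minimal sketch in plain Python (not the AUTOBOX algorithm itself) contrasting the two standardizations for the outlier series. The only assumption is the one made above: the expected value of Y(t) is Y(t-2).

```python
# A minimal sketch, not the AUTOBOX algorithm: compare deviations from the
# MEAN with deviations from the EXPECTED VALUE Y(t-2) for the outlier series.
import statistics

series = [1, 9, 1, 9, 1, 9, 9, 9, 1, 9]

# Traditional approach: standardize around the mean.
mean = statistics.mean(series)                       # 5.8
sd = statistics.stdev(series)                        # about 4.13
z = [(y - mean) / sd for y in series]
print("max |z| around the mean:", round(max(abs(v) for v in z), 2))  # about 1.16 -> nothing flagged

# Model-based approach: the expected value of Y(t) is Y(t-2).
deviations = [series[t] - series[t - 2] for t in range(2, len(series))]
print("deviations from Y(t-2):", deviations)         # [0, 0, 0, 0, 8, 0, -8, 0]
# Against a background of zeros, the shock at the 7th value (and its echo two
# steps later, since Y(9) is compared to the anomalous Y(7)) is unmistakable.
```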
 
Consider a graphical presentation of the OUTLIER series 1,9,1,9,1,9,9,9,1,9 and reflect on the visually obvious!
 
VIEW OUTLIER SERIES
Now, if we were to compute the standard deviation (4.13) in the traditional way, we would get the following 80% Confidence Limits:
 
 
 
VIEW OUTLIER SERIES
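For completeness, here is a hedged sketch of those traditional limits, taking mean ± 1.2816 standard deviations as the two-sided 80% Gaussian interval (the plotted limits may use a slightly different convention):

```python
# Traditional 80% limits for the outlier series, assuming the Gaussian
# two-sided 80% multiplier of about 1.2816.
import statistics

series = [1, 9, 1, 9, 1, 9, 9, 9, 1, 9]
mean = statistics.mean(series)           # 5.8
sd = statistics.stdev(series)            # about 4.13
z80 = 1.2816

lower, upper = mean - z80 * sd, mean + z80 * sd
print(f"80% limits: {lower:.2f} to {upper:.2f}")                          # roughly 0.51 to 11.09
print("points outside:", [y for y in series if not lower <= y <= upper])  # [] -- the outlier is missed
```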
 
 
 
Consider another example, this time of an INLIER: 1,9,1,9,1,9,5,9,1,9. Reflect on the visually obvious!
 
VIEW INLIER SERIES
 
Now, if we were to compute the standard deviation (3.97) in the traditional way, we would get the following 80% Confidence Limits:
 
VIEW INLIER SERIES
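The same sketch applied to the INLIER series shows why the 5 in the 7th position sails through the mean-based limits yet is glaring once the expected value Y(t-2) is used; again, this is only an illustrative computation, not the AUTOBOX procedure:

```python
# The INLIER series: the 7th value (5) is comfortably inside the traditional
# mean-based 80% limits, yet it clearly breaks the Y(t-2) pattern.
import statistics

series = [1, 9, 1, 9, 1, 9, 5, 9, 1, 9]
mean = statistics.mean(series)           # 5.4
sd = statistics.stdev(series)            # about 3.97
z80 = 1.2816                             # assumed Gaussian 80% multiplier

lower, upper = mean - z80 * sd, mean + z80 * sd
print(f"80% limits: {lower:.2f} to {upper:.2f}")        # roughly 0.30 to 10.50
print("is the 5 flagged?", not (lower <= 5 <= upper))   # False -- invisible to the mean-based rule

deviations = [series[t] - series[t - 2] for t in range(2, len(series))]
print("deviations from Y(t-2):", deviations)            # [0, 0, 0, 0, 4, 0, -4, 0]
```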
 
The problem, or opportunity, in both of these cases is to recognize the impact of autocorrelated data on basic statistical calculations, and how inadequately those calculations describe the underlying statistical process when the autocorrelation is ignored. Another way to see the effect of autocorrelation is to recall that the RANGE (max value - min value) is expected to be approximately 6 times the standard deviation. This is only true for Gaussian series with independent values. The true relationship between the RANGE and the STANDARD DEVIATION depends on the autocorrelative structure in the series. The implications are clear! The idea that one can mark off plus or minus k sigma around the mean and use it as a device for stating confidence intervals or limits assumes independence. If one has time series data then, more often than not, one has autocorrelated data, so the simple computation may be unwise and our notion of the range may have to be revised.
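A small simulation illustrates the point; the sample size and the AR(1) coefficient below are arbitrary illustration choices, not anything prescribed by AUTOBOX:

```python
# Sketch: the RANGE / standard deviation ratio depends on the autocorrelation
# structure. Sample size and AR(1) coefficient are arbitrary illustrative choices.
import random
import statistics

random.seed(0)
n = 2000

# Independent Gaussian values.
iid = [random.gauss(0, 1) for _ in range(n)]

# A strongly positively autocorrelated AR(1) series.
phi = 0.95
ar1 = [0.0]
for _ in range(n - 1):
    ar1.append(phi * ar1[-1] + random.gauss(0, 1))

def range_over_sd(x):
    return (max(x) - min(x)) / statistics.stdev(x)

print("independent series  :", round(range_over_sd(iid), 2))  # typically in the 6-to-7 neighborhood for n = 2000
print("autocorrelated AR(1):", round(range_over_sd(ar1), 2))  # typically comes out smaller, showing the dependence
```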
 



 

A RELATED QUESTION:

 
I want to average the 1-second output of a satellite positioning system for 24 hours (86,400 data points). This gives a
 
good estimate of the true position. The question is, how accurate is the resulting average? Recalling my statistics
 
class, I computed the standard deviation of the 86,400 points and divided it by sqrt(86,400) to get the standard error
 
of the mean position. The resulting value is far too optimistic, I think. The measurement errors are partly Gaussian
 
and partly due to a number of biases that change slowly over a period of minutes. So obviously there are not 86,400
 
independent measurements. Question 1: How does one characterize the variance of data with such slowly changing
 
random biases? Question 2: What is the accuracy to be expected from averaging such data over a specified length of
 
time? Thanks, David Hindman dhindman@io.com
 

ANSWER: As you logically concluded, your estimated standard error is overly optimistic. This is because you do not have independent samples: the 86,400 values are a set of autocorrelated values, and thus we must turn to time series analysis. I would think that an ARIMA model, suitably extended to include pulses, level shifts, seasonal pulses and local time trends, might paint a more realistic picture. Care must be taken to ensure that the variance of the errors is constant, or at least not demonstrably non-constant, and furthermore that the estimated parameters are invariant over time. Given these conditions, one might be able to estimate the standard deviation of the constant in the appropriate Transfer Function model. The heart of the matter is that the standard deviation is defined with respect to the EXPECTED VALUE, not necessarily the mean.
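As a rough numerical companion to this answer, the sketch below contrasts the naive standard error (sd / sqrt(N)) with an autocorrelation-adjusted estimate based on the standard formula Var(mean) = (variance / N) * [1 + 2 * sum over k of (1 - k/N) * rho(k)]. The AR(1) process standing in for the slowly drifting biases, its coefficient, the noise scales and the truncation lag are all illustrative assumptions, not the questioner's data and not the ARIMA / Transfer Function modelling described above:

```python
# Sketch: naive vs autocorrelation-adjusted standard error of a 24-hour mean.
# The AR(1) "slow bias" (phi = 0.999), the noise scales, and the truncation lag
# are illustrative assumptions, not a model of the actual GPS error process.
import numpy as np

rng = np.random.default_rng(1)
N = 86_400                            # one value per second for 24 hours

phi = 0.999
shocks = rng.normal(0.0, 0.02, N)
bias = np.empty(N)
bias[0] = shocks[0]
for t in range(1, N):                 # slowly drifting bias
    bias[t] = phi * bias[t - 1] + shocks[t]
x = bias + rng.normal(0.0, 0.5, N)    # plus faster, roughly Gaussian noise

mean = x.mean()
var = x.var(ddof=1)

# Naive standard error: pretends the N values are independent.
se_naive = np.sqrt(var / N)

# Adjusted: Var(mean) ~ (var / N) * [1 + 2 * sum_k (1 - k/N) * rho_k], truncated.
maxlag = 5000
xc = x - mean
rho = np.array([xc[:-k] @ xc[k:] / ((N - k) * var) for k in range(1, maxlag + 1)])
weights = 1.0 - np.arange(1, maxlag + 1) / N
inflation = 1.0 + 2.0 * np.sum(weights * rho)
se_adjusted = np.sqrt(max(inflation, 0.0) * var / N)

print("naive    standard error of the mean:", round(float(se_naive), 5))
print("adjusted standard error of the mean:", round(float(se_adjusted), 5))
# The adjusted value is typically many times the naive one, mirroring the
# questioner's suspicion that sd / sqrt(86,400) is far too optimistic.
```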