QUESTION:
Why and how do we cleanse or filter data, or find
the unusual value(s)?
ANSWER: To eliminate values that are not
representative of the process that we are attempting to measure or
describe.
QUESTION:
What does an unusual value do to
model parameters?
ANSWER:
Parameter estimation hinges on finding the solution that minimizes the
error sum of squares. Because each error is squared, a single unusual
value can dominate that sum and pull the estimates toward itself.
Incorrect parameter estimation can be avoided by robust procedures.
However, robust regression deals only with one-time values, i.e. pulses.
In the absence of robust procedures, early efforts in applied statistics
tried to perform preliminary screening, but often reported failure.
General and largely unsatisfactory remarks on unusual values tended to
diminish the magnitude of the problem through errors of omission. Time
series data offer opportunities and pitfalls not found in
cross-sectional data.
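As a rough illustration of the point about least squares, the sketch below fits a straight line to a flat series and then to the same series with one pulse added. The data and the helper name ols_line are made up for this example; they are not taken from any particular package.

```python
def ols_line(y):
    """Least-squares intercept and slope for y against t = 1..n."""
    n = len(y)
    t = range(1, n + 1)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

clean = [10.0] * 10            # hypothetical flat series
pulsed = clean[:-1] + [30.0]   # same series with one pulse in the last period

print(ols_line(clean))    # slope is zero
print(ols_line(pulsed))   # one pulse is enough to produce a positive slope
```

The pulse changes both the intercept and the slope, even though nine of the ten values are identical in the two series.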
QUESTION:
What do we mean by robust?
ANSWER:
Both the outlier (far from expectation) and the
inlier (too usual looking) are exceptional or unusual values, not
necessarily because of their size or magnitude, but because they are
inconsistent with a reasonable prediction.
Thus to determine an unusual value one must first determine what a usual
value is.
When computing a mean and using it to describe or summarize a sequence
of data values, we are advised that the mean may not be truly
representative of "central tendency" as it can be unduly
affected by unusual values. We are told that a more robust estimate of
"central tendency" would be the "median value". As a
case in point, consider the "mean income of baseball players"
and how it is affected by "unusual values".
In a sentence, a "robust estimator" or "a method of
robust estimation"
is one that is relatively "unaffected by violations of the
assumptions" under which the estimation took place.
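The mean-versus-median point can be seen with a handful of numbers. The salaries below are invented for illustration only; the calls are from Python's standard statistics module.

```python
from statistics import mean, median

# Hypothetical salaries in $ millions; the last value is one star's contract.
salaries = [0.5, 0.6, 0.7, 0.8, 0.9, 25.0]

print(mean(salaries))    # dragged upward by the single large contract
print(median(salaries))  # stays close to what a typical player earns
```

Five of the six players earn less than 1, yet the mean is well above that; the median is not.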
What is the standard
approach? Unknowingly, most procedures assume the model or
prediction equation: Y(t) = U + A(t) and
compute the standard deviation of the A's, which is identical to the
standard deviation of the original Y series. They then proclaim that a
central band of acceptable values, i.e. usual values, will fall in the
range of U +/- K*SIGMA,
where U is the mean and SIGMA is the estimated standard
deviation using standard computations. Values outside this range are
considered unusual.
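A minimal sketch of this standard approach follows. The function name k_sigma_flags, the choice of the sample standard deviation, and the data are assumptions made for the example.

```python
from statistics import mean, stdev

def k_sigma_flags(y, k=3.0):
    """Return the 1-based positions of values outside U +/- K*SIGMA."""
    u = mean(y)
    sigma = stdev(y)   # sample standard deviation of the original series
    return [t for t, v in enumerate(y, start=1) if abs(v - u) > k * sigma]

series = [10, 11, 9, 10, 12, 10, 9, 11, 10, 40]   # hypothetical data

print(k_sigma_flags(series, k=2.0))   # the last value falls outside the band
print(k_sigma_flags(series, k=3.0))   # the same value now falls inside it
```

Note that the unusual value itself inflates SIGMA, which is why the verdict here depends so heavily on the choice of K.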
POSSIBLE PROBLEMS (OR OPPORTUNITIES):
We will consider a number of possible circumstances where the actual
data suggest a different underlying model, so that the standard approach
masks unusual values. It is also possible to envision the opposite case,
in which the bad model itself makes values appear unusual. This can
happen when inliers deflate the standard deviation, which narrows the
acceptable range and leads one to discard values that should not have
been discarded.
COUNTER EXAMPLE #1:
Assume that the
observed data arise in such a way that the following model or
prediction equation approximately holds: Y(t)
= b0 + b1*T + A(t), where T is a trend series (1,2,3,4,...,t).
If we use the standard deviation of the original series as the estimate
of the allowable range, we are clearly going to miss a lot of
unusual values. However, if we compute the standard deviation of the
residuals from the expected value, then a much clearer picture of the
allowable range emerges. In practice, we estimate the A's from
Y(t) - (b0 + b1*T) = A(t) and compute the standard
deviation from those values. Perhaps a review of the standard
deviation might help. When the average or mean changes level, some
statistical forecasting tools get confused and declare that a trend is
present. Consider [1]
and compare it to [2].
When shown together
the ambiguity dissolves. Consider another example
of a sales series. Some of the unusual values are distinctly usual.
Aside from the single pulse and the seasonal pulses, the series
could be adequately described as a sequence of level shifts. The
analysis can be summarized visually
and compared or contrasted with a
trend model.
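To make the trend case concrete, the sketch below measures the same pulse twice: once against the standard deviation of the original series and once against the standard deviation of the residuals from a fitted trend. The series and the helper name detrended_residuals are hypothetical.

```python
from statistics import mean, stdev

def detrended_residuals(y):
    """Residuals A(t) = Y(t) - (b0 + b1*T), with T = 1..n, fitted by least squares."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar, y_bar = mean(t), mean(y)
    b1 = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
          / sum((ti - t_bar) ** 2 for ti in t))
    b0 = y_bar - b1 * t_bar
    return [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]

# Hypothetical data: a pure trend Y(t) = 2*t with one pulse of +6 at t = 6.
y = [2.0 * t for t in range(1, 13)]
y[5] += 6.0

resid = detrended_residuals(y)
raw_z = abs(y[5] - mean(y)) / stdev(y)     # pulse in units of the raw SIGMA
res_z = abs(resid[5]) / stdev(resid)       # pulse in units of the residual SIGMA
print(round(raw_z, 2), round(res_z, 2))    # well under 1 versus above 3
```

Measured against the raw series the pulse is unremarkable; measured against the residuals it stands out.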
COUNTER EXAMPLE #2:
Consider the series 1,9,1,9,1,9,9,9,1,9.
The usual computation of the standard deviation would cause a failure to
detect the anomaly. Replacing the MEAN with the EXPECTED VALUE [Y(t-2)]
leaves a residual of zero wherever the alternating pattern holds, so the
typical residual is a lot smaller than the typical deviation from the
mean, and the anomaly in the 7th value is immediately identified.
Consider a graphical presentation of the OUTLIER
series 1,9,1,9,1,9,9,9,1,9. Note
that if one were to estimate the model: Y(t)
= b0 + b1*Y(t-2) + A(t), the least squares estimate of b1 would be
affected by the anomaly, but the unusual value at time period seven
would still be clearly identifiable.
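The residuals from the expected value Y(t-2) can be listed directly. This is only a sketch that uses Y(t-2) itself as the prediction (that is, b0 = 0 and b1 = 1 are assumed rather than estimated).

```python
y = [1, 9, 1, 9, 1, 9, 9, 9, 1, 9]

# Residual at time t (1-based) when the prediction is Y(t-2); defined for t >= 3.
resid = {t: y[t - 1] - y[t - 3] for t in range(3, len(y) + 1)}
print(resid)

# First period at which the observed value departs from its prediction.
first_break = min(t for t, r in resid.items() if r != 0)
print(first_break)
```

The residual at period 9 is also non-zero, but only because the anomalous 7th value serves as the prediction for period 9; every other residual is zero.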
COUNTER EXAMPLE #3:
Consider another example, this time of an INLIER:
1,9,1,9,1,9,5,9,1,9. The 7th value, 5, lies close to the overall mean
yet breaks the alternating pattern.
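A short sketch shows why a band around the mean cannot catch this inlier, while a comparison with Y(t-2) can. As before, Y(t-2) is used directly as the prediction, which is an assumption of the example.

```python
from statistics import mean

y = [1, 9, 1, 9, 1, 9, 5, 9, 1, 9]
u = mean(y)

# Position (1-based) of the value closest to the overall mean.
closest_to_mean = min(range(1, len(y) + 1), key=lambda t: abs(y[t - 1] - u))

# Periods where the value differs from its prediction Y(t-2).
resid = {t: y[t - 1] - y[t - 3] for t in range(3, len(y) + 1)}
breaks = [t for t, r in resid.items() if r != 0]

print(closest_to_mean, breaks)
```

The 7th value is the closest of all to the mean, so no choice of K would flag it, yet it is the first value to disagree with its prediction.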
COUNTER EXAMPLE #4:
Sometimes, due to predictable isolated
events such as a plant closing or a large order that arises, say,
every November, there are unusual values that are and should be
considered usual. Another way of saying this is that one analyst's noise
is another analyst's signal.
Y(t) = b0 + b1*SP(t) + A(t)
    t      SP(t)   MONTH
    1      0       JAN
    2-10   0       .
    11     1       NOV
    12     0       .
    13     0       JAN
    14-22  0       .
    23     1       NOV
    ...    0
where SP(t) = 0 except for November,
where it is equal to 1. Thus to filter or detect unusual values we must
first back out the usual in order to assess the reading.
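Backing out the usual can be sketched as follows. The sales figures are invented, t = 1 is taken to be a January, and the helper name fit_indicator is hypothetical. It relies on the fact that with a single 0/1 regressor the least-squares fit reduces to two group means.

```python
def fit_indicator(y, d):
    """Least-squares b0, b1 for Y = b0 + b1*D when D is a 0/1 indicator.

    b0 is the mean where D = 0, and b0 + b1 is the mean where D = 1.
    """
    off = [v for v, di in zip(y, d) if di == 0]
    on = [v for v, di in zip(y, d) if di == 1]
    b0 = sum(off) / len(off)
    b1 = sum(on) / len(on) - b0
    return b0, b1

# Hypothetical monthly sales over two years; November carries an extra 50.
months = range(1, 25)
sp = [1 if t % 12 == 11 else 0 for t in months]          # SP(t)
y = [100 + (t % 3) + 50 * s for t, s in zip(months, sp)]

b0, b1 = fit_indicator(y, sp)
resid = [v - (b0 + b1 * s) for v, s in zip(y, sp)]
print(round(b1, 2), [round(resid[t - 1], 2) for t in (11, 23)])
```

Once the November effect is part of the prediction, the November readings leave no residual to speak of, so they are no longer candidates for removal.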
COUNTER EXAMPLE #5:
This is similar to Example 4 except that the isolated
events do not occur at equally spaced intervals. This can happen because
Easter falls in one month in one year and in another month in a
following year. Likewise, St. Patrick's Day does not always fall in the
same numbered week of the year. Y(t)
= b0 + b1*I(t) + A(t), where I(t) is 0 except when the isolated
event occurred, when it is equal to 1.
COUNTER EXAMPLE #6:
This example illustrates a weakness in the classical procedure insofar
as it fails to deal with a sequence of values, none of which is
individually exceptional, but which collectively represent non-standard
values, i.e. a level shift. Y(t) = b0 + b1*L(t) + A(t), where
L(t) is 0 before the event and 1 thereafter.
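One way to illustrate the level shift model is to try each possible starting period for L(t) and keep the one with the smallest error sum of squares. This search is an assumption of the sketch, not a description of any particular product's method, and the data are hypothetical.

```python
def sse_for_shift(y, start):
    """Error sum of squares when L(t) = 1 from 1-based period `start` onward."""
    before, after = y[:start - 1], y[start - 1:]
    m_before = sum(before) / len(before)
    m_after = sum(after) / len(after)
    return (sum((v - m_before) ** 2 for v in before)
            + sum((v - m_after) ** 2 for v in after))

# Hypothetical series: no single value is more than 2 standard deviations
# from the overall mean, yet the level moves from about 10 to about 13.
y = [10, 11, 9, 10, 11, 9, 10, 13, 14, 12, 13, 14, 12, 13]

best = min(range(2, len(y) + 1), key=lambda s: sse_for_shift(y, s))
b0 = sum(y[:best - 1]) / (best - 1)              # level before the shift
b1 = sum(y[best - 1:]) / (len(y) - best + 1) - b0  # size of the shift
print(best, b0, b1)
```

A K*SIGMA band applied to this series would report nothing unusual, while the fitted L(t) places a shift of 3 units at period 8.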
THE MESSAGE IS CLEAR
To cleanse one has to model. A predictive equation is an integral part
of data filtering or data cleansing. Unusual values distort the
development of the predictive equation, but one can still be developed,
leading to data cleansing, which in turn leads to model refinement.
DATA CLEANSING (FILTERING) IN A CAUSAL SETTING OR MEASURING THE EFFECT
OF A PROMOTION
In order to accurately predict the effect of planned actions, we must
assess or estimate the "anticipated lift or increase" that
should arise when this action takes place again. We study historical
data and draw inference, i.e. "estimate" what has been the
"average response" of historical sales when similar prior
actions have taken place. Suppose that you "promote" at
six points in time and that for 5 out of the 6 periods you get an
increase in sales of 2 units. A biased estimate of the "lift"
would be 1.6667 [(5/6) * 2.0]. A "robust estimator of
the lift" would be 2, as the one anomaly (i.e. the one period of
non-response) would be discounted. View
a plot of the actual sales. Now we show
the six points in time when the historical promotion or action took
place. The
estimated response to the promotion, albeit biased by the anomaly,
is shown. This is shown graphically, with an obviously deficient fit
(forecast) at time period 20 (the point of non-response). Note also how,
for the other five points in time at which the promotion took place,
there is a serious
underestimation of the effect. This underestimation is due to the
biased estimator.
We now show the
robust estimation, where the anomaly has been identified and its impact
on the promotion response has been nullified. Notice that the estimated
anomaly is approximately 1.95, or approximately 2.0. This is also
presented graphically.
To summarize, we present the original sales
series with the unusual value marked at time period 20.
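The arithmetic behind the two lift estimates can be reproduced with a small sketch. The flat baseline of 10 units, the 30-period history and the promotion periods other than period 20 are all assumptions made for the example. Leaving period 20 out of the calculation has the same effect on the lift as giving that period its own pulse indicator.

```python
# Hypothetical history: a lift of 2 units at every promotion except
# period 20, where there was no response.
promo_times = [5, 10, 15, 20, 25, 30]
anomaly = 20
sales = {t: 10.0 for t in range(1, 31)}
for t in promo_times:
    if t != anomaly:
        sales[t] += 2.0

def lift(sales, promo_times, skip=()):
    """Mean sales in promoted periods minus mean sales in the others.

    Periods listed in `skip` are left out of both means.
    """
    on = [v for t, v in sales.items() if t in promo_times and t not in skip]
    off = [v for t, v in sales.items() if t not in promo_times and t not in skip]
    return sum(on) / len(on) - sum(off) / len(off)

biased = lift(sales, promo_times)                    # anomaly averaged in
robust = lift(sales, promo_times, skip=(anomaly,))   # anomaly discounted
print(round(biased, 4), robust)
```

With the non-response averaged in, the lift comes out at (5/6) * 2.0; with it discounted, the lift is the full 2 units.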