QUESTION:

Why and How do we Cleanse or Filter data OR Find the Unusual Value(s)?

ANSWER: To eliminate values that are not representative of the process that we are attempting to measure or describe.


QUESTION:

What does an unusual value do to model parameters?

ANSWER:

Parameter estimation hinges on finding the solution that minimizes the error sum of squares, so a single unusual value, whose error is squared, can pull the estimates away from the values that describe the bulk of the data. Incorrect parameter estimation can be avoided by robust procedures. However, Robust Regression only deals with one-time values, i.e. pulses. In the absence of robust procedures, early efforts in applied statistics tried to perform preliminary screening, but often reported failure. General and largely unsatisfactory remarks on unusual values tended to diminish the magnitude of the problem through errors of omission. Time series data offers opportunities and pitfalls not found in cross-sectional data.

QUESTION:

What do we mean by robust?

ANSWER:

Both the outlier (far from expectation) and the inlier (too usual looking) are exceptional or unusual values, not necessarily because of their size or magnitude, but because they are inconsistent with a reasonable prediction.
Thus to determine an unusual value one must first determine what a usual value is.


When computing a mean and using it to describe or summarize a sequence of data values, we are advised that the mean may not be truly representative of "central tendency" as it can be unduly affected by unusual values. We are told that a more robust estimate of "central tendency" would be the "median value". As a case in point, consider the "mean income of baseball players" and how it is affected by "unusual values".
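A small illustration of this point, using invented salary figures (the numbers are hypothetical and chosen only to show the effect): one very large value moves the mean a long way, while the median stays with the typical player.

```python
from statistics import mean, median

# Hypothetical salaries in $ millions, invented for illustration only.
salaries = [1, 1, 1, 2, 2, 30]

salary_mean = mean(salaries)      # pulled upward by the single large salary
salary_median = median(salaries)  # stays close to the typical value

print(salary_mean, salary_median)
```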

In a sentence, a "robust estimator" or "a method of robust estimation"
is one that is relatively "unaffected by violations of the assumptions" under which the estimation took place.

What is the standard approach? Unknowingly, most procedures assume the model or prediction equation:
Y(t) = U + A(t) and compute the standard deviation of the A's, which is identical to the standard deviation of the original Y series. They then proclaim that a central band of acceptable values, i.e. usual values, will fall in the range U +/- K*SIGMA, where U is the mean and SIGMA is the estimated standard deviation using standard computations. Values outside this range are considered unusual.
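The standard approach described here can be sketched in a few lines. This is only a sketch under stated assumptions: the function names `usual_band` and `unusual_indices` are made up for this illustration, SIGMA is taken to be the sample standard deviation, and the readings are invented.

```python
from statistics import mean, stdev

def usual_band(y, k):
    """Return (low, high) = U -/+ K*SIGMA, with SIGMA the sample standard deviation."""
    u, sigma = mean(y), stdev(y)
    return u - k * sigma, u + k * sigma

def unusual_indices(y, k):
    """0-based positions of the values that fall outside the band."""
    low, high = usual_band(y, k)
    return [i for i, v in enumerate(y) if v < low or v > high]

# Hypothetical readings, invented for illustration.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

flags_k2 = unusual_indices(readings, k=2)
flags_k3 = unusual_indices(readings, k=3)
print(flags_k2, flags_k3)
```

With K = 2 only the last reading is flagged. With K = 3 nothing is flagged, because the value of 50 inflates SIGMA enough to hide itself.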

POSSIBLE PROBLEMS (OR OPPORTUNITIES):

We will consider a number of possible circumstances where the actual data suggest a different underlying model, so that the standard band masks unusual values. It is possible to envision cases where the opposite would occur, i.e. values appear unusual only because of the bad model. This can happen when inliers deflate the standard deviation, causing one to under-estimate the acceptable range and thus to discard values that should not have been discarded.

 

COUNTER EXAMPLE #1:

Assume that the observed data arise in such a way that the following model or prediction equation approximately holds: Y(t) = b0 + b1*T + A(t), where T is a Trend Series (1, 2, 3, ..., t). If we use the standard deviation of the original series as the estimate of the allowable range, we are clearly going to miss a lot of unusual values, because the trend itself inflates that standard deviation. However, if we compute the standard deviation of the residuals from the expected value, then a much clearer picture of the allowable range emerges. In practice, we estimate the A's as A(t) = Y(t) - (b0 + b1*T) and compute the standard deviation from those values. Perhaps a review of the standard deviation might help.

When the average or mean changes level, some statistical forecasting tools get confused and declare that a trend is present. Consider [1] and compare it to [2]. When shown together the ambiguity dissolves. Consider another example of a sales series. Some of the unusual values are distinctly usual. Aside from the single pulse and the seasonal pulses, the series could be adequately described as a sequence of level shifts; the analysis can be summarized visually and contrasted with a trend model.
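For the trend model Y(t) = b0 + b1*T + A(t) of this example, the difference between the two standard deviations can be sketched as follows. The data are invented (a line with small noise and one pulse at t = 6), the helper names `fit_line` and `unusual` are hypothetical, and K = 2 is an arbitrary choice; this is an illustration of the idea, not a complete detection procedure.

```python
from statistics import mean, stdev

def fit_line(t, y):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1*t."""
    t_bar, y_bar = mean(t), mean(y)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b1 = sxy / sxx
    return y_bar - b1 * t_bar, b1

def unusual(values, k=2.0):
    """0-based positions more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]

# Hypothetical trending series; the value at t = 6 (position 5) is 6 units too high.
t = list(range(1, 13))
y = [12.5, 13.5, 16.5, 17.5, 20.5, 27.5, 24.5, 25.5, 28.5, 29.5, 32.5, 33.5]

b0, b1 = fit_line(t, y)
residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]  # A(t) = Y(t) - (b0 + b1*T)

raw_flags = unusual(y)               # band built from the original series
residual_flags = unusual(residuals)  # band built from the residuals
print(raw_flags, residual_flags)
```

On this series the band built from the original values flags nothing, while the band built from the residuals flags only position 5, i.e. t = 6.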

COUNTER EXAMPLE #2:

Consider the series 1,9,1,9,1,9,9,9,1,9. The usual computation of the standard deviation would cause a failure to detect the anomaly. Replacing the MEAN with the EXPECTED VALUE [Y(t-2)] leaves deviations of zero wherever the alternating pattern holds, so the spread around the expected value is a lot smaller for the usual values and the anomaly in the 7th value is immediately identified. Consider a graphical presentation of the OUTLIER series 1,9,1,9,1,9,9,9,1,9. Note that if one were to estimate the model Y(t) = b0 + b1*Y(t-2) + A(t), the least squares estimate of b1 would be affected by the anomaly, but the unusual value at time period seven would still be clearly identifiable.
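A minimal sketch of this comparison, assuming the expected value is simply the observation two periods earlier (taking Y(t-2) directly rather than fitting b0 and b1, a simplification made here) and using K = 2 for the standard band:

```python
from statistics import mean, stdev

series = [1, 9, 1, 9, 1, 9, 9, 9, 1, 9]

# Standard approach: flag values more than K sample standard deviations from the mean.
k = 2
u, sigma = mean(series), stdev(series)
naive_flags = [t for t, v in enumerate(series, start=1) if abs(v - u) > k * sigma]

# Alternative: the expected value at period t is the observation at period t-2.
lag = 2
errors = {t: series[t - 1] - series[t - 1 - lag]
          for t in range(lag + 1, len(series) + 1)}
departures = [t for t, e in errors.items() if e != 0]

print(naive_flags)   # periods flagged by the mean +/- K*SIGMA band
print(errors)        # prediction error for periods 3..10
print(departures)    # periods where the pattern is broken
```

The standard band flags nothing. The prediction errors are zero everywhere except at period 7 and again at period 9; the second departure is an echo, because the contaminated value at period 7 serves as the expected value for period 9.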

COUNTER EXAMPLE #3:

 

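Consider another example, this time of an INLIER: 1,9,1,9,1,9,5,9,1,9. The 7th value, 5, lies close to the mean of the series and so looks entirely usual in magnitude, yet it breaks the alternating pattern.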
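A quick check of why the inlier escapes the standard approach, again taking Y(t-2) as the expected value (an assumption carried over from the previous example): the 7th value is the one closest to the mean, so no band around the mean can ever flag it, yet it misses its expected value by 4.

```python
from statistics import mean

series = [1, 9, 1, 9, 1, 9, 5, 9, 1, 9]

u = mean(series)
# Distance of each value from the mean, keyed by period (1-based).
distance = {t: abs(v - u) for t, v in enumerate(series, start=1)}
closest_to_mean = min(distance, key=distance.get)

# Departure of the 7th value from its expected value Y(5).
error_at_7 = series[6] - series[4]

print(closest_to_mean, error_at_7)
```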

COUNTER EXAMPLE #4:

Sometimes, due to predictable isolated events such as a plant closing or a large order that arises, say, every November, there are unusual values that are and should be considered usual. Another way of saying this is that one analyst's noise is another analyst's signal.

Y(t) = b0 + b1*SP(t) + A(t)

  t    SP(t)
  1      0     JAN
  2      0     .
  3      0     .
  4      0     .
  5      0     .
  6      0     .
  7      0     .
  8      0     .
  9      0     .
 10      1     NOV.
 11      0
 12      0
 13      0     JAN
 14      0     .
 15      0     .
 16      0     .
 17      0     .
 18      0     .
 19      0     .
 20      0     .
 21      0     .
 22      1     NOV.
 ...     0

where SP(t) = 0 except for November, where it is equal to 1. Thus to filter or detect unusual values we must first back out the usual in order to assess the reading.
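Backing out the usual before assessing a reading can be sketched as below. The sales figures are invented, SP(t) is set to 1 at t = 10 and t = 22 as in the listing for this example, the helper name `flag` is hypothetical, and K = 2.5 is an arbitrary choice. With a single 0/1 regressor, the least squares estimates reduce to two group means, which keeps the sketch short.

```python
from statistics import mean, stdev

def flag(values, k):
    """0-based positions more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]

# Hypothetical monthly sales: a large order at t = 10 and t = 22,
# plus one genuinely odd reading of 27 at t = 17 (position 16).
sales = [20, 21, 19, 20, 22, 20, 21, 19, 20, 45, 20, 21,
         19, 20, 22, 20, 27, 19, 20, 21, 20, 46, 19, 20]
sp = [1 if t in (10, 22) else 0 for t in range(1, len(sales) + 1)]

# Least squares for Y(t) = b0 + b1*SP(t) + A(t) with a 0/1 regressor:
# b0 is the mean where SP = 0, and b0 + b1 is the mean where SP = 1.
b0 = mean([y for y, s in zip(sales, sp) if s == 0])
b1 = mean([y for y, s in zip(sales, sp) if s == 1]) - b0
residuals = [y - (b0 + b1 * s) for y, s in zip(sales, sp)]

raw_flags = flag(sales, k=2.5)           # before backing out the usual
residual_flags = flag(residuals, k=2.5)  # after backing out the usual
print(raw_flags, residual_flags)
```

On the raw series the two recurring orders are flagged and the odd reading is missed. After the recurring effect is backed out, the situation reverses: only position 16 is flagged.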

COUNTER EXAMPLE #5:

This is similar to Example 4 except that the isolated events do not occur at equally spaced intervals. This can happen because Easter falls in one month in one year and in another month in a following year. Likewise, St. Patrick's Day does not always fall in the same week of the year. Y(t) = b0 + b1*I(t) + A(t), where I(t) is 0 except when the isolated event occurred, and is then equal to 1.

COUNTER EXAMPLE #6:

This example illustrates a weakness in the classical procedure: it fails to deal with a sequence of values, none of which is individually exceptional, but which collectively represent non-standard values. Y(t) = b0 + b1*L(t) + A(t), where L(t) is 0 before the event and 1 thereafter.
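One simple way to locate such a shift when its timing is unknown is to try every candidate starting point for L(t) and keep the one with the smallest residual sum of squares. This exhaustive search is an assumption made for illustration, not necessarily the procedure the text has in mind; the data and the helper name `rss_with_break` are invented.

```python
from statistics import mean, stdev

def rss_with_break(y, b):
    """Residual sum of squares when L(t) switches from 0 to 1 at 0-based position b."""
    before, after = y[:b], y[b:]
    m0, m1 = mean(before), mean(after)
    return sum((v - m0) ** 2 for v in before) + sum((v - m1) ** 2 for v in after)

# Hypothetical series whose level rises by about 3 units from the 9th value on.
y = [10, 11, 9, 10, 11, 9, 10, 11, 13, 14, 12, 13, 14, 12, 13, 14]

# Classical check: no single value is more than 2 standard deviations from the mean.
u, sigma = mean(y), stdev(y)
naive_flags = [i for i, v in enumerate(y) if abs(v - u) > 2 * sigma]

# Search candidate break points, keeping at least two values on each side.
best = min(range(2, len(y) - 1), key=lambda b: rss_with_break(y, b))
b0 = mean(y[:best])       # level before the shift
b1 = mean(y[best:]) - b0  # size of the shift

print(naive_flags, best, b0, b1)
```

The classical check flags nothing, while the search places the shift at position 8 (the 9th value) with an estimated size of 3.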

 

THE MESSAGE IS CLEAR
To cleanse one has to model. A predictive equation is an integral part of data filtering or data cleansing. Unusual values distort the development of the predictive equation, but one can still be developed, leading to data cleansing, which in turn leads to model refinement.

DATA CLEANSING (FILTERING) IN A CAUSAL SETTING OR MEASURING THE EFFECT OF A PROMOTION

In order to accurately predict the effect of planned actions, we must assess or estimate the "anticipated lift or increase" that should arise when this action takes place again. We study historical data and draw inference, i.e. "estimate" what has been the "average response" of historical sales when similar prior actions have taken place. Consider that you "promote" at six points in time and that for 5 out of the 6 periods you get an increase in sales of 2 units. A biased estimate of the "lift" would be 1.6667 [(5/6) * 2.0]. A "robust estimator of the lift" would give 2, as the one anomaly (i.e. the one period of non-response) would be discounted. View a plot of the actual sales. Now we show the six points in time when the historical promotion or action took place. The estimated response to the promotion, albeit biased by the anomaly, is shown. This is shown graphically, with an obviously deficient fit (forecast) at time period 20 (the point of non-response). Note also how, for the other five points in time at which the promotion took place, there is a serious underestimation of the effect. This underestimation is due to the biased estimator.
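The arithmetic of the two estimates can be checked directly. In this sketch the median stands in for the robust estimator, which is a substitution made here for brevity rather than the estimation method behind the plots. Only period 20 comes from the text; the other promotion periods are hypothetical.

```python
from statistics import mean, median

# Observed lift at each promotion period; period 20 is the non-response.
lifts = {5: 2.0, 10: 2.0, 15: 2.0, 20: 0.0, 25: 2.0, 30: 2.0}
values = list(lifts.values())

biased_lift = mean(values)    # averages the non-response in with the rest
robust_lift = median(values)  # discounts the single anomaly

# Shortfall of each period relative to the robust lift.
anomalies = {t: v - robust_lift for t, v in lifts.items() if v != robust_lift}

print(round(biased_lift, 4), robust_lift, anomalies)
```

The biased estimate understates the lift at each of the five responding periods by about 0.33 units, and the anomaly at period 20 shows up as a shortfall of 2.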

We now show the robust estimation where the anomaly has been identified and its impact on the promotion response has been nullified. Notice that the estimated anomaly is approximately 1.95, or approximately 2.0. This is presented graphically. To summarize, we present the original sales series with the unusual value marked at time period 20.