QUESTION:  

Can You Give Me Another View of Outliers 

ANSWER:  

In article <376765C9.80A4903E@land4.nsu.ru>, Alexander Tsyplakov <tsy@land4.nsu.ru> wrote:

The aim of this message is to provide references for those interested in detecting outliers. In my previous message I mentioned a simple outlier test. Let me quote from that message.

If one assumes normality, then the following procedure for detecting outliers (observations not from the same sample) makes sense:

- Define a dummy variable which is one for some specific observation and zero elsewhere.
- Add this variable to the model and test the hypothesis that its coefficient is zero using a conventional t-statistic.

If the hypothesis is rejected, the observation is under suspicion (is an "outlier"). Of course, normal errors give extreme residuals from time to time. At a 1% significance level the probability of wrongly flagging a given observation is 1%, so one detects roughly 10 spurious "outliers" per 1000 observations.

This test was _not_ invented by me. It is actually a classical diagnostic test in regression analysis and can be found in many textbooks. Examples are

* Cook and Weisberg (1982) Residuals and Influence in Regression. Chapman and Hall.
* Sen and Srivastava (1990) Regression Analysis: Methods and Applications. Springer-Verlag.
* Kramer and Sonnberger (1986) The Linear Regression Model Under Test. Physica-Verlag.

This statistic is the same as the "studentized residual" (RStudent). It is convenient to plot the squared studentized residuals (which have an F distribution) to see whether there are any extreme cases.

Certainly, one has to scrutinize the data and check whether there is any qualitative evidence that these observations really are anomalous. However, in my experience this procedure is very helpful in applied work. Several times I was able to detect mistyped data this way. I would recommend that teachers mention this test in regression courses. I also wish producers of regression software would include it as a regression diagnostics option.
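To make the single-dummy test concrete, here is a minimal NumPy sketch of the procedure described above. The data, the planted outlier at observation 10, and the helper name `outlier_t_stat` are illustrative assumptions, not from the original post; the t-statistic on the dummy equals the externally studentized residual the post mentions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[10] += 5.0  # plant a suspicious observation for illustration

def outlier_t_stat(y, X, i):
    """t-statistic on a dummy that is 1 for observation i, 0 elsewhere.

    This is the classical dummy-variable outlier test: it equals the
    externally studentized residual (RStudent) for observation i."""
    n, k = X.shape
    d = np.zeros((n, 1))
    d[i, 0] = 1.0
    Xa = np.hstack([X, d])                    # model augmented with the dummy
    beta = np.linalg.lstsq(Xa, y, rcond=None)[0]
    resid = y - Xa @ beta
    dof = n - (k + 1)
    s2 = resid @ resid / dof                  # error variance estimate
    se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[-1, -1])  # s.e. of dummy coef
    return beta[-1] / se

X = np.column_stack([np.ones(n), x])          # intercept plus one regressor
t10 = outlier_t_stat(y, X, 10)
print(t10)  # a large |t| puts observation 10 under suspicion
```

Comparing `t10` with a t critical value (or scanning all n statistics, as the post suggests via a plot) flags the planted observation.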
The multiple-outlier extension of this test is straightforward. Let I be the set of observations under suspicion:

- Define a group of dummy (binary, dichotomous) variables, each of which is one for one observation from I and zero elsewhere.
- Add these variables to the model and test the hypothesis that their coefficients are all zero using a conventional F-statistic.

If the hypothesis is rejected, this can be taken as evidence that the observations are outliers. This test can also be viewed as a predictive failure test. References are

* Chow, G.C. "Tests of Equality between Sets of Coefficients in Two Linear Regressions," Econometrica, 28 (1960), 591-605.
* Salkever, D.S. "The Use of Dummy Variables to Compute Predictions, Prediction Errors and Confidence Intervals," Journal of Econometrics, 4 (1976), 393-397.
* Dufour, J.M. "Dummy Variables and Predictive Tests for Structural Change," Economics Letters, 6 (1980), 241-247.

This is a quite obvious and simple procedure. Later I'll try to comment on the Cook's D statistic mentioned by Prof. Ramirez.

Alexander Tsyplakov
Novosibirsk State University

David says:

The purpose of my post is twofold:

1. To thank Prof. T. for a clear and keen insight regarding this issue.
2. To invite readers to review http://www.autobox.com/outlier.html, which discusses this topic in its time series context.

Prof. T. didn't specifically mention, I believe, that outliers or individual pulses can and often do arise at the same point in a season, e.g. a spike every April. That would call for a seasonal pulse variable, which is 0 everywhere except, in this case, in April, where it is 1. AUTOBOX identifies the need for these kinds of variables and adds them to the "regression".
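The joint F-test above can be sketched in a few lines of NumPy. The data, the suspect set, and the helper name `rss` are illustrative assumptions: this is the standard restricted-versus-unrestricted F comparison, with one dummy per suspect observation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(scale=0.4, size=n)
suspects = [5, 20, 33]
for i in suspects:
    y[i] += 4.0  # plant anomalies at the suspect points for illustration

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

X = np.column_stack([np.ones(n), x])          # restricted model (no dummies)
D = np.zeros((n, len(suspects)))
for j, i in enumerate(suspects):
    D[i, j] = 1.0                             # one dummy per suspect observation
Xu = np.hstack([X, D])                        # unrestricted model

q = len(suspects)                             # number of restrictions tested
dof = n - Xu.shape[1]                         # residual degrees of freedom
F = ((rss(y, X) - rss(y, Xu)) / q) / (rss(y, Xu) / dof)
print(F)  # compare with an F(q, dof) critical value
```

A large F rejects the hypothesis that all dummy coefficients are zero, i.e. evidence that the suspect set contains outliers (equivalently, a predictive failure over those points).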
When you have an uninterrupted sequence of "unusual values" with approximately the same magnitude (value and sign), one should consider adding a variable of the form 0,0,0,0,0,1,1,1,1,1,1, which captures the "level shift" or "step shift". Often we can generate multiple level shifts, reflecting a large number of possibilities, such as

0,0,0,0,0,1,1,1,1,0,0

or

0,0,0,0,0,1,1,1,1,1,1 and 0,0,0,0,0,0,0,0,1,1,1

to deal with a second movement in the mean of the residuals.

Yet another type of dummy variable might be required to reflect a local time trend. Consider the observed series 3,3,3,3,4,5,6,7,8 ... an appropriate dummy variable is 0,0,0,1,2,3,4,5,6. AUTOBOX refers to this as a Local Time Trend, whose form is found by diagnostic model checking.

Note that (1-B)*L = P, thus L = P/(1-B), and (1-B)*T = L, thus T = L/(1-B), where P is a pulse, L is a level shift, T is a local time trend, and B is the backshift operator such that B*y(t) = y(t-1).

In summary, outliers or interventions can be treated with

PULSES
SEASONAL PULSES ... only applies to time series
LEVEL SHIFTS ... only applies to time series
TIME TRENDS ... only applies to time series

And finally, the word "outlier" is a classic statistical misnomer, as it should simply be "exceptional value". Consider the series 1,9,1,9,1,9,5,9. This series has an "inlier" -- can you find it? The value 5, nearly the mean, is unusual! This is not an "OUTLIER" but an "INLIER".

Regards,

Dave Reilly
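The three dummy shapes and the backshift identities above can be checked numerically. This is an illustrative sketch (not AUTOBOX code): `np.diff` plays the role of the differencing operator (1-B), and `np.cumsum` plays the role of its inverse 1/(1-B).

```python
import numpy as np

n = 11
P = np.zeros(n); P[5] = 1.0               # pulse at t = 5
L = np.zeros(n); L[5:] = 1.0              # level (step) shift from t = 5
T = np.cumsum(L)                          # local time trend: 0,...,0,1,2,3,...

# (1-B)*L = P and (1-B)*T = L: differencing turns the trend into the
# level shift and the level shift into the pulse (the first element of
# each series is lost to differencing).
print(np.diff(L))   # matches P from t = 1 onward
print(np.diff(T))   # matches L from t = 1 onward

# and the inverse relations L = P/(1-B), T = L/(1-B) via cumulative sums
print(np.cumsum(P))  # reproduces L
```

The trend dummy 0,0,0,1,2,3,4,5,6 from the example is exactly the cumulative sum of a step that turns on at the same point, which is what the identity T = L/(1-B) says.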