outlier detection

QUESTION:

Can I get help with the following? I am working on five different genotypes of sunflowers. I have used regression to study the trends in a few parameters. Some parameters have a linear fit and some have a quadratic fit (parameter * time curves). Now I have problems comparing those curves. I tried using a covariance model to compare intercept and slope separately, but that won't tell which lines differ significantly.

- Using SAS, can we compare the linear regression equations? Can we use any kind of mean comparison method to compare the curves?
- If one genotype fits a linear curve and the other a quadratic curve, can we compare them?

ANSWER:

This question is as old as time itself (pun intended). How does one distinguish between trend, randomness, or both? Should a trend line (or perhaps multiple trends) be used, or should the forecast be based explicitly on the last value plus or minus a constant? A trend model is a particular case of a DETERMINISTIC model, i.e. an intervention model where the user assumes that there is one and only one trend and that the trend starts immediately at time period 1 and continues through the last point. This is an example of an INTERVENTION MODEL where the user knows this truth a priori and proceeds to estimation. INTERVENTION DETECTION can also lead to this model if the data so evidences it. The root problem, or opportunity, is how you model the relationship of the sequential, equally spaced observations. There are two major approaches, and I believe that the data should be allowed to suggest, by analysis, which one is appropriate. Following is a brief discussion of the issue.

HOW TO INCORPORATE TIME

Traditional time-oriented regression analysis is usually presented in the following form:

Y(t) = W0 + W1*T + A(t)

where W0 is the intercept, W1 is the trend or slope, T is the counting numbers 1, 2, 3, ..., N, and A is a Gaussian process. Y, of course, is the observed recordings or readings made at N equidistant and consecutive points in time. In the following, for pedagogical purposes, we assume a simple trend and a simple AR(1) to illustrate the point.

3 Possibilities:

1 Deterministically   Y(t) = W0 + W1*T
2 Stochastically      Y(t) = W0 + W1*Y(t-1)
3 Both                Y(t) = W0 + W1*T + B2*Y(t-1)

DETERMINISTIC MODEL

It is conventional in econometric model building to use polynomials and dummy variables to describe "trends". Such methods are unsatisfactory, since it is unlikely that such deterministic trends are adequate to describe the development of observed time series, implying as they do that "growth rates remain constant indefinitely".

STOCHASTIC MODEL

By incorporating differences or lags, one can capture stochastic or adaptive trends in the model.

EXAMPLE OF A DETERMINISTIC MODEL

Y(t) = W0 + W1*T + A(t)

where T = 1, 2, 3, ..., N

The forecast for time period N+1 is independent of the most recent observation Y(N), save for the fact that the parameters (W0 and W1) are estimated using the Y(N) reading. Even if one recomputed the model coefficients after each new reading, the impact on the one-period-out forecast would be small, since each observation is equally weighted and no particular importance is placed on recent events or readings. The fact that all N observations participate in an egalitarian way, aside from weighted least squares, is both the strength and the weakness of this model.
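To make the equal weighting concrete, here is a minimal Python sketch. The series, the size of the shock, and the helper names `fit_trend` and `trend_forecast` are all made up for illustration. It fits the trend line by ordinary least squares, re-estimates it after changing only the last reading, and reports how far the one-period-out forecast moves.

```python
def fit_trend(y):
    """Ordinary least squares for Y(t) = W0 + W1*T with T = 1..N."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    w1 = sxy / sxx
    w0 = y_bar - w1 * t_bar
    return w0, w1


def trend_forecast(y):
    """One-period-out forecast W0 + W1*(N+1), coefficients re-estimated from y."""
    w0, w1 = fit_trend(y)
    return w0 + w1 * (len(y) + 1)


# Hypothetical series: an exact trend 2 + 0.5*T over 20 periods
base = [2 + 0.5 * t for t in range(1, 21)]
# Same series, but with the last reading raised by 5 units
shocked = base[:-1] + [base[-1] + 5.0]

shift = trend_forecast(shocked) - trend_forecast(base)
print(round(shift, 3))
```

With these 20 points the forecast moves by about 1.0, a fifth of the 5-unit shock, even though the coefficients were re-estimated. The share carried by the last reading shrinks further as N grows.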

EXAMPLE OF A STOCHASTIC MODEL

Y(t) = W0 + W1*Y(t-1) + A(t)

where t = 1, 2, 3, ..., N

The forecast for time period N+1 depends on the most recent observation Y(N), since Y(N) is used explicitly and the parameters (W0 and W1) are estimated using the Y(N) reading. Even if one recomputed the model coefficients after each new reading, the impact on the one-period-out forecast would still be significant, due to the explicit dependence on the previous observation. The fact that all N observations participate in an egalitarian way in the estimation of the parameters, but the forecast is dependent on the most recent reading, is both the strength and the weakness of this model.
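The dependence on the latest reading can be seen in a small Python sketch. The series is a made-up, noise-free AR(1) path, and `fit_ar1` and `ar1_forecast` are hypothetical helper names. For simplicity the coefficients are held fixed when the last reading is changed, which is an assumption of this example.

```python
def fit_ar1(y):
    """OLS regression of Y(t) on Y(t-1); returns (W0, W1)."""
    x = y[:-1]  # lagged values Y(t-1)
    z = y[1:]   # current values Y(t)
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    w1 = sxz / sxx
    w0 = z_bar - w1 * x_bar
    return w0, w1


def ar1_forecast(w0, w1, last_value):
    """One-period-out forecast W0 + W1*Y(N)."""
    return w0 + w1 * last_value


# Hypothetical noise-free series generated by Y(t) = 1 + 0.8*Y(t-1)
y = [10.0]
for _ in range(19):
    y.append(1 + 0.8 * y[-1])

w0, w1 = fit_ar1(y)

# Change only the last reading by 5 units, keeping the coefficients fixed
shift = ar1_forecast(w0, w1, y[-1] + 5.0) - ar1_forecast(w0, w1, y[-1])
print(round(w0, 3), round(w1, 3), round(shift, 3))
```

Here the forecast moves by W1 times the shock, about 4 units out of 5, because Y(N) enters the forecast directly rather than only through the estimated coefficients.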

AN EXAMPLE OF HOW ONE MIGHT IDENTIFY THE NEED FOR A SIMPLE TREND MODEL
Suppose that the following model is the true model:
Y(t) = W0 + W1*T + A(t)   ...EQUATION 1...
Since EQUATION 1 holds, EQUATION 2 follows (simply backshifting one period):
Y(t-1) = W0 + W1*(T-1) + A(t-1)   ...EQUATION 2...
Subtract EQUATION 2 from EQUATION 1 to get EQUATION 3:
Y(t) - Y(t-1) = W0 - W0 + W1*T - W1*(T-1) + A(t) - A(t-1)
Simplifying, and recognizing that T - (T-1) is unity, we get:
Y(t) - Y(t-1) = W1 + A(t) - 1*A(t-1)
Writing B for the backshift operator, so that B Y(t) = Y(t-1), this is:
(1-B)Y(t) = W1 + A(t) - 1*A(t-1)
(1-B)Y(t) = W1 + (1 - 1*B)A(t)   ...EQUATION 3...

which is a first-order difference model with a moving-average coefficient of 1. Thus if one "identifies" a stochastic model that requires differencing to make it stationary and, upon estimation, the moving-average coefficient is 1, the conclusion is that the stochastic model might be inadequate and should be replaced by a deterministic model. In practice, one simply lets the alternative approaches vie for supremacy or dominance by efficient search procedures leading to parsimonious models that can, and often do, include both kinds of structures. AUTOBOX conducts these numerical tournaments in its development of adequate model forms. The issue here was to show how a deterministic trend model of the form given in EQUATION 1 can be expressed as a particular ARIMA model with a root on the unit circle.

After one has adequately described each of the time series, one can take the most complicated or general model and estimate it locally, using each time series independently. The Chow test, developed in the early 1960s, then tests the hypothesis of a common set of coefficients across all groups. AUTOBOX has extended the Chow test to time series. I am sure that if you have good SAS skills it shouldn't take too much time to implement this in SAS. On the other hand, AUTOBOX might be a viable alternative since these features are already in place.
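As a rough illustration of the classical Chow idea (not AUTOBOX's time series extension, and not SAS code), the following Python sketch fits the most general model, here assumed to be a quadratic in time, to each group separately and to the pooled data, then forms the F statistic for a common set of coefficients. The two "genotype" series are invented numbers, and `ols_sse`, `design` and `chow_f` are hypothetical helper names. The sketch assumes independent, equal-variance errors and ignores any autocorrelation.

```python
def ols_sse(rows, y):
    """Residual sum of squares from least squares via the normal equations."""
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, v in zip(rows, y))


def design(t_values):
    """Design rows [1, t, t^2] for the quadratic (most general) model."""
    return [[1.0, float(t), float(t * t)] for t in t_values]


def chow_f(groups):
    """groups: list of (t_values, y_values). Returns (F, df1, df2)."""
    k = 3                                   # coefficients per group
    g = len(groups)
    n = sum(len(t) for t, _ in groups)
    sse_separate = sum(ols_sse(design(t), y) for t, y in groups)
    all_t = [ti for t, _ in groups for ti in t]
    all_y = [yi for _, y in groups for yi in y]
    sse_pooled = ols_sse(design(all_t), all_y)
    df1 = (g - 1) * k                       # restrictions imposed by pooling
    df2 = n - g * k                         # residual df of the separate fits
    f_stat = ((sse_pooled - sse_separate) / df1) / (sse_separate / df2)
    return f_stat, df1, df2


# Invented readings for two genotypes at 8 equally spaced times
t = list(range(1, 9))
geno_a = [2.6, 3.1, 3.4, 4.1, 4.4, 5.1, 5.4, 6.1]   # roughly linear
geno_b = [2.4, 3.9, 5.1, 5.9, 6.6, 6.9, 7.1, 6.9]   # roughly quadratic

f_stat, df1, df2 = chow_f([(t, geno_a), (t, geno_b)])
print(round(f_stat, 2), df1, df2)
```

A large F relative to the F distribution on (df1, df2) degrees of freedom argues against a common curve. Because both groups are fitted with the same general form, a genotype that is really linear simply gets a quadratic coefficient near zero, which is how a linear and a quadratic curve can be compared in one test.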

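The differencing argument that leads to EQUATION 3 can also be checked numerically. The sketch below simulates a deterministic trend plus white noise with made-up parameter values, takes first differences, and computes the lag-1 autocorrelation. A moving-average coefficient of 1 implies a lag-1 autocorrelation of -1/(1 + 1) = -0.5, and the mean of the differences should sit close to W1.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n, w0, w1 = 5000, 10.0, 0.3  # illustrative values, not from the source

# Deterministic trend plus Gaussian white noise A(t)
y = [w0 + w1 * t + random.gauss(0.0, 1.0) for t in range(1, n + 1)]

# First differences (1-B)Y(t)
d = [y[i] - y[i - 1] for i in range(1, n)]

mean_d = sum(d) / len(d)
dev = [v - mean_d for v in d]

# Lag-1 sample autocorrelation of the differenced series
c0 = sum(v * v for v in dev)
c1 = sum(dev[i] * dev[i - 1] for i in range(1, len(dev)))
r1 = c1 / c0

print(round(mean_d, 3), round(r1, 3))
```

A lag-1 autocorrelation near -0.5 in a differenced series is therefore a hint that the differencing may have been unnecessary and that a deterministic trend is the simpler description.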
Note that intervention detection can lead to models like

Y = W0 + W1*T1 + B2*T2

WHERE T1 = 1 2 3 4 5 6 ... T
WHERE T2 = 0 0 0 1 2 3 ... T-3

or, equivalently,

Y = W0 + W1*S1/(1-B) + B2*S2/(1-B)

WHERE S1 = 1 1 1 1 1 1 1 1 FOR ALL T
WHERE S2 = 0 0 0 1 1 1 1 1 FOR ALL T

SINCE A STEP IS EQUAL TO THE FIRST DIFFERENCES OF A TREND
AND A PULSE IS EQUAL TO THE FIRST DIFFERENCES OF A STEP
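The relationship between trend, step and pulse variables is easy to verify with a few lines of Python. The series length of 8 and the start period of 4 are chosen only to mirror the patterns shown above, and `diff` is a hypothetical helper that treats the value before the first period as 0.

```python
def diff(series):
    """First differences, taking the value before the series starts as 0."""
    return [cur - prev for prev, cur in zip([0] + series[:-1], series)]


n = 8
start = 4  # assumed period in which the second trend begins

t1 = list(range(1, n + 1))                              # 1 2 3 4 5 6 7 8
t2 = [max(0, t - start + 1) for t in range(1, n + 1)]   # 0 0 0 1 2 3 4 5

s1 = diff(t1)   # differencing a trend gives a step
s2 = diff(t2)
p2 = diff(s2)   # differencing a step gives a pulse

print(s1)
print(s2)
print(p2)
```

Read in the other direction, dividing by (1-B) is a running sum: accumulating the pulse gives the step S2, and accumulating the step gives the trend T2.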

The presence of local trends is often overlooked in time series models. If you enable this feature, AUTOBOX will test for a variety of TIME variables and identify the optimal point where each trend starts.
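How AUTOBOX searches for these points is not described here, so the following is only a brute-force sketch of the general idea, under the assumption of a single additional trend: try every candidate start period, fit Y = W0 + W1*T1 + B2*T2 by least squares, and keep the start with the smallest residual sum of squares. The data are invented, and `ols_sse` and `best_trend_start` are hypothetical helper names.

```python
def ols_sse(rows, y):
    """Residual sum of squares from least squares via the normal equations."""
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, v in zip(rows, y))


def best_trend_start(y):
    """Return the start period of a second trend that minimizes the SSE."""
    n = len(y)
    best = None
    # Starts 1 and 2 are skipped: T2 would be collinear with [1, T1]
    for start in range(3, n):
        rows = [[1.0, float(t), float(max(0, t - start + 1))]
                for t in range(1, n + 1)]
        sse = ols_sse(rows, y)
        if best is None or sse < best[1]:
            best = (start, sse)
    return best[0]


# Invented noise-free data whose second trend starts at period 4
y = [1 + 0.5 * t + 1.5 * max(0, t - 3) for t in range(1, 13)]
print(best_trend_start(y))
```

On real, noisy data one would also want a significance test for B2 before accepting the extra trend, and a search over more than one break. Both are left out of this sketch.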