A TYPICAL REQUEST FOR FUNCTIONAL INFORMATION:

  Please respond to the following specific questions regarding AUTOBOX.

ANSWER:

 

1.0 QUESTION : PLEASE PROVIDE AN OVERVIEW OF AUTOBOX'S FORECASTING TECHNIQUES.

 

 

AUTOBOX handles a single endogenous equation incorporating either pre-identified causal series or empirically identified dummy series that are found to be statistically significant. The set of pre-identified series can be either stochastic or deterministic (dummy) in form.

In its search for the most appropriate model form and the optimal set of parameters the program can either:

1. Proceed purely empirically, or

2. Start from a user-supplied model.

A final model may require one or more of the following

structures:

1. Power transforms like Log, Square Root, Reciprocal

etc.

2. Variance stabilization due to deterministic changes

in the background error variance.

3. Data segmentation or splitting as evidenced by a

statistically significant change in either model form or

parameters.

En route to its final result, AUTOBOX evaluates numerous candidate models and parameter sets suggested by the data itself. In practice, a realistic limit is set on the maximum number of model-form iterations. The exact specifics of each tentative model are not pre-set; this is where the power of AUTOBOX emerges. The kind and form of the tentative models may never have been tried before. Each dataset speaks for itself and drives the iterative process. The Final Model could be as simple as:

 

 

1. A simple trend model or a simple ordinary least

squares model.

2. An exponential smoothing model.

3. A simple weighted average where the weights are

either equal or unequal.

4. A Cochrane-Orcutt model, i.e. ordinary least squares with a first-order autoregressive fixup.

5. A simple ordinary least squares model in differences

containing some needed lags.

6. A spline-like set of local trends superimposed with

an arbitrary ARIMA model and perhaps a pulse or two.

The number of possible final models that AUTOBOX could

find is infinite and only discoverable via a true expert

system like AUTOBOX.

A final model may require one or more of the following

seasonal structures:

1. Seasonal ARIMA structure where the prediction depends

on some previous reading S periods ago.

2. Seasonal structure via a complete set of seasonal

dummies reflecting a fixed response based upon the particular

period.

3. Seasonal structure via a partial set of seasonal

dummies reflecting a fixed response based upon the particular

period.

The Final Model will satisfy both:

1. Necessity tests that guarantee each estimated coefficient is statistically significant.

2. Sufficiency tests that guarantee that the error process:

- is unpredictable from its own past;

- is not predictable from the set of causals;

- has a constant mean of zero.

The Final Model will contain one or more of the following structures:

1. CAUSAL, with the correct lead/lag specification.

2. MEMORY, with the correct "autoregressive memory".

3. DUMMY, with the correct pulses, level shifts or spline-like time trends.

 

2.0 QUESTION : IS AUTOBOX AUTOMATED, SEMI-AUTOMATED OR MANUAL ?

 

AUTOBOX provides automated, semi-automated and manual capabilities. AUTOBOX has a complete set of forecasting features that will appeal to both novice and expert forecasters. Autobox's automatic features are unparalleled in breadth and depth of implementation. Autobox is truly the power forecaster's dream tool, with a palette of tools that allows the forecaster to build models that work.

 

 

 

AFS was the first company to automate the BJ model building process. Our approach is to program the model identification, estimation and diagnostic feedback loop as originally described by Box and Jenkins. This is implemented for both ARIMA (univariate) modeling and Transfer Function (multivariate or regression) modeling. What this means is that the user, from novice to expert, can feed Autobox any number of series and the program's powerful modeling heuristic will do the work. This option is implemented such that it can be turned on at any stage of the modeling process. There is complete control over the statistical sensitivities for the inclusion/exclusion of model parameters and structures. These features allow the user complete control over the modeling process. The user can let Autobox do as much or as little of the model building process as the user or the complexity of the problem dictates.

 

Autobox comes with a complete set of identification and modeling tools for use in the BJ framework. This means that you have the ability to transform or prewhiten the chosen series for identification purposes. Autobox handles both ARIMA (univariate) modeling and Transfer Function (multivariate) modeling, allowing for the inclusion of interventions (see below for more information). Tests for interventions, the need for transformations, and the need to add or delete model parameters are all available. Autocorrelation (both traditional and robust), partial autocorrelation and cross-correlation functions and their respective tests of significance are calculated as needed. Model fit statistics, including R-squared, SSE, variance of errors and adjusted variance of errors, are all reported. Information criteria statistics for alternate model identification approaches are provided.

 

One of the most powerful features of Autobox is the

inclusion of Automatic Intervention detection capabilities in

both ARIMA and Transfer Function models. Almost all

forecasting packages allow for interventions to be included

in a regression model. What these packages don't tell you is

how sensitive all forecasting methodologies are to the impact

of interventions or missing variables. These packages don't

tell you if your series may be influenced by missing

variables or changes that are outside the current model. If

a data series is impacted by changes in the underlying

process at discrete points in time, both ARIMA models and

Transfer Function models will produce poor results. For example, a competitor's price change changes the level of demand for your product. Without a variable to account for this change, your forecast model will perform poorly. Autobox implements groundbreaking techniques which quickly and accurately identify potential interventions (level shifts, seasonal pulses, single-point outliers and changes in the variance of the series). These variables can then be included in your model at your discretion. The result is more robust models and greater forecast accuracy.

 

 

 

All forecast packages allow you to produce forecasts using the models you have constructed. Autobox presents the critical information you need to determine if those forecasts are acceptable. Autobox has options that allow you to analyze the stability and forecasting ability of your forecast model. This is achieved through a series of ex post forecast analyses. You can automatically withhold

any number of observations, reestimate the model form and

forecast. Observations are then added back one at a time and

the model is reestimated and reforecast. Forecast accuracy

statistics, including Mean Absolute Percent Error (MAPE) and

Bias, are calculated at each forecast end point. Thus the

stability of the model and its ability to forecast from

various end points can be analyzed. Finally, you can

optionally allow Autobox to actually re-identify the model

form at each level of withheld data to see if the model form

is unduly influenced by recent observations.
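The withhold/re-estimate/re-forecast loop described above can be sketched in a few lines. This is an illustrative sketch, not AUTOBOX code: the "model" is deliberately a stand-in (the mean of the fitted window), and the series values are invented.

```python
# Sketch of a rolling-origin ("ex post") evaluation loop. AUTOBOX would
# re-estimate a full ARIMA/transfer function model at each origin; here a
# simple mean of the fitted window stands in for the model.

def rolling_origin_eval(series, n_withheld):
    """Withhold the last n_withheld points, then from each successive
    origin re-fit on the data up to that origin and forecast one step."""
    results = []
    for origin in range(len(series) - n_withheld, len(series)):
        fit_window = series[:origin]                  # data up to the origin
        forecast = sum(fit_window) / len(fit_window)  # stand-in model: the mean
        actual = series[origin]
        results.append((origin, actual, forecast, actual - forecast))
    return results

series = [397, 378, 472, 370, 395, 427, 410, 401]
evals = rolling_origin_eval(series, n_withheld=3)
bias = sum(err for *_, err in evals) / len(evals)   # mean error across origins
```

At each origin the window grows by one observation and the model is re-fit, which is exactly the add-back-one-at-a-time scheme described above.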

 

3.0 QUESTION : PLEASE EXPLAIN YOUR APPROACH TO CAUSAL MODELLING ?

 

QUICK ANSWER: AUTOBOX provides a very comprehensive range of causal models, including but not limited to incorporating lead effects as well as contemporaneous and lag effects. It can detect and compensate for changes in variance, changes in model form and changes in parameters.

 

LONG ANSWER:

Multiple Regression was originally developed for cross-sectional data, but Statisticians/Economists have been applying it (mostly incorrectly) to chronological or longitudinal data with little regard for the Gaussian assumptions of a constant mean of the errors, constant variance, identical distribution of the errors and independence of the errors. AUTOBOX tests for and remedies any proven violations.

Following is a brief introduction to time series analysis.

Time series = a sequence of observations taken on a

variable or multiple variables at successive points in time.

Objectives of time series analysis:

1. To understand the structure of the time series (how

it depends on time, itself, and other time series variables)

2. To forecast/predict future values of the time series

What is wrong with using regression for modeling time

series?

* Perhaps nothing. The test is whether the residuals

satisfy the regression assumptions: linearity,

homoscedasticity, independence, and (if necessary) normality.

It is important to test for Pulses or one-time unusual values

and to either adjust the data or to incorporate a Pulse

Intervention variable to account for the identified anomaly.

Unusual values can often arise seasonally, thus one has to identify and incorporate Seasonal Intervention variables.

Unusual values can also arise at successive points in time, earmarking the need for a Level Shift Intervention to deal with the proven mean shift in the residuals.

* Often, time series analyzed by regression suffer from

autocorrelated residuals. In practice, positive

autocorrelation seems to occur much more frequently than

negative.

* Positively autocorrelated residuals make regression

tests more significant than they should be and confidence

intervals too narrow; negatively autocorrelated residuals do

the reverse.

* In some time series regression models, autocorrelation produces biased estimates, where the bias cannot be removed no matter how many data points or observations you have.

To use regression methods on time series data, first

plot the data over time. Study the plot for evidence of

trend and seasonality. Use numerical tests for

autocorrelation, if not apparent from the plot.

* Trend can be dealt with by using functions of time as

predictors. Sometimes we have multiple trends and the trick

is to identify the beginning and end periods for each of the

trends.

* Seasonality can be dealt with by using seasonal indicators (Seasonal Pulses) as predictors or by allowing specific auto-dependence or auto-projection such that historical values ( Y(t-s) ) are used to predict Y(t).

* Autocorrelation can be dealt with by using lags of the response variable Y as predictors.

* Run the regression and diagnose how well the

regression assumptions are met.

* The residuals should have approximately the same variance (homoscedasticity); otherwise some form of "weighted" analysis might be needed.

* The model form/parameters should be invariant, i.e. unchanging over time. If not, then we perhaps have too much data and need to determine at what points in time the model form or parameters changed.
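As a toy illustration of this checklist (the series and helper functions are invented for the example, not taken from AUTOBOX), one can fit a time trend by least squares and then inspect the residuals for leftover lag-one autocorrelation:

```python
# Fit a linear time trend by least squares, then check the residuals for
# leftover lag-1 autocorrelation (the classic symptom of mis-specification).

def fit_trend(y):
    """Least-squares line y = a + b*t for t = 0, 1, ..., n-1."""
    n = len(y)
    t = range(n)
    tbar, ybar = sum(t) / n, sum(y) / n
    b = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
         / sum((ti - tbar) ** 2 for ti in t))
    return ybar - b * tbar, b

def lag1_autocorr(r):
    """Sample lag-1 autocorrelation of a residual vector."""
    mu = sum(r) / len(r)
    num = sum((r[i] - mu) * (r[i - 1] - mu) for i in range(1, len(r)))
    den = sum((v - mu) ** 2 for v in r)
    return num / den

y = [10, 12, 13, 15, 18, 19, 21, 24, 25, 27]     # a trending series
a, b = fit_trend(y)
resid = [yi - (a + b * ti) for ti, yi in enumerate(y)]
rho = lag1_autocorr(resid)    # far from +/-1 here: the trend suffices
```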

 

Time series data presents a number of problems/opportunities that standard statistical packages either avoid or ignore:

1. How to determine the temporal relationship for each input series, i.e. is the relationship contemporaneous, lead or lag, or some combination? (How to identify the form of a multi-input transfer function without assuming independence of the inputs.)

2. How to determine the ARIMA model for the noise structure reflecting omitted variables.

3. How to do this in a ROBUST MANNER where pulses, seasonal pulses, level shifts and local time trends are identified and incorporated.

4. How to test for and include specific structure to deal with non-constant variance of the error process.

5. How to test for and treat non-constancy of parameters or model form.

6. Do we model the original series or the differenced series?

AUTOBOX deals with these issues and more.

 

A very natural question arises in the selection and

utilization of models. One asks,"Why not use simple models

that provide uncomplicated solutions?" The answer is very

straightforward, "Use enough complexity to deal with the

problem and not an ounce more". Restated, let the data speak

and validate all assumptions underlying the model. Don't

assume a simple model will adequately describe the data. Use

identification/validation schemes to identify the model.

 

A transfer function can be expressed as a lagged autoregression in all variables in the model. AUTOBOX reports this form so users can go directly to spreadsheets for whatever purposes they require. Care should be taken to deal with Gaussian violations such as outliers (pulses), level shifts, seasonal pulses, local time trends, changes in variance, changes in parameters and changes in models, just to name a few.

 

 

4.0 QUESTION : PLEASE DESCRIBE SOME OF THE LIMITATIONS IN AUTOBOX ?

 

A maximum of 29 input series is allowed, but larger versions are available. A maximum of 600 observations is allowed, but larger versions are available. A maximum of 156 periods can be forecasted, but larger versions are available.

5.0 QUESTION : CAN MODELS BE STORED FOR REUSE ?

 

AUTOBOX delivers the model in a machine-readable fashion

so that it can be reused at a later point.

 

6.0 QUESTION : DOES AUTOBOX HANDLE OUT-OF-SAMPLE DATA FOR MODEL

TESTING AND/OR MODEL IDENTIFICATION ?

 

AUTOBOX provides a comprehensive summary of forecasting

performance and allows for forecast errors to be tracked by

origin and by lead time. In a sentence, one 12-period forecast error is not the same as twelve one-period forecast errors. Kind of obvious, but most people never think of tracking performance from different origins.

ACTUAL DATA & FOUR PERIOD OUT FORECASTS FROM SIX ORIGINS

(LEAD TIME = 4 ; ORIGINS = 6)

ORIGIN\DATE   1984/3  1984/4  1984/5  1984/6  1984/7  1984/8

ACTUAL           397     378     472     370     395     427

1984/2           308     328     399     355

1984/3                   347     472     387     431

1984/4                           396     404     426     439

1984/5                                   421     436     441

1984/6                                           444     448

1984/7                                                   444

 

Note: We have 6 estimates of a one-period forecast

error

Note: We have 5 estimates of a two-period forecast

error

Note: We have 4 estimates of a three-period forecast

error

Note: We have 3 estimates of a four-period forecast

error
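Transcribing the table makes the error counts in the notes explicit. A small sketch (the dictionaries below are just the table above in code form):

```python
# Forecast errors by lead time, from the origin/date table above.
actual = {3: 397, 4: 378, 5: 472, 6: 370, 7: 395, 8: 427}  # 1984/<period>
forecasts = {           # origin period -> forecasts for origin+1, origin+2, ...
    2: [308, 328, 399, 355],
    3: [347, 472, 387, 431],
    4: [396, 404, 426, 439],
    5: [421, 436, 441],
    6: [444, 448],
    7: [444],
}

errors_by_lead = {}
for origin, fcsts in forecasts.items():
    for lead, f in enumerate(fcsts, start=1):
        target = origin + lead
        if target in actual:
            errors_by_lead.setdefault(lead, []).append(actual[target] - f)

counts = {lead: len(errs) for lead, errs in sorted(errors_by_lead.items())}
# counts -> {1: 6, 2: 5, 3: 4, 4: 3}, matching the four notes above
```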

 

 

Measures To Assess Forecast Model Performance

The above table contains all the raw data necessary to

assess how predictable the future is for alternative lead

times. The point is simple and profound. Forecasting

accuracy from a single launch point generates correlated

forecast errors. Forecast error analyses should use a number

of different origins and a number of lead times. To

reiterate, a set of historical values is used to perform some modeling activity, be it automatic or not, in order to come up with a model and a set of coefficients. These are then fixed, and each of the withheld observations is then used as the launching point for a new set of forecasts. With this approach, the new observations don't fully participate in the modeling process and affect only the forecast, not the model form or its parameters.

Typical Output Tables From AUTOBOX

VALUES ARE IN TERMS OF THE ORIGINAL METRIC

Number of Actuals                            3
Forecast Mean Deviation (Bias)         469.327
Forecast Mean Percent Error            1.20294
Forecast Mean Absolute Deviation       1601.47
Forecast Mean Absolute % Error         4.26474
Forecast Variance (Precision)          .39E+07
Forecast Bias Squared (Reliability)     220268
Forecast Mean Square Error (Accuracy)  .41E+07
Relative Absolute Error                  .3969

 

Typical Output Tables From AUTOBOX (MORE)

Lead   MEAN        MEAN %   MEAN        MEAN
Time   DEVIATION   ERROR    ABSOLUTE    ABSOLUTE %
       (BIAS)               DEVIATION   ERROR

1      .18E+04     4.49     .41E+04     10.75
2      .22E+04     5.49     .36E+04     12.35
3      .38E+04     5.49     .48E+04     15.75
4      .15E+04     2.29     .21E+04      5.75

 

Typical Output Tables From AUTOBOX (MORE)

Lead   VARIANCE      BIAS            MEAN        RELATIVE
Time   (PRECISION)   SQUARED         SQUARE      ABSOLUTE
                     (RELIABILITY)   ERROR       ERROR
                                     (ACCURACY)

1      .23E+08       .32E+07         .26E+08     .54
2      .24E+08       .36E+07         .24E+08     .62
3      .24E+08       .56E+07         .34E+08     .39
4      .14E+08       .56E+07         .44E+08     .34

 

 

ACCURACY = PRECISION + RELIABILITY

We will now define all of these terms so that you know how they were computed. Here R = A - F (the error is the actual less the forecast) and N denotes the naive (random walk) forecast.

a) Forecast Mean Deviation (Bias)

The simple average of the errors where bias is the

actual less the forecast.

b) Forecast Mean Percent Error

Expressing the error as a percentage of the actual we

get the percent error. If we average these percentages then

we get the average or mean percent error.

c) Forecast Mean Absolute Deviation

Each bias or error can cancel or offset another. This statistic disables that potential flaw insofar as it computes the average of the absolute errors, disallowing cancellation.

d) Forecast Mean Absolute % Error

If we now take the simple percent errors and take their absolute magnitudes, we can then compute the average or mean absolute percent error.

e) Forecast Variance

The simple sum of squares of the errors around the

average error is taken and averaged. Often called Precision.

f) Forecast Bias Squared

The overall average error is squared to compute this

statistic. This is often called Reliability.

g) Forecast Mean Square Error

The sum of the errors squared and averaged is often

called Accuracy.

h) Relative Absolute Error

Performance vis-a-vis a random walk prediction is often a useful measure. Here we sum the absolute errors from the model and divide by the sum of absolute errors from a random walk model.
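Definitions (a) through (g), together with the identity ACCURACY = PRECISION + RELIABILITY, can be checked with a short sketch. The actual and forecast values are invented for illustration:

```python
# Measures (a)-(g) above, using the convention R = A - F.
actuals   = [397.0, 378.0, 472.0, 370.0]
forecasts = [380.0, 390.0, 450.0, 365.0]
errors = [a - f for a, f in zip(actuals, forecasts)]
n = len(errors)

bias        = sum(errors) / n                                # (a) mean deviation
mpe         = 100.0 * sum(e / a for e, a in zip(errors, actuals)) / n      # (b)
mad         = sum(abs(e) for e in errors) / n                # (c)
mape        = 100.0 * sum(abs(e) / a for e, a in zip(errors, actuals)) / n # (d)
precision   = sum((e - bias) ** 2 for e in errors) / n       # (e) forecast variance
reliability = bias ** 2                                      # (f) bias squared
accuracy    = sum(e ** 2 for e in errors) / n                # (g) mean square error

# The decomposition ACCURACY = PRECISION + RELIABILITY holds exactly:
assert abs(accuracy - (precision + reliability)) < 1e-9
```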

 

Additionally, the user can run a tournament and select the model and set of parameters that minimize an out-of-sample error criterion such as MAPE or BIAS. Furthermore, the selection can be based on either a prediction of all the values up to and including a particular lead time, say 3 periods, or it can be based solely on that individual lead time rather than on the sum of all values up to that lead time.

 

 

 

7.0 QUESTION : WHAT MODEL DIAGNOSTICS DOES AUTOBOX SUPPLY ?

AUTOBOX reports the standard set of model diagnostics as

explained below.

LONG ANSWER:

Number of Residuals (R)        = n
Number of Degrees of Freedom   = n - m
Residual Mean                  = sum(R)/n
Sum of Squares                 = sum(R^2)
Variance                       var = sum(R^2)/n
Adjusted Variance              = sum(R^2)/(n-m)
Standard Deviation             = sqrt(adjusted variance)
Standard Error of the Mean     = standard deviation/sqrt(n-m)
Mean / its Standard Error      = residual mean/[standard deviation/sqrt(n-m)]
Mean Absolute Deviation        = sum|R|/n
AIC Value ( Uses var )         = n*ln(var) + 2m
SBC Value ( Uses var )         = n*ln(var) + m*ln(n)
BIC Value ( Uses var )         = see Wei p. 153
R Square                       = 1 - [ sum(R^2) / sum((A - mean(A))^2) ]
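A sketch of these diagnostics for a hypothetical residual vector (the numbers are invented; the formulas follow the table above, with n residuals and m estimated parameters):

```python
import math

# Diagnostics for an invented residual vector R and actuals A,
# with n residuals and m estimated parameters.
R = [0.3, -0.2, 0.1, -0.4, 0.25, -0.05]
A = [10.0, 11.5, 9.8, 12.1, 10.7, 11.2]
n, m = len(R), 2

sse     = sum(r * r for r in R)                    # Sum of Squares
var     = sse / n                                  # Variance
adj_var = sse / (n - m)                            # Adjusted Variance
mad     = sum(abs(r) for r in R) / n               # Mean Absolute Deviation
aic     = n * math.log(var) + 2 * m                # AIC ( uses var )
sbc     = n * math.log(var) + m * math.log(n)      # SBC ( uses var )
mean_A  = sum(A) / n
r_sq    = 1 - sse / sum((a - mean_A) ** 2 for a in A)   # R Square
```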

 

 

These are some of the tests reported:

1. Model necessity with respect to estimated parameters.

2. Model necessity with respect to invertibility.

3. Automatic fixup with respect to detecting the need for seasonal dummies vis-a-vis seasonal differencing.

4. Model sufficiency fixup with respect to adding ARIMA structure.

5. Model sufficiency fixup with respect to adding outliers (one-time outliers or pulses, seasonal outliers, level shifts and local time trends).

6. Model sufficiency fixup with respect to incorporating weights to remedy variance changes not dependent on level.

7. Model sufficiency fixup with respect to incorporating variance-stabilizing transformations to remedy variance changes tied to level changes.

8. Model changes over time ... detecting break points.

 

9.0 QUESTION : WHAT DATA FORMATS DOES AUTOBOX SUPPORT ?

AUTOBOX supports a fully free-standing and interactive user who delivers data in either text, EXCEL or .dbf form. AUTOBOX can also support a DLL interface providing all information to and from AUTOBOX within a user's application. AUTOBOX provides a range of export facilities providing full details for normal post-processing of forecasts, models, interventions detected and equation forms that can be used in spreadsheets for what-if analysis.

10.0 QUESTION : HOW DOES AUTOBOX ALLOW THE USER TO DEFINE AND USE ANOMALOUS DATA ?

 

The overwhelming strength of AUTOBOX lies in its rich procedures that deal with anomalous data. In particular, the user can specify user-defined criteria to detect anomalous data.

Following are the user controls that allow a user to control the detection of anomalous data points.

Outliers can occur in many ways. They may be the result

of a gross error, for example, a recording or transcript

error. They may also occur by the effect of some exogenous

intervention. These can be described by two different, but

related, generating models discussed by Chang and Tiao (1983) and by Tsay (1986). They are termed the innovational outlier (IO) and additive outlier (AO) models. AUTOBOX uses the AO

approach due to estimation considerations. ARIMA modeling

may be deficient when the series has been intervened with.

This program will test the residuals from the ARIMA model for

possible outlier (intervention) variables.

The automatic intervention detection option

automatically determines the need for intervention variables

using the residuals from an estimated model and automatically

introduces them into the model.

AUTOBOX recognizes that a sequence of "UNUSUAL VALUES" s periods apart (e.g. every December) are not unusual at all and should be treated not as anomalous but as part of the prediction pattern.

AUTOBOX recognizes that a sequence of "UNUSUAL VALUES" that have the same direction and magnitude supports the hypothesis of a level shift and thus is collectively significant. Note that in practice none of the values may in and of themselves be significant, but collectively they suggest a permanent shift.

AUTOBOX recognizes that a sequence of values may be best described by a simple trend line. AUTOBOX can identify the beginning and the end of each trend, where each local trend may have a different slope.

 

Note that anomalous data can sometimes be referred to as an inlier. Consider the sequence

1,9,1,9,1,9,5,9,.....

AUTOBOX would identify "5" as being anomalous even though it sits at the series average.
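A minimal sketch of the inlier idea: the pattern model Y(t) = Y(t-2) is an assumption chosen because this toy sequence supports it, and against that fit the "5" produces the only nonzero residual:

```python
# The "5" in 1,9,1,9,1,9,5,9 sits at the series average, yet it breaks
# the fitted pattern. Against the auto-projective model Y(t) = Y(t-2),
# its residual is the only one that stands out.
series = [1, 9, 1, 9, 1, 9, 5, 9]
residuals = [series[t] - series[t - 2] for t in range(2, len(series))]
flagged = [t for t, r in zip(range(2, len(series)), residuals) if r != 0]
# flagged -> [6], the position of the inlier "5"
```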

 

We show here an extraction from the AFS document

describing AUTOBOX functionality.

 

Line 75

SUFFICIENCY (DETERMINISTIC STRUCTURE ) : | 1

Outliers can occur in many ways. They may be the result of a gross error, for example, a recording or transcript error. They may also occur by the effect of some exogenous intervention. These can be described by two different, but related, generating models discussed by Chang and Tiao (1983) and by Tsay (1986). They are termed the innovational outlier (IO) and additive outlier (AO) models. AUTOBOX uses the AO approach due to estimation considerations. ARIMA modeling may be deficient when the series has been intervened with. This program will test the residuals from the ARIMA model for possible outlier (intervention) variables. We suggest that you modify either your model or your time series for any outlier variables that may be found. The automatic intervention detection option automatically determines the need for intervention variables using the residuals from an estimated model and automatically introduces them into the model.

Line 76

ENABLE THE OUTLIER TEST PRIOR TO ARIMA | 1

Classical INTERVENTION DETECTION required the initial ARIMA filter which would be used to pre-filter the series prior to detecting pulse, seasonal pulse, level shifts or local time trends. There are considerable cases in the literature which would suggest a regression against fixed dummy variables before the ARIMA identification. If this switch is enabled these dummy variables are identified from the original series.

Line 77

CONFIDENCE LEVEL FOR SUFFICIENCY (DS) |90.0

If you select the outlier detection option, then you must specify the confidence limit to be used for detecting possible outlier variables. For example, .80 indicates that the program should identify all outliers that are significant at the 80% level.

Line 78

MAXIMUM NUMBER OF OUTLIERS TO BE IDENTIFIED | 5

You may elect to limit AUTOBOX to a certain number of empirically identified outliers. As delivered, the standard product is limited to a maximum of 5 input series in a transfer function thus this integer can not exceed that limit. AFS sells larger versions which allow up to 19 input series. This feature allows the user to control the incorporation of potentially spurious interventions leading to numerical instability.

Line 79

INCLUDE PULSE VARIABLES | 1

Select "1" to include pulse interventions.

Line 80

ENABLE DYNAMIC PULSE TEST | 1

If pulses arise at consecutive points in time this may be an indicator that a transient intervention variable is more appropriate. If this is enabled and if consecutive pulses are identified then and only then will this possible model re-statement be made.

Line 81

INCLUDE STEP VARIABLES | 1

Choose "1" to include step interventions.

Line 82

MINIMUM NUMBER OF OBSERVATIONS IN GROUP | 2

The number entered determines how many successive values must be on a different level before Autobox will consider there to be a level shift.

Line 83

INCLUDE SEASONAL PULSE VARIABLES | 1

Choose "1" to include seasonal pulse interventions.

Line 84

INCLUDE LOCAL TRENDS | 0

Choose "1" for Autobox to identify multiple trends.

Line 85

INCLUDE HIDDEN SEASONAL PULSE VAR 0=N 1=Y | 0

Choose "1" to activate this function.

Standard Seasonal Pulse detection uses the user specified periodicity as the key to identifying variables of this form:

SEASONALITY (eg. 12) | 12

Thus the program will limit its search to seasonal pulse variables that have a pattern of 11 0's and then a 1. The program tries all estimable candidates. If you have a series in which you are trying to find hidden deterministic structure, this approach may be insufficient. By enabling this option, the program will attempt to go beyond what might be the expected pattern interval and detect deterministic structure other than the norm. Evidence of patterns not consistent with the expected pattern might motivate further stochastic model investigation by a skilled time series analyst.

Line 86

MAXIMUM INTERVAL TO SCAN |

If you elected to "INCLUDE HIDDEN SEASONAL PULSE VARIABLES" then you must specify the maximum length of the pattern. The upper limit on this integer is 1/2 the length of the series. This input can significantly affect runtime length. The conventional search for a seasonal intervention as empowered by "Include seasonal pulse variables (YES/NO)" performs an exhaustive set of regressions to determine which variable (i.e. start period) is best. For example, if you had 100 observations and the data had a known periodicity ("Minor periods per major time interval (eg. 12)") of 12, then 100-12, or 88, models would be estimated. Each model would have a different candidate series. Each of these 88 possible intervention series would have a common seasonal pattern, a one followed by eleven zeroes. This implies that the user knows 'a priori' that this was the case. Consider cases where you wish "to search for hidden periodicities" and wish to allow the seasonal pattern to vary, for example in the range 2 to 12. This would imply that 88 + 89 + 90 + ... + 98 models would have to be tried, or 11*(88+98)/2 = 1023. The compute time can be large and evidenced cycles may be difficult to explain as they might reflect the hidden variable omitted from the model. Omitted variables can create what might be considered unusual lag structure or cyclical intervention variables that act as surrogates for the ever-popular unknown series, which may be uncollectible or unknown to the modeler.
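The candidate-count arithmetic quoted above checks out directly:

```python
# With 100 observations and hidden seasonal patterns of length 2 through 12,
# each pattern length s yields 100 - s candidate start periods,
# i.e. 88 + 89 + ... + 98 candidate regressions in total.
n_obs = 100
total = sum(n_obs - s for s in range(2, 13))
# total == 1023, matching 11*(88+98)/2
```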

Line 87

ENABLE AUTOMATIC FIXUP FOR SEASONAL DUMMIES | 0

Choose "1" to enable this option to test for the presence of a SEASONAL DETERMINISTIC VARIABLE which has a zero/one pattern according to the following:

a "1" in the corresponding period and a "0" in other periods

The formal test is outlined in Franses' paper in the International Journal of Forecasting, July 1991, pp. 199-208 (see the help for the associated Confidence value).

Line 88

CONFIDENCE LEVEL FOR SEASONAL DUMMIES |95.0

If you elected to turn the Seasonal Dummy Test on, then you have the option of specifying the confidence level value that will be used to determine the significance of a parameter. For example, 1.96 indicates that the program will replace a stochastic seasonal difference factor with a set of seasonal dummies. Essentially the null hypothesis is that seasonal differences are appropriate. If one of the S roots is not significant then the seasonal differencing operator is replaced by a seasonal dummy requiring an initial S-1 parameters. If the first root is not significant then regular differences will be included. Another aspect of this test is the possible identification of a linear trend series. For more information see Franses (1991).

 

 

11.0 QUESTION : HOW DOES AUTOBOX IDENTIFY ANOMALOUS DATA AND SCRUB THE DATA?

AUTOBOX has implemented techniques proposed by Tiao, Tsay and others. Additionally, AUTOBOX has extended INTERVENTION DETECTION to include local time trends.

 

AUTOBOX first builds a model and then examines the evidenced errors for abnormal residuals that could be "explained" by Pulses, Seasonal Pulses, Level Shifts and Local Time Trends. AUTOBOX removes outliers based on abnormal residuals to a fitted model, not necessarily simple but containing only significant parameters. The OUTLIER DETECTION scheme does not employ a simple +/- range of standardized residuals to determine the unusual, because AUTOBOX not only tests for PULSES or one-time events but also considers the "group effect" that might arise from a sequence of contiguous outliers ... each individually being non-significant but collectively quite significant. This arises quite naturally in level shifts or local time trends.

 

To understand the SEARCH mechanism that AUTOBOX uses, you might consider it analogous to stepwise forward procedures insofar as it searches the sample space, i.e. the possible solutions, and determines the variable that would, if included or added, significantly improve the model by correcting an evidenced or proven Gaussian violation.

AUTOBOX recomputes model residuals after each "fix" or model augmentation, thus eliminating the bias caused by previously identified anomalies.

 

The heart of the matter is the General Linear model

where Y is the variable to be predicted and the form of the X

matrix is to be determined empirically. Consider a case

where the noise process is uncorrelated. One could construct

simple regression models, where the trial X variable was

initially 1,0,0,0,0,0,0.....0,0,0,0 and evaluate the error

sum of squares. You could then create yet a second model

where the X variable was 0,1,0,0,0,0,0,0,0,0,0.......0,0,0,0

and evaluate it in terms of the resultant error sum of

squares. If you had 100 observations you would then run 100

regressions and the regression that generated the smallest

error sum of squares would then be the MAXIMUM LIKELIHOOD

ESTIMATE. This process would be repeated until the most

important NEW variable was not statistically significant.

This is essentially STEP-WISE forward regression, where the X variable is found by a search method. Two important generalizations are needed before you go running off to build our next competitor: 1. The search for the X variable has to be extended to include SEASONAL PULSES, LEVEL SHIFTS and TIME TRENDS, and 2. The error process may be other than white noise, thus one has to iteratively construct TRANSFER FUNCTIONS rather than multiple regressions. The process gets a little more sticky when you have pre-defined user inputs in the model.
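The search loop just described can be sketched as follows. This is an illustrative stand-in, not the AUTOBOX implementation: the noise is assumed white, the base model is a simple mean, and the series is invented:

```python
# Stepwise pulse search: try a pulse dummy at every time point, refit,
# and keep the candidate that minimizes the error sum of squares --
# the maximum likelihood choice under white noise.

def sse_with_pulse(y, t):
    """SSE of a mean-plus-pulse model: the pulse dummy at t absorbs
    y[t] exactly, so the mean is fit on the remaining points."""
    rest = [v for i, v in enumerate(y) if i != t]
    mu = sum(rest) / len(rest)
    return sum((v - mu) ** 2 for v in rest)

def best_pulse(y):
    """Return (smallest SSE, pulse location) over all candidate positions."""
    return min((sse_with_pulse(y, t), t) for t in range(len(y)))

y = [50, 52, 49, 51, 90, 50, 48, 51]   # one gross recording error at t = 4
sse, where = best_pulse(y)
# where -> 4: the pulse at the outlier gives the biggest SSE reduction
```

Extending the candidate set to seasonal pulses, level shifts and local trends, and replacing the mean model with a transfer function, gives the flavor of the full procedure.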

 

 

Outliers and structure changes are commonly encountered in time series data analysis. The presence of these extraordinary events could mislead, and has misled, conventional time series analysts, resulting in erroneous conclusions. The impact of these events is often overlooked, however, for lack of a simple yet effective means to incorporate these isolated events.

the literature for handling outliers in a time series. We

will first illustrate the effect of unknown events which

cause simple model identification to go awry. We will then

illustrate what to do in the case when one knows a priori

about the date and nature of the isolated event. We will

also point out a major flaw when one assumes an incorrect

model specification. Then we introduce the notion of finding

the intervention variables through a sequence of alternative

regression models yielding maximum likelihood estimates of

both the form and the effect of the isolated event. Standard

identification of Arima models uses the sample ACF as one of

the two vehicles for model identification. The ACF is

computed using the covariance and the variance. An outlier

distorts both of these and in effect dampens the ACF, since

it inflates the variance more than the covariance. Another

problem with outliers is

that they can distort the sample ACF and PACF by introducing

spurious structure or correlations. For example, consider the

circumstance where the outlier dampens the ACF:

ACF = COVARIANCE/VARIANCE

Thus the net effect is to conclude that the ACF is flat;

and the resulting conclusion is that no information from the

past is useful. These are the results of incorrectly using

statistics without validating the parametric requirements.

It is necessary to check that no isolated event has inflated

either of these measures leading to an "Alice in Wonderland"

conclusion. Various researchers have concluded that the

history of stock market prices is information-less. Perhaps

the conclusion should have been that the analysts were

statistic-less. Another way to understand this is to derive

the estimator of the coefficient from a simple model and to

evaluate the effect of a distortion. Consider the true model

as an AR(1) with the following

familiar form:

[ 1 - PHI1 B ] Y(t) = A(t) or Y(t) = A(t)/[ 1 - PHI1 B ]

[ 1 - PHI1 B ] Y(t) = A(t) or Y(t) = PHI1 Y(t-1) + A(t)

The variance of Y can be derived as: variance(Y) =

PHI1*PHI1 variance(Y) + variance(A) thus

PHI1 = SQRT( 1 - variance(A)/variance(Y) )

Now if the true state of nature is where an intervention of form I(t) occurs

at time period t with a magnitude of W we have:

Y(t) = {A(t)/[ 1 - PHI1 B ]}+ W I(t)

 

with variance(Y) = [PHI1*PHI1 variance(Y) + variance(A)] +

[W I(t)]*[W I(t)] = true variance(Y) + distortion, thus

PHI1 = SQRT( 1 - [var(A) + [W I(t)]*[W I(t)]]/variance(Y) )

 

The inaccuracy or bias due to the intervention is not

predictable due to the complex nature of the relationship.

At one extreme the addition of the squared bias to

variance(A) would increase the numerator and drive the ratio

to 1 and the estimate of PHI1 to zero. The rate at which

this happens depends on the relative size of the variances

and the magnitude and duration of the isolated event. Thus

the presence of an outlier could hide the true model. Now

consider another option where the variance(Y) is large

relative to variance(A). The effect of the bias is to drive

the ratio to zero and the estimate of PHI1 to unity. A shift

in the mean would generate an ACF that did not die out slowly

thus leading to a misidentified first difference model. In

conclusion the effects of the outlier depend on the true

state of nature. It can both incorrectly hide model form and

incorrectly generate evidence of a bogus model.
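
The damping effect on the estimate of PHI1 is easy to demonstrate numerically. The following is an illustrative Python/NumPy sketch, not AUTOBOX code; the seed, sample size and outlier magnitude are arbitrary choices of ours:

```python
import numpy as np

def lag1_corr(y):
    """Lag-1 autocorrelation: lagged covariance over the variance."""
    y = y - y.mean()
    return np.sum(y[1:] * y[:-1]) / np.sum(y * y)

rng = np.random.default_rng(1)
n, phi = 500, 0.8
a = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):                    # AR(1): Y(t) = PHI1 Y(t-1) + A(t)
    y[t] = phi * y[t - 1] + a[t]

clean = lag1_corr(y)                     # near the true PHI1 = 0.8
y_out = y.copy()
y_out[250] += 30.0                       # one large additive outlier
dirty = lag1_corr(y_out)                 # damped toward zero
```

One pulse inflates the variance by roughly W*W/n while barely touching the lagged covariance, so the estimated PHI1 shrinks and the true AR(1) structure is hidden.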

 

These outliers were represented as intervention

variables of the forms: pulse, level shifts and seasonal

pulses. The procedure for detecting the outlier variables is

as follows. Develop the appropriate ARIMA model for the

series. Test the hypothesis that there is an outlier via a

series of regressions at each time period. Modify the

residuals for any potential outlier and repeat the search

until all possible outliers are discovered. These outliers

can then be included as intervention variables in a multiple

input B-J model. The noise model can be identified from the

original series modified for the outliers. AFS has extended

outlier detection to detecting the presence of local time

trends.
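
The detect-modify-repeat loop above can be sketched as follows, assuming the ARIMA stage has already reduced the series to approximately white-noise residuals. This is a hypothetical Python/NumPy simplification; detect_pulses and its crude scale estimate are our own, not the AUTOBOX algorithm:

```python
import numpy as np

def detect_pulses(e, c=3.5):
    """Iterative pulse search on model residuals e(t): flag the point
    with the largest standardized magnitude, adjust it out of the
    residuals, and repeat until nothing exceeds the critical value C
    (values of 3.0-4.0 are typical)."""
    e = np.asarray(e, dtype=float).copy()
    found = []
    while True:
        s = np.sqrt(np.mean(e ** 2))     # white-noise scale estimate
        stats = np.abs(e) / s
        t0 = int(np.argmax(stats))
        if stats[t0] < c:
            return found
        found.append(t0)
        e[t0] = 0.0                      # modify residuals for the pulse

rng = np.random.default_rng(2)
e = rng.normal(size=200)
e[[30, 120]] += [9.0, -7.0]              # two hidden isolated events
hits = detect_pulses(e)                  # both time points are flagged
```

Recomputing the scale after each adjustment is the point: a fixed scale estimate stays inflated by the outliers already found and masks the remaining ones.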

This option to the program provides a more complete

method for the development of a model to forecast a

univariate time series. The basic premise is that a

univariate time series may not be homogeneous and, therefore,

the modeling procedure should account for this. By

homogeneous, we mean that the underlying noise process of a

univariate time series is random about a constant mean. If a

series is not homogeneous, then the process driving the

series has undergone a change in structure and an ARIMA model

is not sufficient. The AUTOBOX heuristic that is in place

checks the series for homogeneity and modifies the model if

it finds any such changes in structure. The point is that it

is necessary for the mean of the residuals to be close enough

to zero so that it can be assumed to be zero for all intents

and purposes. That requirement is necessary but it is not

sufficient. The mean of the errors (residuals) must be near

zero for all time slices or sections. This is a more

stringent requirement for model adequacy and is at the heart

of intervention detection. Note that some inferior

forecasting programs use standardized residuals as the

vehicle for identifying outliers. This is inadequate when

the ARIMA model is non-null. Consider the case where the

observed series exhibits a change in level at a particular

point in time.

If you try to identify outliers or interventions in this

series via classical standardized residuals you get one

outlier or one unusual value. The problem is that if you

"fix" the bad observation at the identified time point, the

subsequent value is identified as an outlier due to the

recursive process. The simple-minded approach of utilizing

standardized residuals is in effect identification of

innovative outliers and not additive outliers.
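
The failure mode just described is easy to reproduce. In this illustrative Python/NumPy sketch (shift size and date are arbitrary), the "model" is a naive first difference, so the residuals are the period-to-period changes:

```python
import numpy as np

rng = np.random.default_rng(5)
n, t0, shift = 100, 50, 12.0
y = rng.normal(size=n) + shift * (np.arange(n) >= t0)   # level shift at t0

resid = np.diff(y)                       # residuals of a naive I(1) model
first_hit = int(np.argmax(np.abs(resid))) + 1           # flags only t0

y_fixed = y.copy()
y_fixed[t0] -= shift                     # "fix" the one flagged point
resid2 = np.diff(y_fixed)
second_hit = int(np.argmax(np.abs(resid2))) + 1         # now flags t0 + 1
```

Patching the single flagged observation merely moves the large residual one step forward: the level shift must be modeled as a step intervention, not as a chain of one-point fixes.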

The logic behind the automatic intervention procedure

has its roots in the technique proposed by Chang and Tiao

(1983) and programmed by Bell (1983). It starts by

developing an ARIMA model for the univariate time series

(using the automatic ARIMA algorithm). A series of

regressions on the residuals from the ARIMA model checks for

any underlying changes in structure. If the series is found

to be homogeneous, then the ARIMA model is used to forecast.

If the series is found to be nonhomogeneous, then the various

changes in structure are represented in a transfer function

model by dummy (intervention) input variables and the ARIMA

model becomes the tentative noise model. The program then

estimates the transfer function-noise model and performs all

of the diagnostic checks for sufficiency, necessity and

invertibility. The model is updated as needed, and the

diagnostic checking stage ends when all of the criteria for

an acceptable model are met. The final step is to generate

the forecast values. The user controls the level of detail

that the output report is to contain, as well as some key

options for modeling precision (lambda search and

backcasting, for example). The user can also elect to have

this process start with an examination of the original time

series. This may be necessary for those cases where the

series is overwhelmingly influenced by outlier variables.

We now present a summary of the mathematical properties

underlying this procedure. This is taken from the Downing

and McLaughlin (1986) paper (with permission!). For purposes

of this discussion, we present, in their notation, the

following equation, which is the general ARIMA model:

P(B) (N(t) - MEAN) = CONSTANT + T(B) A(t) (eq. 1)

where N(t) = the discrete time series, MEAN = the average of the

time series, P(B) = the autoregressive factor(s), CONSTANT = the

deterministic trend, T(B) = the moving average factor(s),

A(t) = the noise series, and B = the backshift operator.

Outliers can occur in many ways. They may be the result

of a gross error, for example, a recording or transcription

error. They may also occur by the effect of some exogenous

intervention. These can be described by two different, but

related, generating models discussed by Chang and Tiao (1983)

and by Tsay (1986). They are termed the innovational outlier

(IO) and additive outlier (AO) models. An additive outlier

can be defined as,

Y(t) = N(t) + W E(to) (eq. 2)

while an innovational outlier is defined as,

Y(t) = N(t) + [P(B)/T(B)] W E(to) (eq. 3)

where Y(t) = the observed time series, n in length, W = the

magnitude of the outlier, and E(to) = 1 if t = to, 0 if t <> to;

that is, E(to) is a time indicator signifying the time of

occurrence to of the outlier, and N(t) is an unobservable

outlier-free time series that follows the model given by

(eq. 1). Expressing Equation (eq. 2) in terms of the white noise

series A(t) in Equation (eq. 1), we find that for the AO model

Y(t) = [T(B)/P(B)] A(t) + W E(to), (eq. 4)

while for the IO model

Y(t) = [T(B)/P(B)][ A(t) + W E(to) ], (eq. 5)

Equation (eq. 4) indicates that the additive outlier

appears as simply a level change in the to-th observation and

is described as a "gross error" model by Tiao (1985). The

innovational outlier represents an extraordinary shock at

time period to since it influences observations Y(to),

Y(to+1), ... through the memory of the system described by

T(B)/P(B).

The reader should note that the residual outlier

analysis as conducted in the course of diagnostic checking is

an AO type. Also note that AO and IO models are relatable.

In other words, a single IO model is equivalent to a

potentially infinite AO model and vice versa. To demonstrate

this, we expand equation (eq.5) to

Y(t) = [T(B)/P(B)] A(t) + [T(B)/P(B)] W E(to) , (eq. 6)

and then express (eq. 6) in terms of (eq. 4)

Y(t) = [T(B)/P(B)] A(t) + WW E(to) , (eq. 7)

where WW = [T(B)/P(B)] W .
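
The AO and IO generating models can be contrasted with a small simulation. This is an illustrative Python/NumPy sketch assuming the simplest case T(B) = 1, P(B) = 1 - PHI1 B; the names and values are our own:

```python
import numpy as np

# AR(1) noise: N(t) = A(t)/[1 - PHI1 B], i.e. T(B) = 1, P(B) = 1 - PHI1 B
rng = np.random.default_rng(3)
n, phi, t0, w = 120, 0.6, 60, 10.0
a = rng.normal(size=n)
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = phi * noise[t - 1] + a[t]

pulse = np.zeros(n)
pulse[t0] = 1.0                          # E(to): 1 at t0, 0 elsewhere

# Additive outlier (eq. 4): only the to-th observation moves.
y_ao = noise + w * pulse

# Innovational outlier (eq. 5): the shock enters with the innovations
# and is propagated through the memory of the system T(B)/P(B).
y_io = np.zeros(n)
for t in range(1, n):
    y_io[t] = phi * y_io[t - 1] + a[t] + w * pulse[t]

# Eq. 7: the IO is equivalent to AO effects with decaying weights
# WW = W * phi**k for k = 0, 1, 2, ...
io_effect = y_io - noise                 # zero before t0, W*phi**k after
```

The difference series io_effect shows the equivalence in eq. 7 directly: a single innovational shock acts like a potentially infinite train of additive effects weighted by the system memory.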

Due to estimation considerations, the following

discussion will be concerned with the additive outlier case

only. Those interested in the estimation, testing, and

subsequent adjustment for innovative outliers should read

Tsay (1986). Note that while the above models indicate a

single outlier, in practice several outliers may be present.

The estimation of the AO can be obtained by forming

II(B) = [P(B)/T(B)] (eq. 8)

and calculating the residuals E(t) by

E(t) = II(B) Y(t) (eq. 9)

= II(B)[ [T(B)/P(B)] A(t) + W E(to) ]

= A(t) + W II(B) E(to) .

By least squares theory, the magnitude W of the additive

outlier can be estimated by

EST of W(to) = ETA*ETA II(B) E(to) (eq. 10)

where ETA*ETA is the reciprocal of the sum of the squared

weights of II(B). The variance of W(to) is given by:

Var(W(to)) = ETA*ETA var(A) (eq. 11)

where var(A) is the variance of the white noise process

A(t) .

Based on the above results, Chang and Tiao (1983)

proposed the following test statistic for outlier detection:

LAMBDA(to) = EST W(to) / ( ETA sqrt(var(A)) ). (eq. 12)

If the null hypothesis of no outlier is true, then LAMBDA(to)

has the standard normal distribution. Usually, in practice

the true parameters II(B) and var(A) are unknown, but consistent

estimates exist. Even more important is the fact that to,

the time of the outlier, is unknown, but every time point may

be checked. In this case one uses the statistic:

LAMBDA = the maximum absolute value of LAMBDA(to) for to from 1 to

n (eq. 13), and one declares an outlier at time to if the maximum

occurs at to and is greater than some critical value C. Chang

and Tiao (1983) suggest values of 3.0, 3.5 and 4.0 for C.
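
A bare-bones version of this scan might look as follows. This is illustrative Python/NumPy only; ao_lambda is our name, and for simplicity the statistic is standardized by the truncated sum of squared pi weights at each candidate point rather than by the exact Chang-Tiao expression:

```python
import numpy as np

def ao_lambda(e, pi_weights, sigma_a):
    """AO test statistic at every candidate time to: regress the
    residuals on the pulse pattern implied by the pi weights of
    II(B) and standardize the estimated magnitude W."""
    n = len(e)
    lam = np.zeros(n)
    for t0 in range(n):
        x = np.zeros(n)
        m = min(len(pi_weights), n - t0)
        x[t0:t0 + m] = pi_weights[:m]    # II(B) E(to) pattern
        sxx = np.sum(x * x)
        w_hat = np.sum(e * x) / sxx      # least-squares estimate of W
        lam[t0] = w_hat / (sigma_a / np.sqrt(sxx))
    return lam

# For an AR(1) with PHI1 = 0.5 and T(B) = 1, the pi weights of
# II(B) = P(B)/T(B) are simply (1, -PHI1).
rng = np.random.default_rng(4)
e = rng.normal(size=150)
pi_w = np.array([1.0, -0.5])
e += 8.0 * np.concatenate([np.zeros(70), pi_w, np.zeros(78)])  # AO at t = 70
lam = ao_lambda(e, pi_w, sigma_a=1.0)    # peaks sharply at t = 70
```

Under the null each LAMBDA(to) is approximately standard normal, so comparing the maximum against a critical value C of 3.0 to 4.0 controls false detections across the whole scan.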

The outlier model given by Equation (eq. 4) indicates a

pulse change in the series at time to. A step change can

also be modeled

simply by replacing E(to) with S(to) where:

S(to) = 1 if t is greater than or equal to to (eq. 14)

0 otherwise

We note that (1-B)S(to) = E(to) . Using S(to) one can

apply least squares to estimate the step change and perform

the same tests of hypothesis reflected in Equations (eq. 12)

and (eq. 13). In this way, significant pulse and/or step

changes in the time series can be detected.
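
The identity (1-B)S(to) = E(to) is easy to verify numerically; the following is a trivial Python/NumPy check with an arbitrary series length and step date:

```python
import numpy as np

n, t0 = 10, 4
s = (np.arange(n) >= t0).astype(float)   # step S(to): 1 for t >= t0
pulse = np.diff(s, prepend=0.0)          # apply (1 - B) to the step
# The differenced step is exactly the pulse E(to): 1 at t0, 0 elsewhere.
```

This is why the same least-squares machinery and the tests in (eq. 12) and (eq. 13) carry over from pulses to step changes unchanged.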

A straightforward extension of this approach to transfer

functions has also been introduced in this version of

AUTOBOX. This, of course, implies that the outliers or

interventions are identified not only on the basis of the

noise filter but also on the form and nature of the individual

transfer functions.

 

 

12.0 QUESTION : HOW DOES AUTOBOX REPORT ITS RESULTS ? CAN A USER SELECT

THE AMOUNT OF DETAIL ?

Yes ! We show here an extract from the AFS document

describing AUTOBOX reporting options.

 

OUTPUT OPTIONS: |

Line 109

DISPLAY IDENTIFICATION INFORMATION | 1

If no detail is required in the initial identification process, send a "no".

Autocorrelation is a measure of the unconditional dependence that exists between observations in a time series that are separated by a particular time interval, called the lag. The value of the autocorrelation lies between +1 and -1. The closer the autocorrelation is to +1 or -1, the more highly correlated are the observations separated by the particular lag being considered. In summary, the autocorrelation measures the unconditional relationship between lags.

Partial autocorrelation is a measure of the conditional dependence that exists between observations in a time series that are separated by a particular time interval. The value of the partial autocorrelation lies between +1 and -1 and is evaluated just like the ACF. In summary, the partial autocorrelation measures the conditional correlation between lags.

Cross correlation is a measure of the dependence that exists between observations in two time series that are separated by a particular time interval, called the lag. The value of the cross correlation lies between +1 and -1. The closer the cross correlation is to +1 or -1, the more highly correlated are the observations separated by the particular lag being considered. If the correlation is closer to +1, a positive correlation is indicated; if it is closer to -1, a negative correlation exists. In summary, the cross correlation measures the strength of the relationship between the lags of two time series.

Line 110

DISPLAY TIME SERIES GRAPH | 1

Choose "1" so that Autobox will provide a text plot of the series.

Line 111

DISPLAY ACF TABLE DURING IDENTIFICATION | 1

Choose "1" to see a table of the correlations at the initial identification stage. The table displays rows of correlations and their standard errors.

Line 112

DISPLAY ACF GRAPH DURING IDENTIFICATION | 1

Choose "1" to see a plot of the correlations at the INITIAL identification stage.

 

Line 113

DISPLAY CCF TABLE DURING IDENTIFICATION | 1

Choose "1" to see the cross-correlations between the prewhitened input and the prewhitened output series. This information is a statistical 'tool' used to identify the form of a transfer model. This option allows you to control whether or not they get reported in a table.

Line 114

DISPLAY CCF GRAPH DURING IDENTIFICATION | 1

Choose "1" to see the cross-correlations between the prewhitened input and the prewhitened output series. This information is a statistical 'tool' used to identify the form of a transfer model. This option allows you to control whether or not they get reported in a plot.

Line 115

DISPLAY PREWHITENING MODEL(S) | 1

Choose "1" to see the model form as a table.

Line 116

DISPLAY PREWHITENING MODEL(S) EQUATION | 1

Choose "1" if you want the program to display the model(s) in the form of an equation.

Line 117

DISPLAY IDENTIFIED MODEL | 1

Choose "1" if you wish to see the model form as a table.

Line 118

DISPLAY IDENTIFIED MODEL EQUATION | 1

Choose "1" if you want the program to display the model(s) in the form of an equation.

Line 119

DISPLAY ESTIMATION INFORMATION | 1

If this is set to "0" then no estimation information will be reported. This means that lines 123-128 would be skipped.

 

 

 

 

Line 120

DISPLAY ESTIMATED MODEL PARAMETERS | 1

Choose "1" if you wish to see the model form as a table.

Line 121

DISPLAY ESTIMATED MODEL EQUATION | 1

Choose "1" if you want the program to display the model(s) in the form of an equation.

Line 122

DISPLAY PARAMETER CORRELATION MATRIX | 1

Choose "1" if you wish to see the Parameter correlation matrix.

Line 123

DISPLAY TRANSFORMED FIT VS ACTUAL TABLE | 1

Your forecasting model may contain a transformation parameter (lambda). Choose "1" so that the program generates fit values and errors in the transformed metric. You may select to have the program display a chart which shows the fit values, the residual values and the actual values from the estimated model.

Line 124

DISPLAY UNTRANSFORMED FIT VS ACTUAL TABLE | 1

Choose "1" so that the program displays a chart which shows the fit values, the residual values and the actual values from the estimated model.

Line 125

DISPLAY DIAGNOSTIC CHECKING INFORMATION | 1

Choose "1" to show detail regarding the diagnostic checking process.

Line 126

DISPLAY RESIDUAL ACF TABLE | 1

Choose "1" to see a table of the residual correlations each time that they are computed. The table displays rows of correlations and their standard errors.

 

Line 127

DISPLAY RESIDUAL ACF GRAPH | 1

Choose "1" to see a plot of the residual correlations each time that they are computed.

Line 128

DISPLAY RESIDUAL CCF TABLE | 1

Choose "1" to have Autobox display a table of the cross-correlations between the prewhitened input and the residuals from the current model. These correlations are the statistical 'tool' used to identify the form of any fixup required to the transfer model.

Line 129

DISPLAY RESIDUAL CCF GRAPH | 1

Choose "1" to have Autobox display a plot of the cross-correlations between the prewhitened input and the residuals from the current model. These correlations are the statistical 'tool' used to identify the form of any fixup required to the transfer model.

Line 130

DISPLAY NECESSITY TEST RESULTS | 1

Choose "1" to see the necessity test results. ARIMA modeling may be deficient when the model has too many coefficients. It is important to discard or delete unnecessary structure as it inflates forecast variances, among other things.

Line 131

DISPLAY INVERTIBILITY TEST RESULTS | 1

Choose "1" to see the invertibility test results. ARIMA modeling may be deficient when the model has a non-invertible structure. It is important to discard or delete this structure either by replacing it with differencing or by model restatement.

Line 132

DISPLAY SUFFICIENCY TEST RESULTS | 1

Choose "1" to see the sufficiency test results. ARIMA modeling may be deficient when the MODEL does not have enough structure. The omitted structure can be identified by studying the sample ACF AND PACF of the residuals. In this way we move structure from the residuals to the model.

Line 133

DISPLAY VARIANCE STABILITY TEST RESULTS | 0

Choose "1" to see the variance stability results. ARIMA modeling may be deficient when the series has a non-constant variance. The program will test the residuals from the ARIMA model for possible change points. Essentially interventions are changes in the mean level of the errors while variance stability measures changes in the variance.

Line 134

DISPLAY GRAPH OF RESIDUALS AT EACH STAGE | 0

Choose "1" to see displays of the steps in the variance stability test.

Line 135

DISPLAY WEIGHTS FOR STABILIZING THE VARIANCE | 0

Choose "1" to enable this function. The vector of weights is reported based upon the variance stability test. These weights represent the "degree of belief" that one has in a reading or observation. They are relative to each other and provide a way to utilize observations that may have been recorded with different precision.

This function will enable the printing of a separate table of these weights. In addition a disk file called Weights.Out will be prepared. This output file can then be renamed to Weights.In if the user wishes to re-use them in a later session.

Line 136

DISPLAY OUTLIER TEST RESULTS | 1

Choose "1" to enable this function. ARIMA modeling may be deficient when the series has been subject to interventions. This program will test the residuals from the ARIMA model for possible outlier (intervention) variables. We suggest that you modify either your model or your time series for any outlier variables that may be found. If you have enabled automatic fixup for outliers in the choose analysis options section, then these modifications will be done for you automatically. Choosing "1" also shows the details of this process.

Line 137

DISPLAY CONSTANCY TEST | 0

Choose "1" to see the constancy test results. You get a table showing the observations and those values with a significant change in the reliability of the model parameters.

Line 138

DISPLAY FORECASTING INFORMATION | 1

If this is set to "0" then no forecasting information will be reported. This means that lines 143-150 would be skipped.

 

 

 

 

Line 139

DISPLAY MODEL STATISTICS | 1

Choose "1" to see the model statistics (R squared, etc.)

Line 140

DISPLAY FORECAST MODEL PARAMETERS | 1

Choose "1" to see the model form as a table.

Line 141

DISPLAY FORECAST MODEL EQUATION | 1

Choose "1" if you want the program to display the model(s) in the form of an equation.

Line 142

DISPLAY MODEL IN ITS AUTOREGRESSIVE FORM | 0

Choose "1" to see the model displayed as a re-stated pure right-hand side equation (i.e. a distributed lag model). This is useful for model interpretation.

Line 143

DISPLAY TABLE OF TRANSFORMED FORECAST VALUES | 1

Choose "1" to see the forecast in transformed units. Your forecasting model may contain a transformation parameter (lambda). If so, then the program generates forecast values for both the original data and the transformed data.

Line 144

DISPLAY TABLE OF FORECAST VALUES | 1

Choose "1" to see the forecasts with their confidence bounds. Your forecasting model may contain a transformation parameter (lambda). If so, then the program generates forecast values for both the original data and the transformed data.

Line 145

DISPLAY THE INPUT SERIES FORECAST VALUES | 0

Choose "1" to see the values of the input (if any) series. This option is only valid for Transfer Functions.

 

 

 

 

Line 146

DISPLAY GRAPH OF ACTUAL AND FORECAST VALUES | 1

Choose "1" to get a text plot of the forecasts and the actuals.

Line 147

DISPLAY SIMULATED DATA | 0

Choose "1" to display the simulated series.

Line 148

STORE MODEL FORM | 0

Choose "1" to save the model. By saving the model form, you can retrieve it later in order to make a forecast. Some prefer not to remodel after every new observation due to system limitations, but we recommend remodeling at every new data point to capture changes in the process immediately. You need to have a "1" on line 155 for this option to work.

Line 149

OUTLIER SERIES (I~) | 1

Choose "1" to save the outlier series. By saving the outliers to a model, you can retrieve it along with the model form later in order to make a forecast. Some prefer not to remodel after every new observation due to system limitations, but we recommend remodeling at every new data point to capture changes in the process immediately. You need to have a "1" on line 154 for this option to work.

Line 150

RESIDUAL SERIES(MODEL) (_R) | 0

Choose "1" to save the residuals from the modeling process. These values are helpful in understanding how well the model "fit" the data.

Line 151

ESTIMATED/FIT SERIES (_E) | 0

Choose "1" to save the "fit" values from the modeling process. These values are the model's attempt to match the actual observations.

Line 152

FORECAST SERIES (_F) | 1

Choose "1" to save the forecast values from the modeling process. These values are what the model expects future values to be.

Line 153

FORECAST SERIES (_L,_U) | 0

Choose "1" to save the confidence bounds around the forecast values from the modeling process. These values are what the model expects the best and worst case scenarios of the future values to be.

Line 154

MOD & DIFF SERIES (_M & _D) | 0

Choose "1" to save the modified and difference series. The modified series is the original time series cleansed of outliers. The difference series is the original observations minus the modified series; it shows the net effect of the outliers on the data.

Line 155

DISPLAY MANAGEMENT ANALYSIS | 0

Choose "1" if you want a report that tries to summarize, in English, information about the time series from the model used to fit the data.
