1.0 QUESTION : PLEASE PROVIDE AN OVERVIEW OF AUTOBOX'S FORECASTING TECHNIQUES.
AUTOBOX handles a single endogenous equation
incorporating either pre-identified causal series or
empirically identified dummy series which are found to be
statistically significant. The set of pre-identified series
can be either stochastic or deterministic (dummy) in form.
In its search for the most appropriate model form and the
optimal set of parameters, the program can either:
1. Proceed purely empirically, or
2. Start from a user-supplied model.
A final model may require one or more of the following
structures:
1. Power transforms such as log, square root, or
reciprocal.
2. Variance stabilization due to deterministic changes
in the background error variance.
3. Data segmentation or splitting as evidenced by a
statistically significant change in either model form or
parameters.
En route to its tour de force, AUTOBOX evaluates
numerous candidate models and parameter sets suggested by
the data itself. In practice, a realistic limit is set on
the maximum number of model-form iterations. The exact
specifics of each tentative model are not pre-set; this is
where the power of AUTOBOX emerges. The kind and form of the
tentative models may never have been tried before. Each
dataset speaks for itself and drives the iterative process.
The Final Model could be as simple as:
1. A simple trend model or a simple ordinary least
squares model.
2. An exponential smoothing model.
3. A simple weighted average where the weights are
either equal or unequal.
4. A Cochrane-Orcutt or ordinary least squares with a
first order fixup.
5. A simple ordinary least squares model in differences
containing some needed lags.
6. A spline-like set of local trends superimposed with
an arbitrary ARIMA model and perhaps a pulse or two.
The number of possible final models that AUTOBOX could
find is infinite and only discoverable via a true expert
system like AUTOBOX.
A final model may require one or more of the following
seasonal structures:
1. Seasonal ARIMA structure where the prediction depends
on some previous reading S periods ago.
2. Seasonal structure via a complete set of seasonal
dummies reflecting a fixed response based upon the particular
period.
3. Seasonal structure via a partial set of seasonal
dummies reflecting a fixed response based upon the particular
period.
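For illustration, a full set of seasonal dummies (structure 2)
is just a 0/1 design matrix; a partial set (structure 3) would
keep only the columns that test significant. A minimal sketch
(the function name is ours, not AUTOBOX's):

```python
def seasonal_dummies(n, s):
    # Column j of row t is 1 exactly when observation t falls in
    # period j of the seasonal cycle (t mod s == j).
    return [[1 if t % s == j else 0 for j in range(s)] for t in range(n)]

rows = seasonal_dummies(24, 12)  # two years of monthly dummies
```

Each row has exactly one 1, so a fixed seasonal response is read
off directly from the estimated coefficient of that column.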
The Final Model will satisfy both:
1. Necessity tests that guarantee the estimated
coefficient is statistically significant.
2. Sufficiency tests that guarantee that the error
process:
- is unpredictable from its own past.
- is not predictable from the set of causals.
- has a constant mean of zero.
The Final model will contain one or more of the
following structures:
1. CAUSAL with correct lead/lag specification.
2. MEMORY with correct "autoregressive memory".
3. DUMMY with correct pulses, level shifts, or spline
time trends.
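As a toy illustration of how the three structures can coexist
in one prediction equation (all coefficients and values below
are invented for the example, not AUTOBOX output):

```python
# One equation combining the three structures:
#   Y(t) = b0 + b1*X(t-1)    CAUSAL, with a one-period lag
#        + phi*Y(t-1)        MEMORY, autoregressive term
#        + w*P(t)            DUMMY, pulse at a known anomaly date
def predict(y_prev, x_prev, pulse, b0=10.0, b1=0.5, phi=0.3, w=-8.0):
    return b0 + b1 * x_prev + phi * y_prev + w * pulse

yhat_normal = predict(y_prev=100.0, x_prev=20.0, pulse=0)
yhat_pulsed = predict(y_prev=100.0, x_prev=20.0, pulse=1)
```

The pulse coefficient w shifts the prediction only in the
anomalous period; the causal and memory terms carry the rest.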
2.0 QUESTION : IS AUTOBOX AUTOMATED, SEMI-AUTOMATED OR MANUAL ?
AUTOBOX provides automated, semi-automated, and manual
capabilities. AUTOBOX has a complete set of forecasting
features that will appeal to both novice and expert
forecasters. Autobox's automatic features are unparalleled
in breadth and depth of implementation. Autobox is truly the
power forecaster's dream tool, with a palette of tools that
allows the forecaster to build models that work.
AFS was the first company to automate the BJ model
building process. Our approach is to program the model
identification, estimation and diagnostic feedback loop as
originally described by Box and Jenkins. This is implemented
for both ARIMA (univariate) modeling and Transfer Function
(multivariate or regression) modeling. What this means is
that the user, from novice to expert, can feed Autobox any
number of series and the program's powerful modeling
heuristics can do the work for you. This option is
implemented such that it can be turned on at any stage of
the modeling process. There is complete control over the
statistical sensitivities for the inclusion/exclusion of
model parameters and structures. These features allow the
user complete control over the modeling process. The user
can let Autobox do as much or as little of the model
building process as the user or the complexity of the
problem dictates.
Autobox comes with a complete set of identification and
modeling tools for use in the BJ framework. This means that
you have the ability to transform or prewhiten the chosen
series for identification purposes. Autobox handles both
ARIMA (univariate) modeling and Transfer Function
(multivariate) modeling allowing for the inclusion of
interventions (see below for more information). Tests for
interventions, need for transformations, need to add or
delete model parameters are all available. Autocorrelation
(both traditional and robust), partial autocorrelation and
cross-correlation functions and their respective tests of
significance are calculated as needed. Model fit statistics,
including R-square, SSE, variance of errors, and adjusted
variance of errors, are all reported. Information criteria
statistics for alternate model identification approaches are
provided.
One of the most powerful features of Autobox is the
inclusion of Automatic Intervention detection capabilities in
both ARIMA and Transfer Function models. Almost all
forecasting packages allow for interventions to be included
in a regression model. What these packages don't tell you is
how sensitive all forecasting methodologies are to the impact
of interventions or missing variables. These packages don't
tell you if your series may be influenced by missing
variables or changes that are outside the current model. If
a data series is impacted by changes in the underlying
process at discrete points in time, both ARIMA models and
Transfer Function models will produce poor results. For
example, a competitor's price change shifts the level of
demand for your product. Without a variable to account for
this change, your forecast model will perform poorly.
Autobox implements groundbreaking techniques which quickly
and accurately identify potential interventions (level
shifts, seasonal pulses, single-point outliers and changes
in the variance of the series). These variables can then be
included in your model at your discretion. The result is
more robust models and greater forecast accuracy.
All forecast packages allow you to produce forecasts
using the models you have constructed. Autobox presents the
critical information you need to determine whether those
forecasts are acceptable. Autobox has options that allow you
to analyze the stability and forecasting ability of your
forecast model. This is achieved through a series of
ex-post forecast analyses. You can automatically withhold
any number of observations, reestimate the model form and
forecast. Observations are then added back one at a time and
the model is reestimated and reforecast. Forecast accuracy
statistics, including Mean Absolute Percent Error (MAPE) and
Bias, are calculated at each forecast end point. Thus the
stability of the model and its ability to forecast from
various end points can be analyzed. Finally, you can
optionally allow Autobox to actually re-identify the model
form at each level of withheld data to see if the model form
is unduly influenced by recent observations.
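The withhold/re-forecast loop described above can be sketched
in a few lines. The "model" here is deliberately just the
training-set mean, a stand-in for whatever model is actually
re-estimated at each origin, and the numbers are made up:

```python
def rolling_origin_mape(series, n_withheld):
    """Withhold the last n_withheld points, then forecast each one
    from an origin that grows by one observation at a time,
    recording the absolute percent error at each origin."""
    errors = []
    n = len(series)
    for origin in range(n - n_withheld, n):
        train = series[:origin]
        forecast = sum(train) / len(train)   # stand-in "model"
        actual = series[origin]
        errors.append(abs(actual - forecast) / abs(actual) * 100.0)
    return errors

errs = rolling_origin_mape([100, 102, 98, 101, 99, 150], n_withheld=2)
```

A large jump in the error at one origin (here the final point)
flags exactly the kind of instability this analysis is meant to
surface.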
3.0 QUESTION : PLEASE EXPLAIN YOUR APPROACH TO CAUSAL MODELLING ?
QUICK ANSWER: AUTOBOX provides a very comprehensive
range of causal models, including but not limited to
incorporating lead effects as well as contemporaneous and
lag effects. It can detect and compensate for changes in
variance, changes in model form, and changes in parameters.
LONG ANSWER:
Multiple Regression was originally developed for
cross-sectional data, but statisticians and economists have
been applying it (mostly incorrectly) to chronological or
longitudinal data with little regard for the Gaussian
assumptions: a constant mean for the errors, constant
variance, identically distributed errors, and independence
of the errors. AUTOBOX tests for and remedies any proven
violations.
Following is a brief introduction to time series
analysis.
Time series = a sequence of observations taken on a
variable or multiple variables at successive points in time.
Objectives of time series analysis:
1. To understand the structure of the time series (how
it depends on time, itself, and other time series variables)
2. To forecast/predict future values of the time series
What is wrong with using regression for modeling time
series?
* Perhaps nothing. The test is whether the residuals
satisfy the regression assumptions: linearity,
homoscedasticity, independence, and (if necessary) normality.
It is important to test for Pulses or one-time unusual values
and to either adjust the data or to incorporate a Pulse
Intervention variable to account for the identified anomaly.
Unusual values can often arise seasonally; thus one has
to identify and incorporate Seasonal Intervention variables.
Unusual values can also arise at successive points in
time, marking the need for a Level Shift Intervention to
deal with a proven mean shift in the residuals.
* Often, time series analyzed by regression suffer from
autocorrelated residuals. In practice, positive
autocorrelation seems to occur much more frequently than
negative.
* Positively autocorrelated residuals make regression
tests more significant than they should be and confidence
intervals too narrow; negatively autocorrelated residuals do
the reverse.
* In some time series regression models, autocorrelation
produces biased estimates, and the bias cannot be fixed no
matter how many data points or observations you have.
To use regression methods on time series data, first
plot the data over time. Study the plot for evidence of
trend and seasonality. Use numerical tests for
autocorrelation, if not apparent from the plot.
* Trend can be dealt with by using functions of time as
predictors. Sometimes we have multiple trends and the trick
is to identify the beginning and end periods for each of the
trends.
* Seasonality can be dealt with by using seasonal
indicators (Seasonal Pulses) as predictors or by allowing
specific auto-dependence or auto-projection such that
historical values ( Y(t-s) ) are used to predict Y(t).
* Autocorrelation can be dealt with by using lags of the
response variable Y as predictors.
* Run the regression and diagnose how well the
regression assumptions are met.
* the residuals should have approximately the same
variance (homoscedasticity) otherwise some form of "weighted"
analysis might be needed.
* the model form/parameters should be invariant, i.e.
unchanging over time. If not, then we perhaps have too much
data and need to determine at what points in time the model
form or parameters changed.
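The autocorrelation diagnosis in the steps above is commonly
done with the Durbin-Watson statistic (a standard check, not
one named in the text); a minimal sketch with made-up residual
series:

```python
def durbin_watson(residuals):
    """DW statistic: values near 2 suggest little lag-1
    autocorrelation; well below 2 suggests positive autocorrelation
    (the common time series case); near 4 suggests negative."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(r * r for r in residuals)
    return num / den

dw_pos = durbin_watson([1.0, 1.2, 0.9, 1.1, 1.0, 0.8])    # drifting
dw_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]) # alternating
```

Slowly drifting residuals give a DW far below 2 (positive
autocorrelation); an alternating series gives a value near 4.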
Time series data presents a number of
problems/opportunities that standard statistical packages
either avoid or ignore:
1. How to determine the temporal relationship for each
input series, i.e. is the relationship contemporaneous,
lead, lag, or some combination? (How to identify the form
of a multi-input transfer function without assuming
independence of the inputs.)
2. How to determine the ARIMA model for the noise
structure reflecting omitted variables.
3. How to do this in a ROBUST MANNER where pulses,
seasonal pulses, level shifts and local time trends are
identified and incorporated.
4. How to test for and include specific structure to
deal with non-constant variance of the error process.
5. How to test for and treat non-constancy of
parameters or model form.
6. Do we model the original series or the differenced
series?
AUTOBOX deals with these issues and more.
A very natural question arises in the selection and
utilization of models. One asks, "Why not use simple models
that provide uncomplicated solutions?" The answer is very
straightforward: "Use enough complexity to deal with the
problem and not an ounce more." Restated, let the data speak
and validate all assumptions underlying the model. Don't
assume a simple model will adequately describe the data. Use
identification/validation schemes to identify the model.
A transfer function can be expressed as a lagged
autoregression in all variables in the model. AUTOBOX
reports this form so users can go directly to spreadsheets
for whatever purposes they require. Care should be taken to
deal with Gaussian violations such as outliers (pulses),
level shifts, seasonal pulses, local time trends, changes
in variance, changes in parameters, changes in models ...
just to name a few.
4.0 QUESTION : PLEASE DESCRIBE SOME OF THE LIMITATIONS IN AUTOBOX ?
A maximum of 29 input series is allowed, but larger
versions are available. A maximum of 600 observations is
allowed, but larger versions are available. A maximum of 156
periods can be forecasted, but larger versions are
available.
5.0 QUESTION : CAN MODELS BE STORED FOR REUSE ?
AUTOBOX delivers the model in a machine-readable fashion
so that it can be reused at a later point.
6.0 QUESTION : DOES AUTOBOX HANDLE OUT-OF-SAMPLE DATA FOR MODEL
TESTING AND/OR MODEL IDENTIFICATION ?
AUTOBOX provides a comprehensive summary of forecasting
performance and allows for forecast errors to be tracked by
origin and by lead time. In a sentence, one 12-period
forecast error is not the same as twelve one-period forecast
errors. Kind of obvious, but most people never think of
tracking performance from different origins.
ACTUAL DATA & FOUR PERIOD OUT FORECASTS FROM SIX ORIGINS
(LEAD TIME = 4 ; ORIGINS = 6)
ACTUAL 397 378 472 370 395 427
ORIGIN\DATE 1984/3 1984/4 1984/5 1984/6 1984/7 1984/8
1984/2 308 328 399 355
1984/3 347 472 387 431
1984/4 396 404 426 439
1984/5 421 436 441
1984/6 444 448
1984/7 444
Note: We have 6 estimates of a one-period forecast error
Note: We have 5 estimates of a two-period forecast error
Note: We have 4 estimates of a three-period forecast error
Note: We have 3 estimates of a four-period forecast error
Measures To Assess Forecast Model Performance
The above table contains all the raw data necessary to
assess how predictable the future is for alternative lead
times. The point is simple and profound. Forecasting
accuracy from a single launch point generates correlated
forecast errors. Forecast error analyses should use a number
of different origins and a number of lead times. To
reiterate, a set of historical values is used to perform
some modeling activity, be it automatic or not, in order to
come up with a model and a set of coefficients. These are
then fixed, and each of the withheld observations is then
used as the launching point for a new set of forecasts. With
this approach, the new observations don't fully participate
in the modeling process; they affect the forecasts but not
the model form nor its parameters.
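The origin-by-lead-time bookkeeping can be sketched as
follows, using the first two origins of the table above (the
remaining origins would be added the same way):

```python
# Dates in order; forecasts[origin][k-1] is the k-step-ahead
# forecast made at that origin. Numbers echo the table above.
dates = ["1984/2", "1984/3", "1984/4", "1984/5", "1984/6"]
actuals = {"1984/3": 397, "1984/4": 378, "1984/5": 472, "1984/6": 370}
forecasts = {
    "1984/2": [308, 328, 399, 355],
    "1984/3": [347, 472, 387, 431],
}

def errors_by_lead(max_lead=4):
    """Group errors (actual - forecast) by lead time across origins."""
    by_lead = {k: [] for k in range(1, max_lead + 1)}
    for origin, fcsts in forecasts.items():
        i = dates.index(origin)
        for k, f in enumerate(fcsts, start=1):
            if i + k < len(dates) and dates[i + k] in actuals:
                by_lead[k].append(actuals[dates[i + k]] - f)
    return by_lead

by_lead = errors_by_lead()
```

Each bucket then yields its own MAPE, bias, and so on, which is
exactly what the per-lead tables below report.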
Typical Output Tables From AUTOBOX
VALUES ARE IN TERMS OF THE ORIGINAL METRIC
Number of Actuals 3
Forecast Mean Deviation (Bias) 469.327
Forecast Mean Percent Error 1.20294
Forecast Mean Absolute Deviation 1601.47
Forecast Mean Absolute % Error 4.26474
Forecast Variance (Precision) .39E+07
Forecast Bias Squared (Reliability) 220268
Forecast Mean Square Error (Accuracy) .41E+07
Relative Absolute Error .3969
Typical Output Tables From AUTOBOX (MORE)
Lead MEAN MEAN % MEAN MEAN
Time DEVIATION ERROR ABSOLUTE ABSOLUTE %
(BIAS) DEVIATION ERROR
1 .18E+04 4.49 .41E+04 10.75
2 .22E+04 5.49 .36E+04 12.35
3 .38E+04 5.49 .48E+04 15.75
4 .15E+04 2.29 .21E+04 5.75
Typical Output Tables From AUTOBOX (MORE)
Lead VARIANCE BIAS MEAN RELATIVE
Time (PRECISION) SQUARED SQUARE ABSOLUTE
(RELIABILITY) ERROR ERROR
(ACCURACY)
1 .23E+08 .32E+07 .26E+08 .54
2 .24E+08 .36E+07 .24E+08 .62
3 .24E+08 .56E+07 .34E+08 .39
4 .14E+08 .56E+07 .44E+08 .34
ACCURACY = PRECISION + RELIABILITY
We will now define all of these terms so that you can
know how they were computed.
R = A - F (residual = actual minus forecast) and N = NAIVE FORECAST
a) Forecast Mean Deviation (Bias)
The simple average of the errors where bias is the
actual less the forecast.
b) Forecast Mean Percent Error
Expressing the error as a percentage of the actual we
get the percent error. If we average these percentages then
we get the average or mean percent error.
c) Forecast Mean Absolute Deviation
Each bias or error can cancel or offset another. This
statistic disables that potential flaw insofar as it computes
the absolute average disallowing cancellation of the errors.
d) Forecast Mean Absolute % Error
If we now take the simple percent errors and take their
absolute magnitudes, we can then compute the average or mean
absolute percent error.
e) Forecast Variance
The simple sum of squares of the errors around the
average error is taken and averaged. Often called Precision.
f) Forecast Bias Squared
The overall average error is squared to compute this
statistic. This is often called Reliability.
g) Forecast Mean Square Error
The sum of the errors squared and averaged is often
called Accuracy.
h) Relative Absolute Error
Performance vis-a-vis a random walk prediction is often
a useful measure. Here we sum the absolute errors from the
model and divide by the sum of absolute errors from a
random walk model.
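The measures (a)-(h) can be computed directly; a minimal
sketch with made-up actuals, forecasts, and random-walk
(naive) forecasts. It also verifies the identity
ACCURACY = PRECISION + RELIABILITY stated earlier:

```python
def forecast_measures(actuals, forecasts, naive):
    # R = A - F, per the definitions above; naive holds the
    # random-walk forecasts used by the relative absolute error.
    errors = [a - f for a, f in zip(actuals, forecasts)]
    n = len(errors)
    bias = sum(errors) / n                                    # (a)
    mpe = 100.0 * sum(e / a for e, a in zip(errors, actuals)) / n  # (b)
    mad = sum(abs(e) for e in errors) / n                     # (c)
    mape = 100.0 * sum(abs(e) / a
                       for e, a in zip(errors, actuals)) / n  # (d)
    var = sum((e - bias) ** 2 for e in errors) / n            # (e)
    bias_sq = bias ** 2                                       # (f)
    mse = sum(e * e for e in errors) / n                      # (g)
    rae = sum(abs(e) for e in errors) / sum(
        abs(a - nf) for a, nf in zip(actuals, naive))         # (h)
    return {"bias": bias, "mpe": mpe, "mad": mad, "mape": mape,
            "var": var, "bias_sq": bias_sq, "mse": mse, "rae": rae}

m = forecast_measures([100, 110, 120], [90, 115, 120],
                      naive=[95, 100, 110])
```

With these toy numbers the relative absolute error is 0.6, i.e.
the model's absolute errors are 60% of the random walk's.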
Additionally, the user can run a tournament and select
the model and set of parameters that minimize an
out-of-sample error criterion such as MAPE or BIAS.
Furthermore, the selection can be based either on a
prediction of all the values up to and including a
particular lead time, say 3 periods, or solely on that
individual lead time rather than the sum of all values up
to that lead time.
7.0 QUESTION : WHAT MODEL DIAGNOSTICS DOES AUTOBOX SUPPLY ?
AUTOBOX reports the standard set of model diagnostics as
explained below.
LONG ANSWER:
Number of Residuals (R)        = n
Number of Degrees of Freedom   = n - m
Residual Mean                  = Sum(R)/n
Sum of Squares                 = Sum(R^2)
Variance              var      = Sum(R^2)/n
Adjusted Variance              = Sum(R^2)/(n-m)
Standard Deviation    s        = SQRT(Sum(R^2)/(n-m))
Standard Error of the Mean     = s/SQRT(n-m)
Mean / its Standard Error      = Mean/[s/SQRT(n-m)]
Mean Absolute Deviation        = Sum(|R|)/n
AIC Value  (uses var)          = n*ln(var) + 2m
SBC Value  (uses var)          = n*ln(var) + m*ln(n)
BIC Value  (uses var)          = see Wei p.153
R Square                       = 1 - [Sum(R^2)/Sum((A-Abar)^2)]
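These statistics can be reproduced in a few lines; a sketch
following the formulas above, where m is the number of
estimated parameters:

```python
import math

def model_diagnostics(residuals, actuals, m):
    """Fit statistics for a model with m estimated parameters."""
    n = len(residuals)
    sse = sum(r * r for r in residuals)          # Sum of Squares
    var = sse / n                                # Variance
    adj_var = sse / (n - m)                      # Adjusted Variance
    abar = sum(actuals) / len(actuals)
    r_square = 1 - sse / sum((a - abar) ** 2 for a in actuals)
    aic = n * math.log(var) + 2 * m
    sbc = n * math.log(var) + m * math.log(n)
    return {"sse": sse, "var": var, "adj_var": adj_var,
            "r_square": r_square, "aic": aic, "sbc": sbc}

d = model_diagnostics([1.0, -1.0, 1.0, -1.0],
                      [10.0, 12.0, 10.0, 12.0], m=1)
```

Here the residual scatter equals the scatter of the actuals
about their mean, so R Square comes out to zero.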
These are some of the tests reported:
1 model necessity with respect to estimated parameters
2 model necessity with respect to invertibility
3 automatic fixup with respect to detecting need for
seasonal dummies vis-a-vis seasonal differencing.
4 model sufficiency fixup w/respect to adding ARIMA
structure
5 model sufficiency fixup w/respect to adding
outliers (one-time outliers or pulses, seasonal outliers,
level shifts and local time trends)
6 model sufficiency fixup w/respect to incorporating
weights to remedy variance changes not dependent on level
7 model sufficiency fixup w/respect to incorporating
variance stabilizing transformations to remedy variance
changes tied to level changes.
8 model changes over time ... detecting break points.
9.0 QUESTION : WHAT DATA FORMATS DOES AUTOBOX SUPPORT ?
AUTOBOX supports fully free-standing and interactive
use, with data delivered in text, EXCEL, or .dbf form.
AUTOBOX can also support a DLL interface providing all
information to and from AUTOBOX within a user's application.
AUTOBOX provides a range of export facilities providing
full details for normal post-processing of forecasts,
models, interventions detected, and equation forms that can
be used in spreadsheets for what-if analysis.
10.0 QUESTION : HOW DOES AUTOBOX ALLOW THE USER TO DEFINE AND USE
ANOMALOUS DATA ?
The overwhelming strength of AUTOBOX lies in its rich
procedures for dealing with anomalous data. In particular,
the user can specify user-defined criteria to detect
anomalous data.
Following are the user controls that allow a user to
control the detection of anomalous data points.
Outliers can occur in many ways. They may be the result
of a gross error, for example, a recording or transcript
error. They may also occur by the effect of some exogenous
intervention. These can be described by two different, but
related, generating models discussed by Chang and Tiao (1983)
and by Tsay (1986). They are termed the innovational outlier
(IO) and additive outlier (AO) models. AUTOBOX uses the AO
approach due to estimation considerations. ARIMA modeling
may be deficient when the series has been intervened with.
This program will test the residuals from the ARIMA model for
possible outlier (intervention) variables.
The automatic intervention detection option
automatically determines the need for intervention variables
using the residuals from an estimated model and automatically
introduces them into the model.
AUTOBOX recognizes that a sequence of "UNUSUAL VALUES"
s periods apart (e.g. every December) is not unusual at
all and should be treated not as anomalous but as part of
the prediction pattern.
AUTOBOX recognizes that a sequence of "UNUSUAL VALUES"
that have the same direction and magnitude supports the
hypothesis of a level shift and thus is collectively
significant. Note that in practice none of the values may in
and of themselves be significant, but collectively they
suggest a permanent shift.
AUTOBOX recognizes that a sequence of values may be best
described by a simple trend line. AUTOBOX can identify the
beginning and the end of each trend, where each local trend
may have a different slope.
Note that anomalous data can sometimes be referred to as
an inlier. Consider the sequence
1,9,1,9,1,9,5,9,.....
AUTOBOX would identify "5" as being anomalous even
though it sits at the average.
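A sketch of how an inlier like that "5" can be caught:
compare each point to the typical value of its own period
rather than to the overall mean. This is an illustration of
the idea only, not AUTOBOX's actual algorithm:

```python
def flag_inliers(series, period=2, threshold=2.0):
    """Flag points far from their own period's median, scaled by
    that period's median absolute deviation (floored at 1)."""
    flags = []
    for j in range(period):
        idx = list(range(j, len(series), period))
        group = [series[i] for i in idx]
        med = sorted(group)[len(group) // 2]
        mad = sorted(abs(x - med) for x in group)[len(group) // 2] or 1.0
        for i in idx:
            if abs(series[i] - med) / mad > threshold:
                flags.append(i)
    return sorted(flags)

flags = flag_inliers([1, 9, 1, 9, 1, 9, 5, 9])
```

The "5" equals the overall mean of 5, so a global outlier test
misses it; against its own period's pattern of 1's it stands out.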
We show here an extraction from the AFS document
describing AUTOBOX functionality.
Line 75
SUFFICIENCY (DETERMINISTIC STRUCTURE ) : | 1
Outliers can occur in many ways. They may be the result of a gross error, for example, a recording or transcription error. They may also occur through the effect of some exogenous intervention. These can be described by two different, but related, generating models discussed by Chang and Tiao (1983) and by Tsay (1986). They are termed the innovational outlier (IO) and additive outlier (AO) models. AUTOBOX uses the AO approach due to estimation considerations. ARIMA modeling may be deficient when the series has been intervened with. This program will test the residuals from the ARIMA model for possible outlier (intervention) variables. We suggest that you modify either your model or your time series for any outlier variables that may be found. The automatic intervention detection option automatically determines the need for intervention variables using the residuals from an estimated model and automatically introduces them into the model.
Line 76
ENABLE THE OUTLIER TEST PRIOR TO ARIMA | 1
Classical INTERVENTION DETECTION required the initial ARIMA filter which would be used to pre-filter the series prior to detecting pulse, seasonal pulse, level shifts or local time trends. There are considerable cases in the literature which would suggest a regression against fixed dummy variables before the ARIMA identification. If this switch is enabled these dummy variables are identified from the original series.
Line 77
CONFIDENCE LEVEL FOR SUFFICIENCY (DS) |90.0
If you select the outlier detection option, then you must specify the confidence limit to be used for detecting possible outlier variables. For example, .80 indicates that the program should identify all outliers that are significant at the 80% level.
Line 78
MAXIMUM NUMBER OF OUTLIERS TO BE IDENTIFIED | 5
You may elect to limit AUTOBOX to a certain number of empirically identified outliers. As delivered, the standard product is limited to a maximum of 5 input series in a transfer function, thus this integer cannot exceed that limit. AFS sells larger versions which allow up to 19 input series. This feature allows the user to control the incorporation of potentially spurious interventions leading to numerical instability.
Line 79
INCLUDE PULSE VARIABLES | 1
Select "1" to include pulse interventions.
Line 80
ENABLE DYNAMIC PULSE TEST | 1
If pulses arise at consecutive points in time this may be an indicator that a transient intervention variable is more appropriate. If this is enabled and if consecutive pulses are identified then and only then will this possible model re-statement be made.
Line 81
INCLUDE STEP VARIABLES | 1
Choose "1" to include step interventions.
Line 82
MINIMUM NUMBER OF OBSERVATIONS IN GROUP | 2
The number entered determines how many successive values must be on a different level before Autobox will consider there to be a level shift.
Line 83
INCLUDE SEASONAL PULSE VARIABLES | 1
Choose "1" to include seasonal pulse interventions.
Line 84
INCLUDE LOCAL TRENDS | 0
Choose "1" for Autobox to identify multiple trends.
Line 85
INCLUDE HIDDEN SEASONAL PULSE VAR 0=N 1=Y | 0
Choose "1" to activate this function.
Standard Seasonal Pulse detection uses the user specified periodicity as the key to identifying variables of this form:
SEASONALITY (eg. 12) | 12
Thus the program will limit its search to seasonal pulse variables that have a pattern of 11 0's and then a 1. The program tries all estimable candidates. If you have a series in which you are trying to find hidden deterministic structure, this approach may be insufficient. By enabling this option, the program will attempt to go beyond what might be the expected pattern interval and detect deterministic structure other than the norm. Evidence of patterns not consistent with the expected pattern might motivate further stochastic model investigation by a skilled time series analyst.
Line 86
MAXIMUM INTERVAL TO SCAN |
If you elected to "INCLUDE HIDDEN SEASONAL PULSE VARIABLES" then you must specify the maximum length of the pattern. The upper limit on this integer is 1/2 the length of the series. This input can significantly affect runtime length. The conventional search for a seasonal intervention, as empowered by "Include seasonal pulse variables (YES/NO)", performs an exhaustive set of regressions to determine which variable (i.e. start period) is best. For example, if you had 100 observations and the data had a known periodicity ("Minor periods per major time interval (eg. 12)") of 12, then 100-12, or 88, models would be estimated. Each model would have a different candidate series. Each of these 88 possible intervention series would have a common seasonal pattern, a one followed by eleven zeroes. This implies that the user knows 'a priori' that this was the case. Consider cases where you wish "to search for hidden periodicities" and wish to allow the seasonal pattern to vary, for example in the range 2 to 12. This would imply that 88 + 89 + 90 + ... + 98 models would have to be tried, or 11*(88+98)/2 = 1023. The compute time can be large, and evidenced cycles may be difficult to explain as they might reflect the hidden variable omitted from the model. Omitted variables can create what might be considered unusual lag structure or cyclical intervention variables that act as surrogates for the ever-popular unknown series, which may be uncollectible or unknown to the modeler.
Line 87
ENABLE AUTOMATIC FIXUP FOR SEASONAL DUMMIES | 0
Choose "1" to enable this option to test for the presence of a SEASONAL DETERMINISTIC VARIABLE which has a zero/one pattern according to the following:
a "1" in the corresponding period and a "0" in other periods
The formal test is outlined in Franses's paper in the International Journal of Forecasting, July 1991, pp. 199-208 (see the help for the associated confidence value).
Line 88
CONFIDENCE LEVEL FOR SEASONAL DUMMIES |95.0
If you elected to turn the Seasonal Dummy Test on, then you have the option of specifying the confidence level value that will be used to determine the significance of a parameter. For example, 1.96 indicates that the program will replace a stochastic seasonal difference factor with a set of seasonal dummies. Essentially the null hypothesis is that seasonal differences are appropriate. If one of the S roots is not significant then the seasonal differencing operator is replaced by a seasonal dummy requiring an initial S-1 parameters. If the first root is not significant then regular differences will be included. Another aspect of this test is the possible identification of a linear trend series. For more information see Franses (1991).
11.0 QUESTION : HOW DOES AUTOBOX IDENTIFY ANOMALOUS DATA AND SCRUB THE DATA?
AUTOBOX has implemented techniques proposed by Tiao, Tsay
and others. Additionally AUTOBOX has extended INTERVENTION
DETECTION to include local time trends.
AUTOBOX first builds a model and then examines the
evidenced errors for abnormal residuals that could be
"explained" by Pulses, Seasonal Pulses, Level Shifts and
Local Time Trends. AUTOBOX removes outliers based on
abnormal residuals to a fitted model, not necessarily simple
but containing only significant parameters. The OUTLIER
DETECTION scheme does not employ a simple +/- range of
standardized residuals to determine the unusual, because
AUTOBOX not only tests for PULSES or one-time events but
also considers the "group effect" that might arise from a
sequence of contiguous outliers, each individually
non-significant but collectively quite significant. This
arises quite naturally in level shifts or local time trends.
To understand the SEARCH mechanism that AUTOBOX uses, you
might consider it analogous to stepwise forward procedures
insofar as it searches the sample space, i.e. the possible
solutions, and determines the variable that would, if
included or added, significantly improve the model by
correcting an evidenced or proven Gaussian violation.
AUTOBOX recomputes model residuals after each "fix" or
model augmentation, thus eliminating the bias caused by the
previously identified anomalies.
The heart of the matter is the General Linear model
where Y is the variable to be predicted and the form of the X
matrix is to be determined empirically. Consider a case
where the noise process is uncorrelated. One could construct
simple regression models, where the trial X variable was
initially 1,0,0,0,0,0,0.....0,0,0,0 and evaluate the error
sum of squares. You could then create yet a second model
where the X variable was 0,1,0,0,0,0,0,0,0,0,0.......0,0,0,0
and evaluate it in terms of the resultant error sum of
squares. If you had 100 observations you would then run 100
regressions and the regression that generated the smallest
error sum of squares would then be the MAXIMUM LIKELIHOOD
ESTIMATE. This process would be repeated until the most
important NEW variable was not statistically significant.
This is essentially STEP-WISE forward regression, where the
X variable is found by a search method. Two important
generalizations are needed before you go running off to
build our new competitor: 1. The search for the X variable
has to be extended to include SEASONAL PULSES, LEVEL SHIFTS
and TIME TRENDS, and 2. The error process may be other than
white noise, thus one has to iteratively construct TRANSFER
FUNCTIONS rather than multiple regressions. The process
gets a little more sticky when you have pre-defined user
inputs in the model.
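The uncorrelated-noise case described above can be sketched
directly: try every one-period pulse dummy and keep the one
with the smallest error sum of squares (the data are made up):

```python
def best_pulse(y):
    """A regression on an intercept plus a single one-period pulse
    fits that period exactly, so its SSE is just the scatter of
    the remaining points about their own mean. The pulse giving
    the smallest SSE is the maximum-likelihood candidate."""
    best_t, best_sse = None, float("inf")
    for t in range(len(y)):
        rest = [v for i, v in enumerate(y) if i != t]
        mean = sum(rest) / len(rest)
        sse = sum((v - mean) ** 2 for v in rest)
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

t_star, sse = best_pulse([10, 11, 9, 10, 40, 11, 10])
```

The search lands on the obvious outlier; the candidate would
then be kept only if its coefficient tests significant, and the
loop repeated on the updated residuals.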
Outliers and structure changes are commonly encountered
in time series data analysis. The presence of these
extraordinary events could and has misled conventional time
series analysts, resulting in erroneous conclusions. The
impact of these events is often overlooked, however, for
lack of a simple yet effective means to incorporate these
isolated events. Several approaches have been considered in
the literature for handling outliers in a time series. We
will first illustrate the effect of unknown events which
cause simple model identification to go awry. We will then
illustrate what to do in the case when one knows a priori
about the date and nature of the isolated event. We will
also point out a major flaw when one assumes an incorrect
model specification. Then we introduce the notion of finding
the intervention variables through a sequence of alternative
regression models yielding maximum likelihood estimates of
both the form and the effect of the isolated event. Standard
identification of ARIMA models uses the sample ACF as one of
the two vehicles for model identification. The ACF is
computed using the covariance and the variance. An outlier
distorts both of these and in effect dampens the ACF by
inflating both measures. Another problem with outliers is
that they can distort the sample ACF and PACF by introducing
spurious structure or correlations. For example consider the
circumstance where the outlier dampens the ACF:
ACF = COVARIANCE/VARIANCE
Thus the net effect is to conclude that the ACF is flat;
and the resulting conclusion is that no information from the
past is useful. These are the results of incorrectly using
statistics without validating the parametric requirements.
It is necessary to check that no isolated event has inflated
either of these measures leading to an "Alice in Wonderland"
conclusion. Various researchers have concluded that the
history of stock market prices is information-less. Perhaps
the conclusion should have been that the analysts were
statistic-less. Another way to understand this is to derive
the estimator of the coefficient from a simple model and to
evaluate the effect of a distortion. Consider the true model
as an AR(1) with the following
familiar form:
    [ 1 - PHI1 B ] Y(t) = A(t),  i.e.  Y(t) = PHI1 Y(t-1) + A(t)
                                 or    Y(t) = A(t) / [ 1 - PHI1 B ]
The variance of Y can be derived as:
    variance(Y) = PHI1*PHI1 variance(Y) + variance(A)
thus
    PHI1 = SQRT( 1 - variance(A)/variance(Y) )
Now if the true state of nature includes an intervention of
form I(t) occurring at time period t with a magnitude of W,
we have:
    Y(t) = { A(t) / [ 1 - PHI1 B ] } + W I(t)
with
    observed variance(Y) = [ PHI1*PHI1 variance(Y) + variance(A) ]
                           + [W I(t)]*[W I(t)]
                         = true variance(Y) + distortion
thus the estimate becomes
    PHI1 = SQRT( 1 - [ variance(A) + [W I(t)]*[W I(t)] ] / variance(Y) )
The inaccuracy or bias due to the intervention is not
predictable due to the complex nature of the relationship.
At one extreme the addition of the squared bias to
variance(A) would increase the numerator and drive the ratio
to 1 and the estimate of PHI1 to zero. The rate at which
this happens depends on the relative size of the variances
and the magnitude and duration of the isolated event. Thus
the presence of an outlier could hide the true model. Now
consider another option where the variance(Y) is large
relative to variance(A). The effect of the bias is to drive
the ratio to zero and the estimate of PHI1 to unity. A shift
in the mean would generate an ACF that dies out slowly, if
at all, thus leading to a misidentified first-difference
model. In
conclusion the effects of the outlier depend on the true
state of nature. It can both incorrectly hide model form and
incorrectly generate evidence of a bogus model.
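The dampening effect is easy to demonstrate numerically (a sketch using numpy; the seed, sample size and outlier magnitude are arbitrary choices): a single large additive outlier inflates the variance far more than the covariance, pulling the lag-1 autocorrelation, and hence the PHI1 estimate, toward zero.

```python
import numpy as np

def lag1_acf(y):
    """Sample lag-1 autocorrelation, the usual basis for the
    PHI1 estimate when identifying an AR(1) from the ACF."""
    y = y - y.mean()
    return np.sum(y[1:] * y[:-1]) / np.sum(y * y)

rng = np.random.default_rng(0)
n, phi = 500, 0.8
a = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):                    # true model: AR(1), PHI1 = 0.8
    y[t] = phi * y[t - 1] + a[t]

clean = lag1_acf(y)                      # close to the true 0.8
dirty = y.copy()
dirty[250] += 50.0                       # one large additive outlier
damped = lag1_acf(dirty)                 # pulled toward zero
```

One contaminated observation out of five hundred is enough to make a strongly autocorrelated series look nearly information-less.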
These outliers are represented as intervention
variables of the following forms: pulses, level shifts and
seasonal pulses. The procedure for detecting the outlier variables is
as follows. Develop the appropriate ARIMA model for the
series. Test the hypothesis that there is an outlier via a
series of regressions at each time period. Modify the
residuals for any potential outlier and repeat the search
until all possible outliers are discovered. These outliers
can then be included as intervention variables in a multiple
input B-J model. The noise model can be identified from the
original series modified for the outliers. AFS has extended
outlier detection to detecting the presence of local time
trends.
This option to the program provides a more complete
method for the development of a model to forecast a
univariate time series. The basic premise is that a
univariate time series may not be homogeneous and, therefore,
the modeling procedure should account for this. By
homogeneous, we mean that the underlying noise process of a
univariate time series is random about a constant mean. If a
series is not homogeneous, then the process driving the
series has undergone a change in structure and an ARIMA model
is not sufficient. The AUTOBOX heuristic that is in place
checks the series for homogeneity and modifies the model if
it finds any such changes in structure. The point is that it
is necessary for the mean of the residuals to be close enough
to zero so that it can be assumed to be zero for all intents
and purposes. That requirement is necessary but it is not
sufficient. The mean of the errors (residuals) must be near
zero for all time slices or sections. This is a more
stringent requirement for model adequacy and is at the heart
of intervention detection. Note that some inferior
forecasting programs use standardized residuals as the
vehicle for identifying outliers. This is inadequate when
the ARIMA model is non-null. Consider the case where the
observed series exhibits a change in level at a particular
point in time.
If you try to identify outliers or interventions in this
series via classical standardized residuals you get one
outlier or one unusual value. The problem is that if you
"fix" the bad observation at the identified time point, the
subsequent value is identified as an outlier due to the
recursive process. The simple-minded approach of utilizing
standardized residuals is in effect identification of
innovative outliers and not additive outliers.
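A small numerical illustration (hypothetical, using numpy) of why standardized residuals mislead here: a permanent level shift, viewed through a naive differencing (random-walk) model, leaves exactly one large residual behind, so the persistent shift is flagged as a single innovational spike rather than as the structural change it is.

```python
import numpy as np

# a pure level shift: the mean jumps from 0 to 5 at t = 50
y = np.concatenate([np.zeros(50), np.full(50, 5.0)])

# residuals of a naive random-walk model (first differences):
# the persistent shift collapses into ONE large residual
resid = np.diff(y)
z = (resid - resid.mean()) / resid.std()
flagged = np.flatnonzero(np.abs(z) > 3.0)
```

Only one time point is flagged, even though fifty observations sit at the shifted level.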
The logic behind the automatic intervention procedure
has its roots in the technique proposed by Chang and Tiao
(1983) and programmed by Bell (1983). It starts by
developing an ARIMA model for the univariate time series
(using the automatic ARIMA algorithm). A series of
regressions on the residuals from the ARIMA model checks for
any underlying changes in structure. If the series is found
to be homogeneous, then the ARIMA model is used to forecast.
If the series is found to be nonhomogeneous, then the various
changes in structure are represented in a transfer function
model by dummy (intervention) input variables and the ARIMA
model becomes the tentative noise model. The program then
estimates the transfer function-noise model and performs all
of the diagnostic checks for sufficiency, necessity and
invertibility. The model is updated as needed, and the
diagnostic checking stage ends when all of the criteria for
an acceptable model are met. The final step is to generate
the forecast values. The user controls the level of detail
that the output report is to contain, as well as some key
options for modeling precision (lambda search and
backcasting, for example). The user can also elect to have
this process start with an examination of the original time
series. This may be necessary for those cases where the
series is overwhelmingly influenced by outlier variables.
We now present a summary of the mathematical properties
underlying this procedure. This is taken from the Downing
and McLaughlin (1986) paper (with permission!). For purposes
of this discussion, we present, in their notation, the
following equation, which is the general ARIMA model:
    P(B) ( N(t) - MEAN ) = CONSTANT + T(B) A(t)    (eq. 1)
where N(t) = the discrete time series, MEAN = the average of
the time series, P(B) = the autoregressive factor(s),
CONSTANT = the deterministic trend, T(B) = the moving
average factor(s), A(t) = the noise series, and B = the
backshift operator.
Outliers can occur in many ways. They may be the result
of a gross error, for example, a recording or transcription
error. They may also occur by the effect of some exogenous
intervention. These can be described by two different, but
related, generating models discussed by Chang and Tiao (1983)
and by Tsay (1986). They are termed the innovational outlier
(IO) and additive outlier (AO) models. An additive outlier
can be defined as
    Y(t) = N(t) + W E(to)                          (eq. 2)
while an innovational outlier is defined as
    Y(t) = N(t) + [T(B)/P(B)] W E(to)              (eq. 3)
where Y(t) = the observed time series, W = the magnitude of
the outlier, and E(to) = 1 if t = to, 0 if t <> to;
that is, E(to) is a time indicator signifying the time of
occurrence to of the outlier, and N(t) is an unobservable
outlier-free time series that follows the model given by
(eq. 1). Expressing Equation (eq. 2) in terms of the white
noise series A(t) in Equation (eq. 1), we find that for the
AO model
    Y(t) = [T(B)/P(B)] A(t) + W E(to)              (eq. 4)
while for the IO model
    Y(t) = [T(B)/P(B)] [ A(t) + W E(to) ] .        (eq. 5)
Equation (eq. 4) indicates that the additive outlier appears
as simply a level change in the to-th observation and is
described as a "gross error" model by Tiao (1985). The
innovational outlier represents an extraordinary shock at
time period to, since it influences the observations Y(to),
Y(to+1), ... through the memory of the system described by
T(B)/P(B).
The reader should note that the residual outlier
analysis as conducted in the course of diagnostic checking is
an AO type. Also note that AO and IO models are relatable.
In other words, a single IO model is equivalent to a
potentially infinite AO model and vice versa. To demonstrate
this, we expand equation (eq.5) to
Y(t) = [T(B)/P(B)] A(t) + [T(B)/P(B)] W E(to) , (eq. 6)
and then express (eq. 6) in terms of (eq. 4)
Y(t) = [T(B)/P(B)] A(t) + WW E(to) , (eq. 7)
where WW = [T(B)/P(B)] W .
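This equivalence can be checked numerically for an AR(1) noise model, where T(B)/P(B) = 1/(1 - PHI1 B) expands into the weights PHI1**k, so a single IO of size W at to equals AO pulses of size W*PHI1**k at to+k (a sketch; the parameter values are arbitrary):

```python
import numpy as np

phi, W, n, t0 = 0.6, 10.0, 30, 5

# IO: a single shock of size W at t0 fed through 1/(1 - phi B)
io = np.zeros(n)
for t in range(n):
    shock = W if t == t0 else 0.0
    io[t] = (phi * io[t - 1] if t > 0 else 0.0) + shock

# equivalent AO representation: pulses of size W * phi**k at t0 + k
ao = np.zeros(n)
for k in range(t0, n):
    ao[k] = W * phi ** (k - t0)
```

The two series agree point by point: the system's memory turns one innovational shock into an infinite train of additive pulses.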
Due to estimation considerations, the following
discussion will be concerned with the additive outlier case
only. Those interested in the estimation, testing, and
subsequent adjustment for innovative outliers should read
Tsay (1986). Note that while the above models indicate a
single outlier, in practice several outliers may be present.
The estimate of the AO can be obtained by forming the
pi-weights
    II(B) = P(B)/T(B)                              (eq. 8)
and calculating the residuals E(t) by
    E(t) = II(B) Y(t)                              (eq. 9)
         = II(B) [ [T(B)/P(B)] A(t) + W E(to) ]
         = A(t) + W II(B) E(to) .
By least squares theory, the magnitude W of the additive
outlier can be estimated by
    EST W(to) = RHO*RHO II(F) E(to)                (eq. 10)
where F is the forward-shift operator and RHO*RHO is the
reciprocal of the sum of the squared pi-weights. The
variance of EST W(to) is given by:
    Var( EST W(to) ) = RHO*RHO var(A)              (eq. 11)
where var(A) is the variance of the white noise process
A(t).
Based on the above results, Chang and Tiao (1983)
proposed the following test statistic for outlier detection:
    LAMBDA(to) = EST W(to) / [ RHO sqrt(var(A)) ]  (eq. 12)
If the null hypothesis of no outlier is true, then
LAMBDA(to) has the standard normal distribution. Usually, in
practice, the true parameters II(B) and var(A) are unknown,
but consistent estimates exist. Even more important is the
fact that to, the time of the outlier, is unknown, but every
time point may be checked. In this case one uses the
statistic
    LAMBDA = max over to of | LAMBDA(to) |,
             to = 1, ..., n                        (eq. 13)
and declares an outlier at time to if the maximum occurs at
to and is greater than some critical value C. Chang and Tiao
(1983) suggest values of 3.0, 3.5 and 4.0 for C.
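In the simplest possible case, where the noise is white so that the pi-weights II(B) reduce to 1, EST W(to) is just the residual at to and the scan over (eq. 12) and (eq. 13) collapses to a search over standardized residuals. A hypothetical numpy sketch of that degenerate case:

```python
import numpy as np

def ao_scan(resid, C=3.5):
    """Chang-Tiao style scan in the white-noise case, where the
    pi-weights reduce to 1, so EST W(to) = E(to) and LAMBDA(to)
    is just the standardized residual."""
    sigma = resid.std()
    lam = resid / sigma                  # LAMBDA(to), eq. 12
    t0 = int(np.argmax(np.abs(lam)))     # eq. 13: max over all to
    return (t0, lam[t0]) if abs(lam[t0]) > C else (None, 0.0)

rng = np.random.default_rng(1)
e = rng.normal(size=200)                 # white noise residuals
e[120] += 8.0                            # inject an additive outlier
t_hat, lam = ao_scan(e)
```

With a non-null ARIMA model the pi-weights are not 1, and the full (eq. 10) weighting is what separates this test from the naive standardized-residual approach criticized earlier.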
The outlier model given by Equation (eq. 4) indicates a
pulse change in the series at time to. A step change can
also be modeled simply by replacing E(to) with S(to), where:
    S(to) = 1 if t >= to                           (eq. 14)
          = 0 otherwise .
We note that (1-B) S(to) = E(to). Using S(to), one can apply
least squares to estimate the step change and perform the
same tests of hypothesis reflected in Equations (eq. 12) and
(eq. 13). In this way, significant pulse and/or step changes
in the time series can be detected.
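A quick check of the step-dummy machinery (a sketch with arbitrary values, using the convention that S(to) = 1 for t >= to, under which (1-B) S(to) = E(to) holds exactly):

```python
import numpy as np

n, t0 = 40, 25
S = (np.arange(n) >= t0).astype(float)  # step dummy S(to), eq. 14

# differencing the step recovers the pulse: (1-B) S(to) = E(to)
pulse = np.diff(S, prepend=0.0)

# least-squares estimate of a step change of size 3 in noisy data
rng = np.random.default_rng(2)
y = 3.0 * S + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), S])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta[1] ~ 3.0
```

The same regression framework thus estimates pulses and steps interchangeably; only the dummy variable changes.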
A straightforward extension of this approach to transfer
functions has also been introduced in this version of
AUTOBOX. This, of course, implies that the outliers or
interventions are identified not only on the basis of the
noise filter but also on the form and nature of the
individual transfer functions.
12.0 QUESTION : HOW DOES AUTOBOX REPORT ITS RESULTS ? CAN A USER SELECT
THE AMOUNT OF DETAIL ?
Yes ! We show here an extraction from the AFS document
describing AUTOBOX reporting options.
OUTPUT OPTIONS: |
Line 109
DISPLAY IDENTIFICATION INFORMATION | 1
If no detail is required in the initial identification process, choose "0". Autocorrelation is a measure of the unconditional dependence that exists between observations in a time series that are separated by a particular time interval, called the lag. The value of the autocorrelation lies between +1 and -1; the closer it is to +1 or -1, the more highly correlated are the observations separated by the lag being considered. In summary, the autocorrelation measures the unconditional relationship between lags. Partial autocorrelation is a measure of the conditional dependence that exists between observations in a time series that are separated by a particular time interval. The value of the partial autocorrelation lies between +1 and -1 and is evaluated just like the ACF. In summary, the partial autocorrelation measures the conditional correlation between lags. Cross correlation is a measure of the dependence that exists between observations in two time series that are separated by a particular time interval, called the lag. The value of the cross correlation lies between +1 and -1; the closer it is to +1 or -1, the more highly correlated are the observations separated by the lag being considered. A value closer to +1 indicates a positive correlation; a value closer to -1, a negative correlation. In summary, the cross correlation measures the strength of the relationship between the lags of two time series.
Line 110
DISPLAY TIME SERIES GRAPH | 1
Choose "1" so that Autobox will provide a text plot of the series.
Line 111
DISPLAY ACF TABLE DURING IDENTIFICATION | 1
Choose "1" to see a table of the correlations at the initial identification stage. The table displays rows of correlations and their standard errors.
Line 112
DISPLAY ACF GRAPH DURING IDENTIFICATION | 1
Choose "1" to see a plot of the correlations at the INITIAL identification stage.
Line 113
DISPLAY CCF TABLE DURING IDENTIFICATION | 1
Choose "1" to see the cross-correlations between the prewhitened input and the prewhitened output series. This information is a statistical 'tool' used to identify the form of a transfer model. This option allows you to control whether or not they get reported in a table.
Line 114
DISPLAY CCF GRAPH DURING IDENTIFICATION | 1
Choose "1" to see the cross-correlations between the prewhitened input and the prewhitened output series. This information is a statistical 'tool' used to identify the form of a transfer model. This option allows you to control whether or not they get reported in a plot.
Line 115
DISPLAY PREWHITENING MODEL(S) | 1
Choose "1" to see the model form as a table.
Line 116
DISPLAY PREWHITENING MODEL(S) EQUATION | 1
Choose "1" if you want the program to display the model(s) in the form of an equation.
Line 117
DISPLAY IDENTIFIED MODEL | 1
Choose "1" if you wish to see the model form as a table.
Line 118
DISPLAY IDENTIFIED MODEL EQUATION | 1
Choose "1" if you want the program to display the model(s) in the form of an equation.
Line 119
DISPLAY ESTIMATION INFORMATION | 1
If this is set to "0" then no estimation information will be reported. This means that lines 123-128 would be skipped.
Line 120
DISPLAY ESTIMATED MODEL PARAMETERS | 1
Choose "1" if you wish to see the model form as a table.
Line 121
DISPLAY ESTIMATED MODEL EQUATION | 1
Choose "1" if you want the program to display the model(s) in the form of an equation.
Line 122
DISPLAY PARAMETER CORRELATION MATRIX | 1
Choose "1" if you wish to see the Parameter correlation matrix.
Line 123
DISPLAY TRANSFORMED FIT VS ACTUAL TABLE | 1
Your forecasting model may contain a transformation parameter (lambda). Choose "1" so that the program generates fit values and errors in the transformed metric. You may select to have the program display a chart which shows the fit values, the residual values and the actual values from the estimated model.
Line 124
DISPLAY UNTRANSFORMED FIT VS ACTUAL TABLE | 1
Choose "1" so that the program displays a chart which shows the fit values, the residual values and the actual values from the estimated model.
Line 125
DISPLAY DIAGNOSTIC CHECKING INFORMATION | 1
Choose "1" to show detail regarding the diagnostic checking process.
Line 126
DISPLAY RESIDUAL ACF TABLE | 1
Choose "1" to see a table of the residual correlations each time they are computed. The table displays rows of correlations and their standard errors.
Line 127
DISPLAY RESIDUAL ACF GRAPH | 1
Choose "1" to see a plot of the residual correlations each time they are computed.
Line 128
DISPLAY RESIDUAL CCF TABLE | 1
Choose "1" to have Autobox display the cross-correlations between the prewhitened input and the residuals from the current model. These correlations are the statistical 'tool' used to identify the form of any fixup required to the transfer model. This option controls whether they get reported in a table.
Line 129
DISPLAY RESIDUAL CCF GRAPH | 1
The cross-correlations between the prewhitened input and the residuals from the current model are the statistical 'tool' used to identify the form of a fixup required to the transfer model. This option allows you to control whether or not they get reported in a plot.
Line 130
DISPLAY NECESSITY TEST RESULTS | 1
Choose "1" to see the necessity test results. ARIMA modeling may be deficient when the model has too many coefficients. It is important to discard or delete unnecessary structure as it inflates forecast variances, among other things.
Line 131
DISPLAY INVERTIBILITY TEST RESULTS | 1
Choose "1" to see the invertibility test results. ARIMA modeling may be deficient when the model has a non-invertible structure. It is important to discard or delete this structure either by replacing it with differencing or by model restatement.
Line 132
DISPLAY SUFFICIENCY TEST RESULTS | 1
Choose "1" to see the sufficiency test results. ARIMA modeling may be deficient when the MODEL does not have enough structure. The omitted structure can be identified by studying the sample ACF AND PACF of the residuals. In this way we move structure from the residuals to the model.
Line 133
DISPLAY VARIANCE STABILITY TEST RESULTS | 0
Choose "1" to see the variance stability results. ARIMA modeling may be deficient when the series has a non-constant variance. The program will test the residuals from the ARIMA model for possible change points. Essentially interventions are changes in the mean level of the errors while variance stability measures changes in the variance.
Line 134
DISPLAY GRAPH OF RESIDUALS AT EACH STAGE | 0
Choose "1" to see displays of the steps in the variance stability test.
Line 135
DISPLAY WEIGHTS FOR STABILIZING THE VARIANCE | 0
Choose "1" to enable this function. The vector of weights is reported based upon the variance stability test. These weights represent the "degree of belief" that one has in each reading or observation. They are relative to each other and provide a way to utilize observations that may have been recorded with different precision.
This function will enable the printing of a separate table of these weights. In addition a disk file called Weights.Out will be prepared. This output file can then be renamed to Weights.In if the user wishes to re-use them in a later session.
Line 136
DISPLAY OUTLIER TEST RESULTS | 1
Choose "1" to enable this function. ARIMA modeling may be deficient when the series has been subject to interventions. This program will test the residuals from the ARIMA model for possible outlier (intervention) variables. We suggest that you modify either your model or your time series for any outlier variables that may be found. If you have enabled automatic fixup for outliers in the choose-analysis-options section, then these modifications will be done for you automatically. Choosing "1" also shows the details of this process.
Line 137
DISPLAY CONSTANCY TEST | 0
Choose "1" to see the constancy test results. You get a table showing the observations and those values with a significant change in the reliability of the model parameters.
Line 138
DISPLAY FORECASTING INFORMATION | 1
If this is set to "0" then no forecasting information will be reported. This means that lines 143-150 would be skipped.
Line 139
DISPLAY MODEL STATISTICS | 1
Choose "1" to see the model statistics (R squared, etc.)
Line 140
DISPLAY FORECAST MODEL PARAMETERS | 1
Choose "1" to see the model form as a table.
Line 141
DISPLAY FORECAST MODEL EQUATION | 1
Choose "1" if you want the program to display the model(s) in the form of an equation.
Line 142
DISPLAY MODEL IN ITS AUTOREGRESSIVE FORM | 0
Choose "1" to see the model displayed as a re-stated pure right-hand side equations (i.e. a distributed lag model). This is useful for model interpretation.
Line 143
DISPLAY TABLE OF TRANSFORMED FORECAST VALUES | 1
Choose "1" to see the forecast in transformed units. Your forecasting model may contain a transformation parameter (lambda). If so, then the program generates forecast values for both the original data and the transformed data.
Line 144
DISPLAY TABLE OF FORECAST VALUES | 1
Choose "1" to see the forecasts with their confidence bounds. Your forecasting model may contain a transformation parameter (lambda). If so, then the program generates forecast values for both the original data and the transformed data.
Line 145
DISPLAY THE INPUT SERIES FORECAST VALUES | 0
Choose "1" to see the values of the input (if any) series. This option is only valid for Transfer Functions.
Line 146
DISPLAY GRAPH OF ACTUAL AND FORECAST VALUES | 1
Choose "1" to get a text plot of the forecasts and the actuals.
Line 147
DISPLAY SIMULATED DATA | 0
Choose "1" to display the simulated series.
Line 148
STORE MODEL FORM | 0
Choose "1" to save the model. By saving the model form, you can retrieve it later in order to make a forecast. Some prefer not to remodel after every new observation due to system limitations, but we recommend remodeling at every new data point to capture changes in the process immediately. You need to have a "1" on line 155 for this option to work.
Line 149
OUTLIER SERIES (I~) | 1
Choose "1" to save the outlier series. By saving the outliers to a model, you can retrieve it along with the model form later in order to make a forecast. Some prefer not to remodel after every new observation due to system limitations, but we recommend remodeling at every new data point to capture changes in the process immediately. You need to have a "1" on line 154 for this option to work.
Line 150
RESIDUAL SERIES(MODEL) (_R) | 0
Choose "1" to save the residuals from the modeling process. These values are helpful in understanding how well the model "fit" the data.
Line 151
ESTIMATED/FIT SERIES (_E) | 0
Choose "1" to save the "fit" values from the modeling process. These values are the model's attempt to match the actual observations.
Line 152
FORECAST SERIES (_F) | 1
Choose "1" to save the forecast values from the modeling process. These values are what the model expects future values to be.
Line 153
FORECAST SERIES (_L,_U) | 0
Choose "1" to save the confidence bounds around the forecast values from the modeling process. These values are what the model expects the best and worst case scenarios of the future values to be.
Line 154
MOD & DIFF SERIES (_M & _D) | 0
Choose "1" to save the modified and difference series. The modified series is the original time series that has been cleansed for outliers. The difference series is the original actual observations subtracted by the modified series. The difference series shows the net effect of the outliers on the data.
Line 155
DISPLAY MANAGEMENT ANALYSIS | 0
Choose "1" if you want a report that summarizes, in plain English, information about the time series from the model used to fit the data.