forecasting and time series analysis

QUESTIONS/ANSWERS:

Subject: Forecasting (TS and input variables)
Date: Wed, 12 Nov 1997 21:28:46 -0800

Hello sci.stat.math and sci.op-research readers! I am a young OR analyst. I have a few questions regarding forecasting. Most of them are related to input variables.

Please reply by e-mail (maryse.turcotte@sympatico.ca) and I will summarize to the groups! Thank you very much!

Questions: 1. What tests can one perform to determine whether it is appropriate to use an input variable? Is a test of correlation between the two time series sufficient, or are there tests designed specifically to decide whether to include the additional information? (I have tried fitting with and without an input variable and comparing the resulting RMSE, ... on hold-out samples, because adding an input variable always seemed to improve the fit!)

AFS RESPONSE:

If you omit a needed stochastic input series, the noise or error from the under-specified model will incorporate or reflect that omission. This aspect of ARIMA structures is well known. If you omit a needed deterministic series, for example a known intervention, the mean of the under-specified error process will be affected; the variance of the untreated errors will therefore be larger than necessary, leading to a downward bias in test statistics. An example of an omitted stochastic series: if you omit the earnings of a company when predicting its stock price, the history of the stock price becomes "important", because previous earnings have affected previous stock prices and so the effect of earnings has already been incorporated through the history of the stock price. Another example: if you omit temperature data from a model designed to predict monthly beer sales, you will identify a seasonal ARIMA structure. This structure disappears when you incorporate temperature into the model as an explicit input series. Thus the ARIMA structure is clearly seen as a proxy for the omitted temperature series.
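The beer-sales example can be illustrated with simulated data. The sketch below is only an assumption-laden toy: the series, coefficients and noise levels are invented, and a plain least-squares regression stands in for a full transfer function model. It shows the lag-12 autocorrelation of the errors being large when temperature is omitted and close to zero once temperature enters as an input.

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation of x at a single positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Simulated (hypothetical) monthly data: 20 years of temperature and sales.
rng = np.random.default_rng(0)
n = 240
month = np.arange(n)
temperature = 15 + 10 * np.sin(2 * np.pi * month / 12) + rng.normal(0, 1, n)
sales = 100 + 3 * temperature + rng.normal(0, 2, n)

# Under-specified model: constant only, so the errors carry the seasonality.
resid_without = sales - sales.mean()

# Model with temperature as an explicit input (ordinary least squares).
X = np.column_stack([np.ones(n), temperature])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
resid_with = sales - X @ coef

print("lag-12 ACF without temperature:", round(acf(resid_without, 12), 3))
print("lag-12 ACF with temperature:   ", round(acf(resid_with, 12), 3))
```

In the first model a seasonal ARIMA term would be needed to soak up the lag-12 structure; in the second there is little seasonal structure left to model.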

This aspect of ARIMA structure was pointed out to me by Fernandez in an AER article in 1977. The test for correlation, as you put it, is potentially flawed by:

1. autocorrelation within the input series itself

2. the effect of pulses, level shifts, seasonal pulses and local time trends on either the input or the output series.

These effects can be identified and treated using residual diagnostic checking or INTERVENTION DETECTION procedures. Yule warned in 1926 about "why we sometimes get nonsense correlation between two time series". Fama erred in the other direction by not accounting for unusual values in his "proof" that the stock market was a random walk.

Whether you use ARIMA structure or the actual X variable, adding terms can only improve the fit or the R-Squared. This is a consequence of the error minimization process. Whether or not the improvement is statistically significant can be judged via the likelihood ratio test (F or t). The real question is whether or not this actually improves the prediction. Care should be taken to evaluate forecast errors from a number of different origins and for a number of different lead times.

The correct procedure to identify the nature and form of a stochastic input is to pre-whiten the input series, pass the same filter over the Y series, and compute cross-correlations of these two proxies. This is done for one and only one reason: to IDENTIFY the appropriate model structure. The literature on TRANSFER FUNCTIONS is appropriate in this regard. Note that this is a tentative identification and may be flawed by outliers or incorrect model identification. It is necessary to estimate the model parameters simultaneously and to examine the residuals for:

a. any autocorrelative structure

b. whether or not the residuals can be predicted or modeled by omitted lags in the stochastic series

c. unusual values in the mean of the errors (INTERVENTION DETECTION) or the variance of the errors (NON-CONSTANT VARIANCE).

Note that differencing is included ONLY to identify and may or may not be necessary in the actual transfer function. Early statisticians often detrended or differenced data prior to computing cross-correlations. These up-front filters are of course subsets of extended ARIMA structures and may be counter-productive. The form and nature of the correct filter can be identified from the data itself.
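The pre-whitening step can be sketched as follows. This is a minimal illustration under stated assumptions, not a full identification procedure: the input is simulated as an AR(1) series, the output is made to depend on the input three periods earlier, and the helper names (`fit_ar1`, `ar1_filter`, `cross_corr`) are hypothetical. A real series may need a richer ARIMA filter and prior treatment of outliers.

```python
import numpy as np

def fit_ar1(x):
    """Least-squares estimate of the AR(1) coefficient of a series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1]))

def ar1_filter(series, phi):
    """Apply the filter (1 - phi*B) to a mean-centred series."""
    s = np.asarray(series, dtype=float) - np.mean(series)
    return s[1:] - phi * s[:-1]

def cross_corr(a, b, lag):
    """Sample correlation between a[t - lag] and b[t], for lag >= 0."""
    a = a - a.mean()
    b = b - b.mean()
    n = len(a)
    return float(np.dot(a[:n - lag], b[lag:]) / (n * a.std() * b.std()))

# Simulated data (assumption): X is AR(1), Y responds to X three periods later.
rng = np.random.default_rng(42)
n, true_lag = 500, 3
shocks = rng.normal(0, 1, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + shocks[t]
y = rng.normal(0, 1, n)
y[true_lag:] += 2.0 * x[:-true_lag]

# Pre-whiten X, pass the SAME filter over Y, then cross-correlate.
phi_hat = fit_ar1(x)
alpha = ar1_filter(x, phi_hat)   # pre-whitened input
beta = ar1_filter(y, phi_hat)    # identically filtered output
ccf = [cross_corr(alpha, beta, k) for k in range(9)]
best_lag = int(np.argmax(np.abs(ccf)))
print("estimated AR(1) coefficient:", round(phi_hat, 3))
print("lag with largest cross-correlation:", best_lag)
```

The cross-correlations are used only to suggest which lags of X to try; the tentative model still has to be estimated and its residuals checked.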

Questions: 2. Some analyses can be performed prior to choosing a time series forecasting model. I'm thinking about autocorrelations, partial autocorrelations, ... In your opinion, should one stick to the model prescribed by the results of these analyses even though some other models seem to perform better? Is it wise to choose the most easily implementable model from a subset of models that seem to perform a little better than the others when one is not sure which one is the most suitable?

AFS RESPONSE: The techniques of autocorrelations, partial autocorrelations and cross-correlations are all useful, but they are estimated by error minimization procedures. These tools can be, and often are, flawed by anomalies. Robust identification procedures, such as the one for pulses described by Masarotto, are often useful. What is even more useful is INTERVENTION DETECTION procedures together with model diagnostic checking for necessity and sufficiency. One has to identify a model, sometimes based on priors, then estimate that model and evaluate its ability to create a Gaussian white noise error process, which means an error term that has a mean of zero everywhere, a variance that is constant, and no autocorrelation. Part and parcel of this is to test the constancy of the parameters over the fitting period. In summary, the modeler uses sample ACFs and CCFs to identify a model, then tests the estimated parameters for significance and invariance, and makes sure that the error process cannot be predicted by any known information such as lags or leads in the input series or lags in the noise process.
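One common way to check whether an error process is still predictable from its own lags is a portmanteau statistic on the residual autocorrelations. The sketch below computes the Ljung-Box Q statistic by hand on two simulated residual series; the statistic is a technique chosen here for illustration, not one named in the response, and the data and the function name `ljung_box_q` are assumptions. The critical value is the 95% point of a chi-square distribution with 10 degrees of freedom; when residuals come from a fitted ARMA model, the degrees of freedom should be reduced by the number of estimated ARMA parameters.

```python
import numpy as np

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    n = len(r)
    denom = np.dot(r, r)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.dot(r[:-k], r[k:]) / denom   # lag-k autocorrelation
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

CHI2_95_DF10 = 18.307   # 95% quantile of chi-square with 10 degrees of freedom

# Simulated residuals (assumption): one white, one with leftover AR(1) structure.
rng = np.random.default_rng(7)
n = 400
white = rng.normal(0, 1, n)
leftover = np.zeros(n)
for t in range(1, n):
    leftover[t] = 0.6 * leftover[t - 1] + white[t]

q_white = ljung_box_q(white, 10)
q_leftover = ljung_box_q(leftover, 10)
print("Q for white residuals:     ", round(q_white, 2))
print("Q for structured residuals:", round(q_leftover, 2))
print("structured residuals rejected:", q_leftover > CHI2_95_DF10)
```

A large Q says the model is not yet sufficient. It says nothing about pulses, level shifts or changing variance, which need their own checks.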

Questions: 3. Aside from RMSE and R^2, are there other statistics that a forecaster should consider important?

AFS RESPONSE: The error process should be unpredictable using either its own history or the values in the X series. As a single statistic, the AIC is just another weighted variance, although a widely popular one, and is judged to be of import. I would examine closely the forecast errors for different lead times from different origins to assess expected performance. Unfortunately, it is not totally clear whether one should use BIAS, VARIANCE or RMSE to assess the expected performance. The answer depends on the loss function that you have.
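Evaluating forecast errors from several origins and lead times can be sketched as below. This is an illustrative assumption rather than a prescribed method: the series is a simulated random walk, the forecaster is a naive "last value" rule, and the function names are hypothetical. Any forecaster with the same signature could be substituted.

```python
import numpy as np

def rolling_origin_errors(y, forecaster, origins, max_lead):
    """Forecast errors, one row per origin and one column per lead time."""
    errors = np.empty((len(origins), max_lead))
    for i, origin in enumerate(origins):
        history = y[:origin]                       # data known at the origin
        forecasts = forecaster(history, max_lead)  # leads 1..max_lead
        errors[i] = y[origin:origin + max_lead] - forecasts
    return errors

def summarize(errors):
    """Bias, variance and RMSE of the errors at each lead time."""
    bias = errors.mean(axis=0)
    variance = errors.var(axis=0)
    rmse = np.sqrt((errors ** 2).mean(axis=0))
    return bias, variance, rmse

def naive_forecaster(history, max_lead):
    """Hypothetical baseline: repeat the last observed value."""
    return np.full(max_lead, history[-1])

# Simulated random walk (assumption), evaluated from many origins.
rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(0, 1, 300))
max_lead = 6
origins = list(range(100, len(y) - max_lead + 1))

errors = rolling_origin_errors(y, naive_forecaster, origins, max_lead)
bias, variance, rmse = summarize(errors)
for lead in range(max_lead):
    print(f"lead {lead + 1}: bias={bias[lead]:+.3f} "
          f"variance={variance[lead]:.3f} rmse={rmse[lead]:.3f}")
```

Since squared RMSE equals squared bias plus variance at each lead time, which of the three to weigh most is a question about the loss function, not about the arithmetic.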

Questions: 4. My understanding of an input variable is that, knowing the value of a variable, we can use that information to improve the accuracy of our forecast. If I have to forecast the value of my input variable (I don't know it in advance, just like the value I'm trying to forecast), is it still appropriate to use it? I guess it is, but I'm afraid that it won't be as efficient...

AFS RESPONSE: One often has to predict the input series in order to predict the output series. Good statistical packages incorporate the uncertainty in the predictor variables when estimating the uncertainty in the forecast of the output series.
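How input uncertainty feeds into output uncertainty can be seen in a deliberately simplified case. The sketch assumes a one-coefficient model y = beta * x + noise, a known coefficient, and an input forecast error that is independent of the noise; all numbers are invented. Under those assumptions the forecast error variance of the output is beta squared times the input forecast variance, plus the noise variance. Real packages also account for parameter uncertainty and lag structure, which this sketch ignores.

```python
import numpy as np

def forecast_variance(beta, input_forecast_var, noise_var):
    """Output forecast error variance for y = beta * x + noise,
    assuming a known beta and independent input and noise errors."""
    return beta ** 2 * input_forecast_var + noise_var

# Hypothetical values chosen only for illustration.
beta, noise_var, input_forecast_var = 2.0, 1.0, 0.5

var_known_x = forecast_variance(beta, 0.0, noise_var)
var_predicted_x = forecast_variance(beta, input_forecast_var, noise_var)

# Monte Carlo check of the analytic formula.
rng = np.random.default_rng(1)
draws = 200_000
x_error = rng.normal(0, np.sqrt(input_forecast_var), draws)
noise = rng.normal(0, np.sqrt(noise_var), draws)
simulated_var = float(np.var(beta * x_error + noise))

print("variance with X known:    ", var_known_x)
print("variance with X predicted:", var_predicted_x)
print("simulated variance:       ", round(simulated_var, 3))
```

The forecast intervals widen when the input must itself be forecast, which is the loss of efficiency the question anticipates; the input can still be worth using if it explains enough of the output.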

Questions: 5. How do we select the lag for the input variable? Is the answer the same as for question #1, but with lags?

AFS RESPONSE: The selection of the lags (initial selection) is done via cross-correlations of the suitably