====================================================================

This document is a publication of the Center for Statistical and

Mathematical Computing, Indiana University, Bloomington, Copyright

1994. Please credit IU, the Stat/Math Center, and the author when

referring to and/or copying this publication for your own purposes.

====================================================================

1. Introduction.

This document is intended to demonstrate how to analyze Autoregressive Integrated Moving Average (ARIMA) time series models with the statistical software widely available at IU. For the students or researchers who intend to analyze ARIMA models, this document shows how to proceed with these software packages for identification, estimation, diagnosis and forecasting.

Among the four software packages introduced in this document, SPSS, SAS and SYSTAT are the most popular statistical software for general purpose user. Readers who are not familiar with SAS and SPSS can obtain introductory documents on them at the UCS Support Center or StatMath Office. RATS was originally developed for time series analysis alone and provides a variety of advanced options not found in the general-purpose statistical software. It also has the most convenient graphical interface among the software reviewed here. For those who are not familiar with RATS, an introduction is provided in the appendix. The readers of this document are assumed to have a basic knowledge of DOS and the other operating systems. For those who are not familiar with these operating systems, the introductory document for VMS and UNIX are also available at UCS Support Center. Knowledge of time series is required, however, because this document does not provide any explanation about the technical terms used.

Currently, UCS provides RATS in DOS Public Computing Facilities (PCFs). Among timesharing systems, RATS is available on AIX (IBM RS/6000), and SPSS/TREND on three timesharing systems (VMS, Ultrix and AIX). SYSTAT is installed both in DOS and Mac PCFs. SAS is installed on DOS, VMS, Ultrix and AIX. Since the mainframe does not support high-resolution graphics for most of workstations available at public sites, the interactive usage of graphics with SAS and SPSS is limited compared with RATS and SYSTAT. However, one can produce a high-resolution graphic (postscript file) using SAS and SPSS on the mainframe.

This document is organized as follows. In the first section, an example of a time series is provided. The time series used in this document is the number of airline passengers per month from January 1949 to December 1960. This example is widely used in the time series literature because of its usefulness in discussing the issues such as stationarity, seasonal fluctuation and differencing. Since the graphical examination of a time series is the best way to start analyzing the series, the second section shows the series graphically and discuss the strategy of estimating the parameters of the series. In the third section, we first analyze the series with RATS. In the fourth section, SAS is used to generate the same results. The analysis of the series by SPSS and SYSTAT follows in the fifth and sixth sections. An introduction to RATS is in the appendix.

2. Data Examination

The data used in the document is as follows:

112 118 132 129 121 135 148 148 136 119 104 118

115 126 141 135 125 149 170 170 158 133 114 140

145 150 178 163 172 178 199 199 184 162 146 166

171 180 193 181 183 218 230 242 209 191 172 194

196 196 236 235 229 243 264 272 237 211 180 201

204 188 235 227 234 264 302 293 259 229 203 229

242 233 267 269 270 315 364 347 312 274 237 278

284 277 317 313 318 374 413 405 355 306 271 306

315 301 356 348 355 422 465 467 404 347 305 336

340 318 362 348 363 435 491 505 404 359 310 337

360 342 406 396 420 472 548 559 463 407 362 405

417 391 419 461 472 535 622 606 508 461 390 432

This data is a single series. In order to save space, it is arranged so that the first 12 observations (year 1949) are in the first row, the second 12 observations are in the second row, etc. As will be discussed later, each program has slightly different way of reading in the data formatted in this way.

A plot of the data is shown below.

As can be seen from this plot, the series is not stationary ( a stationary series has a constant mean, a constant variance and no trend). Note that the variance increases for the recent observations. Therefore, it will be necessary to transform the series into log scale. Also note that the series shows a strong trend. Therefore, we will need differencing. Finally, note that the fluctuation becomes larger around December every year. To get rid of this seasonal fluctuation, we will also need seasonal differencing. This observation will be confirmed by formal method for identification in the following sections.

3. Analysis of time series with RATS

The first command for the time series analysis is to read a data file. The next four commands show how to read an external data file in RATS.

cal 1949 1 12

allocate 144

open data series-g.dat

data(org=obs) / x

Most software for time series analysis has a command for identifying the start date. The syntax for date command in RATS is "cal year period per year." Thus, by "cal 1949 1 12," RATS reads the first number (row1, col1) as the observation of "January 1, 1949," the second number (row1 col2) as the observation of "February 1, 1949," so on. The reason why the next data is the observation of "February 1, 1949" is that "12" indicates that the data is "monthly" data. "allocate" is for setting the default data series length. Since the length is 144, the last observation will be "December 1, 1960." The command "open data series-g.dat" directs RATS to read the data file named "series-g.dat" stored in the current directory. The data file "series-g.dat" is the text file containing the data shown in section 2. Finally, the command "data(org=obs) / x" is to let RATS know the structure of the data. "org=obs" means that the data is organized by observation. Since we have specified only one variable (x), RATS will read the first row and the first column and assign the value to the first observations of x, the first row and the second column and assign the value to the second observations of x and so on. "/" is for providing the names of the variables in the data. Although the variable in the data is named as x, you can name it as you wish. The following two commands are for plotting the series.

graph(header='Airline Passengers',vlabel='Passengers',$

hlabel='1949 - 1961') 1

# x

"graph" is the main command for that (some options for graph are inserted in the parenthesis but they are not essential). "$" is a continuation sign. "1" directs RATS to produce one graph. Finally, "# x" makes RATS produce the graph of the variable x, which was shown earlier in this document.

After you examine the series graphically, you will realize that the series shows increasing variance, which is a violation of the stationarity assumption needed for ARIMA modelling. A typical way to get rid of this problem is to transform the series into log scale. The following command is for that purpose.

set xlog = log(x)

"set" is a powerful data transformation command in RATS. we can perform a variety of data transformation with "set" in RATS. Use the above graphics command with x replaced by xlog now and you will find that the new series shows the constant variance.

Once the series shows a constant covariance, the next step is to identify the series. Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) will help us to identify the series. While it is possible to examine the numerical values of ACF and PACF to identify the series, a more convenient way to identify the series is to use graphics of ACF and PACF. The following commands are for producing graphs of ACF and PACF where the values greater than 2 standard deviation are shaded.

correlate(stderrs=stderrs,partial=partd) xlog1 / corrxd1

set signif = abs(corrxd1/stderrs) > 2.0

graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1

# corrxd1

set signif = abs(sqrt(%nobs)*partd) > 2.0

graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1

# partd

The first line of the above set of commands directs RATS to calculate ACF of xlog1 and name it corrxd1. It also calculates the standard error of ACF and PACF. It named the results stderrs and partd, respectively. The second line is to find ACF whose value is greater than two standard errors and name it signif. The third line is to graph ACF with those values greater than two standard errors shaded. The fourth and fifth lines graph the PACF (the asymptotic variance of partial autocorrelation is (1/number of observations). We already know from the plot of the series in section 2 that it has trend and seasonal fluctuation. This is evidenced from the plots of ACF and PACF of log(x): the autocorrelations are substantial and well outside two standard errors. There are two bulges in the ACF plot at lag=1 and lag=12, suggesting the nonseasonal (monthly) and seasonal (yearly) dependencies.

Lag 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1

0 | |********************|

1 | *******| . |

2 | . |** . |

3 | ****| . |

4 | . | . |

5 | . |* . |

6 | . |* . |

7 | . *| . |

8 | . | . |

9 | . |**** |

10 | . **| . |

11 | . |* . |

12 | ********| . |

13 | . |*** . |

14 | . *| . |

15 | . |*** . |

16 | . ***| . |

17 | . |* . |

18 | . | . |

xlog is now differenced in order to remove the trend. The last two commands are for plotting the results.

difference xlog / xlogd1

graph 1

# xlogd1

Since the differenced data still shows seasonal fluctuation (around December, the number of airline passengers is large due to many holidays in that period), we difference the series xlogd1 once again (sdiffs=1) with "span 12" and name it xlogd1s1.

difference(span=12,sdiffs=1) xlogd1 / xlogd1s1

The option "span=12" is not necessary because RATS understands that seasonal difference will be done with lag=12 in this monthly data. In case you might want other lags for some reason, you can define these explicitly. The series after all the necessary transformation can be plotted with the following commands.

graph 1

# xlogd1s1

This series should not show any evidence to violate the assumptions of stationarity. We can also produce ACF and PACF again using this series xlogd1s1. The procedure to plot ACF and PACF will be the same as the previous command for log(x) except that the variable xlog1 is replaced by xlogd1s1.

Since there are ACF spikes at lags 1 and 12, and the PACF shows decay at lag 1 and 12, it shows the evidence of both MA(1) and SMA(1). Therefore, we decide that the series is ARIMA(0, 1, 1)x(0, 1, 1). The following two commands show how to use a RATS module called @bjident, which is an another method of producing the graphs of ACF and PACF.

source(noecho) bjident.src

@bjident(trans=log, diffs=1, sdiffs=1) x

This is a nice feature in that it produces many combinations of ACF and PACF with a short command. However, it does not indicate the values greater than two standard deviation. These commands produce ACF and PACF for the model with the combination of difference and seasonal difference being 0x0, 0x1, 1x0 and 1x1. You will note that the fourth graph is the same as the one for xlogd1s1

Once the identification is done, the estimation of ARIMA model in RATS can be done by using the module "boxjenk."

boxjenk(span=12,diff=1,sdiff=1,MA=1,SMA=1) xlog / resids

The options used are for ARIMA (0,1,1)x(0,1,1) and should be obvious from the previous discussion. The suboption "resids" after "/" is for saving residuals to be used for diagnosis, which is shown below.

correlate(stderrs=stderrs,partial=partds) resids / corrresids

set signif = abs(corrresids/stderrs) > 2.0

graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1

# corrresids

set signif = abs(sqrt(%nobs)*partds) > 2.0

graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1

# partds

As before, these commands calculate ACF and PACF of residuals (resids) and plot them to see whether we do not reject the hypothesis that they are white noise. Note that ACF and PACF show the pattern of "almost" white noise. This means that the model ARIMA(0,1,1)x(0,1,1) is verified and we are now ready to apply this model to forecasting.

The module bjfore.src is a convenient tool for forecasting. Like the module @bjident and bjfore, this module provides a compact way to obtain forecasting. A nice feature of @bjfore is that it produces high-resolution plot of the forecasted values. It allows one to visually appreciate how the chosen model ARIMA(0,1,1)x(0,1,1) performs forecasting compared to any other arbitrary model (try diffs=2, for example, and see the forecasting!).

source(noecho) bjfore.src

@bjfore(trans=log, diffs=1, sdiffs=1, mas=1, sma=1) x 1961:1 1962:1 xfore

print / x xfore

graph 2

# x 1949:1 1960:12

# xfore 1961:1 1962:1

The following is the picture of the forecasting results. Note that the scale of the forecasting is consistent with the original data and that the forecast looks reasonable.

3. Analysis of time series with SAS/ETS

The Econometrics Time Series (ETS) for SAS can produce the identical results we obtained from the above section. To avoid repetition, the discussion about the characteristics of the series will be shorter and the emphasis will be on the different aspects of commands in SAS. The following commands are for reading in the series into SAS and creating a new variable for date (The commands for SAS are portable. So you can use this example at any platform).

data seriesg;

infile 'series-g.dat';

input x @@;

xlog=log(x);

year=int((_n_-1)/12)+1949;

month=mod(_n_-1,12)+1;

date=mdy(month,1,year);

drop year month;

format date monyy.;

run;

The first line indicates a data step and it is finished with the command "RUN;". The second line directs SAS to read the data file "series-g.dat" in the current directory. The third line indicates the variable in the data file. As before, we name it x. We have "@@" after x so that SAS read the data continuosly rather than stopping at the first blank. Without @@, SAS will go to the second line and read the first number and go to the third line, and so on. "@@" is needed when we want to create multiple observations form a single record (12 observations from a single record, in our case). The next five lines are for creating a new variable "date". SAS, unlike RATS, will not automatically associate dates with values of x, so we have to create date variables. As we know, the data we are using is a time series from Jan 1949 to Dec 1960. To create a date variable, we first use SAS internal variable _n_, which is the number of observation (1, 2, 3,...etc). The second command we use is INT, which represents integer. Thus, we create a new variable "year", whose values for first 12 observations are 0 + 1949. The values of the next 12 observations are 1 + 1949, and the last 12 observations are 11 + 1949. The third command we use for creating date variable is MOD, which is the remainder when the integer quotient of the first argument divided by the second argument is calculated. Thus, the value of month for the first observation is 0 + 1, 1+1 for the second observation and 1+1 again for the thirteenth observation, and so on. The fourth function we use is MDY, which returns a SAS internal value for a particular month, day, and year. The first value of date is -4017 and the second value is -3986, to which you need not pay attention too much. The SAS format command "format date monyy." makes SAS to print these internal date value in the form of Jan49, Feb49, and so on. Once we have this date variable, we don't need the variables year and month, so we drop them out. One final comment: If you don't like this procedure of creating date variable, you can simply use the command "date=_n_," which creates a variable "date" whose value is the number of the observation. The resulting date variable will not be "historical" but you can still plot the series with this and examine the trend. Since most time series analysis software always explicitly defines date variable, we took some extra steps. The following commands are for plotting the series x on VMS or UNIX mainframe platform.

proc plot data=seriesg;

plot x*date; run;

To create a Postscript file, which you can print out on the printers in PCF or to the printer WCCPS at the Wrubel Computing Center, the following group of command lines can be used.

filename graph1 'test.graph1';

goptions device=ps gsfmode=replace

gsfname=graph1 gsflen=132 nodisplay;

proc gplot;

plot x*date;

If you are interested in graphical interface, the current version of SAS will not be as good as RATS. For the purpose of identifying the series, however, high-resolution graphics is not essential because we mainly use ACF and PACF for identification and SAS is quite convenient as will be shown below. The following commands is for transforming x into logarithmic scale before starting estimation.

data seriesg;

set seriesg;

xlog=log(x);

run;

Identification, estimation, diagnosis, and forecasting are done in a single procedure PROC ARIMA in SAS.

proc arima data=seriesg;

identify var=xlog(1,12);

estimate q=(1)(12) noconstant method=cls;

forecast out=b lead=12 id=date interval=month;

run;

quit;

The first line is to invoke SAS procedure "arima" with data "seriesg" which was created in the previous stage. The second command is for identification. This command produces ACF, PACF and Ljung and Box's Q statistics for testing the null-hypothesis that the series is white noise. Since we know that xlog needs 1 differencing and 1 seasonal differencing, we indicate that in the parenthesis. Note again that the ACF drops to 0 after lag 1 and 12 while PACF tail off after these lags. The third command is for estimating the series with the model ARIMA(0,1,1)x(0,1,1). Since the series was already differenced in the previous command "identify," we only need to indicate MA(1) and SMA(1) in the estimation stage. SAS offers three algorithm for estimation; cls, uls and ml. For detail, refer to SAS-ETS manual. In this estimation, we use cls. Since the series is differenced, noconstant is specified. Once ETS estimates the series, it will also produce the ACF, PACF, and Ljung and Box's Q statistics for the residual. Examining these values should lead us not to reject the null hypothesis that the residual is white noise. Finally, the fourth line is for forecasting. The result will be stored in the data called "temp". As before, the forecast period is next 12 month (lead=12). In order to create date variable for the extra 12 month, the commands "id=date" and "interval=month" are used but these can be omitted. In the data set "temp", we will have the forecast value of xlog named "forecast." To obtain the forecast value in the scale of the original value of x, however, we need to perform transformation of the variable "forecast."

data temp1;

set temp;

forecast=exp(forecast+std*std/2);

run;

quit;

The first line of the above command is to create a new data set "temp1", which is based on data set "temp" created by forecast. The command "forecast=exp(forecast+std*std/2)" is to convert the forecast value into the original scale. The adjustment factor exp(std*std/2) is needed when ARIMA is applied to log transformed data (See SAS-ETS manual).

4. Analysis of time series with SPSS-TREND

SPSS-Trend is a very flexible package for ARIMA modelling and produces many automatic variables. So we start with a command saving all these variables. The command requests that all of the automatically generated variables produced by Trends procedures be saved to the active file. The second command is for reading the series shown in section 2. The data is stored in the text file "series-g.dat" in the current directory. "free" indicates that each data value is separated by space. SPSS reads the first row until it reaches the end of the line and start with the second line to create a variable x. The third command is for setting the size of columns. The screen you will normally use in the campus will be for vt100 emulation and it does not show 132 columns, which is the default size for SPSS-Trends. This causes some visual confusion. To avoid this, we set the default size to 80. Unfortunately, this will not show the entire series in one screen but still produces visually more comfortable output. If you do not change the default size and print the output with a line printer, however, you can still see the entire series in a single plot.

tset newvar=all.

data list file='series-g.dat' free / x.

set width 80.

The next command generates date variables DATE_, YEAR_, and MONTH_ .

date y 1949 m 1.

The examples of values assigned to these variables are followed.

case DATE_ YEAR_ MONTH_;

1 'Jan 1949' 1949 1

15 'Mar 1950' 1950 3

The following command designates a range of observation to be used with Trends procedure. Since we are using the entire set of observation, this command is equivalent to the command "use all".

use y 1949 m 1 thru y 1960 m 12.

Now, we plot the variable x along with the date variable. "tsplot" produces a time series plot. As can be seen from the second command, "tsplot" is quite flexible in that we can perform log transformation and take differences in a single command. The subcommand "period=12" is for the period of seasonal differencing.

tsplot x.

tsplot x / ln / diff / sdiff / period=12.

The next commands are for identification. The first command is the simplest command for identification. As we know already, this command will show ACF which are substantial and well outside two standard errors. In Trends, PACF is created by a subcommand "pacf." So the first command will show both ACF and PACF. The second command is for showing ACF and PACF after all the required transformation is done. "ln" means log transformation of x, "diff=1" means differencing, "sdiff=1" and "period=12" mean seasonal differencing with period 12.

acf x /pacf.

acf x / ln / diff=1 / sdiff=1 / period=12 / pacf.

Since the above ACF and PACF show the evidence of MA(1) and SMA(1) after the differencing and seasonal differencing, we estimate the series x with ARIMA(0,1,1)x(0,1,1). Again, "12" indicates the period of seasonal differencing, "noconstant" means noconstant in the equation because of differencing and finally, "ln" means log transformation of the series x.

arima x /model=(0,1,1)(0,1,1) 12 noconstant ln.

The estimation in Trends produces many automatic variables. Among those, err_1 is the variable name for the residual. In order to test the hypothesis that this is white noise, we use the command "acf" again because this command produces Ljung-Box Q-statistics.

acf err_1 / pacf.

The next commands are for forecasting. "predict" command is used with procedures "arima" to produce forecasts for time series. "predict" specifies the observations that mark the beginning and end of the forecast period. For these periods, Trends add to the end of existing series "date" variable, forecast variable (variable fit_n) and confidence interval limits. Recall that we are using the command "arima" second time. Therefore, the default name for the fitted value (forecast value for the prediction period) is fit_2.

predict y 1961 m 1 thru y 1961 m 12.

arima x /model=(0,1,1)(0,1,1) 12 noconstant ln.

The following commands are for plotting and observing the forecast value fit_2. Note that the period specified in the command "use" is now through Dec, 1961. Although variable x will be missing for the extra 12 months, we will observe the fitted value fit_2 and date variables during this period. One comment: since the column of the screen is 80, fit_2 will not show along with the plot and x (you will get a warning message that SPSS could not print fit_2). In order to see the value of fit_2 with plot, the third command is added, which does not print x.

use y 1949 m 1 thru y 1961 m 12.

caseplot x fit_2 / format=join.

caseplot fit_2.

5. Analysis of time series with SYSTAT

SYSTAT software comes with several modules and the module for time series analysis is called "SERIES". In order to read in the data, however, we start with "DATA."

data

save seriesg

input x \

get 'series-g.dat'

run

The first command in DATA step is to create a SYSTAT data file called seriesg. The next two commands are to define a variable and indicate a data file (text file) in the current directory. Recall the structure of the data file 'series-g.dat'. In order to read more than one case per line, the backslash (\) is appended to the input statement (Recall "@@" for SAS). To terminate DATA step, "run" is required. The following three commands are to invoke the module "SERIES," open a SYSTAT data file "seriesg," and to create a date variable.

Series

use seriesg

id 1949,12,1

The syntax of the command "id" is "id n1, n2, [n3]", where n1 is the year of the first observation, n2 is the periodicity, and n3 is the period of the first observation. An example of labeling quarterly data is "id 1952, 4, 4".

The following commands are for identification. We plot ACF and PACF for log(x), differenced log(x) (Dlog(x)) and seasonally differenced Dlog(x). Transformation in SERIES are "in place." That is, the series is stored in the active work area and the transformed values are written over the old ones. The original file is not altered, however, because all the work is done in the memory of the computer. The command syntax for difference is "difference=var / lag=n." The default is lag=1. [Note: If you want to get rid of transformed series, use CLEAR. CLEAR clears the series transformation from memory and restores the original values of the series.]

log x

acf

pacf

difference

acf

pacf

difference / lag=12

acf

pacf

The next command is for estimation. The seasonally differenced Dlog(x) is now estimated with one moving average and one seasonal moving average parameters. The syntax for ARIMA is "ARIMA var / forecast=n, p=n, ps=n, q=n, qs=n, season=n."

arima / q=1, qs=1, season=12

After the "arima" command has been executed, the residuals are stored in the active work area under the variable name "residual." we use this variable for diagnosis as follows.

acf residual

The next commands will create a variable "forecast," which can be saved by using "save" command. The newly saved SYSTAT data file "forecast" has two variables "residual" and "forecast."

save forecast

arima / q=1, qs=1, season=12,forecast=12

The following commands are for using the module "DATA" to create date variable since SYSTAT does not save date variables. The easiest way is to create date variable in SYSTAT is to use CASE variable, which is internal variable indicating number of observation.

data

use forecast

let time=case

save newfore

run

After creating a new variable "time," a new file "newfore" is created. Using this file, we can plot forecast values again time variable. The following commands are for plotting the original series and forecast in SYGRAPH and the plot will look like the one in section 2.

sygraph

use newfore

plot newfore*time / line