====================================================================

This document is a publication of the Center for Statistical and

Mathematical Computing, Indiana University, Bloomington, Copyright

1994. Please credit IU, the Stat/Math Center, and the author when

referring to and/or copying this publication for your own purposes.

====================================================================

^{1. Introduction.}

^{This document is intended to demonstrate how to analyze Autoregressive
Integrated Moving Average (ARIMA) time series models with the statistical
software widely available at IU. For the students or researchers who intend to
analyze ARIMA models, this document shows how to proceed with these software
packages for identification, estimation, diagnosis and forecasting.}

^{Among the four software packages introduced in this document, SPSS, SAS
and SYSTAT are the most popular statistical software for general purpose user.
Readers who are not familiar with SAS and SPSS can obtain introductory
documents on them at the UCS Support Center or StatMath Office. RATS was
originally developed for time series analysis alone and provides a variety of
advanced options not found in the general-purpose statistical software. It
also has the most convenient graphical interface among the software reviewed
here. For those who are not familiar with RATS, an introduction is provided in
the appendix. The readers of this document are assumed to have a basic
knowledge of DOS and the other operating systems. For those who are not
familiar with these operating systems, the introductory document for VMS and
UNIX are also available at UCS Support Center. Knowledge of time series is
required, however, because this document does not provide any explanation about
the technical terms used.}

^{Currently, UCS provides RATS in DOS Public Computing Facilities (PCFs).
Among timesharing systems, RATS is available on AIX (IBM RS/6000), and
SPSS/TREND on three timesharing systems (VMS, Ultrix and AIX). SYSTAT is
installed both in DOS and Mac PCFs. SAS is installed on DOS, VMS, Ultrix and
AIX. Since the mainframe does not support high-resolution graphics for most of
workstations available at public sites, the interactive usage of graphics with
SAS and SPSS is limited compared with RATS and SYSTAT. However, one can
produce a high-resolution graphic (postscript file) using SAS and SPSS on the
mainframe.}

^{This document is organized as follows. In the first section, an example
of a time series is provided. The time series used in this document is the
number of airline passengers per month from January 1949 to December 1960.
This example is widely used in the time series literature because of its
usefulness in discussing the issues such as stationarity, seasonal fluctuation
and differencing. Since the graphical examination of a time series is the best
way to start analyzing the series, the second section shows the series
graphically and discuss the strategy of estimating the parameters of the
series. In the third section, we first analyze the series with RATS. In the
fourth section, SAS is used to generate the same results. The analysis of the
series by SPSS and SYSTAT follows in the fifth and sixth sections. An
introduction to RATS is in the appendix.}

^{2. Data Examination}

^{The data used in the document is as follows:}

^{112 118 132 129 121 135 148 148 136 119 104 118}

^{115 126 141 135 125 149 170 170 158 133 114 140}

^{145 150 178 163 172 178 199 199 184 162 146 166}

^{171 180 193 181 183 218 230 242 209 191 172 194}

^{196 196 236 235 229 243 264 272 237 211 180 201}

^{204 188 235 227 234 264 302 293 259 229 203 229}

^{242 233 267 269 270 315 364 347 312 274 237 278}

^{284 277 317 313 318 374 413 405 355 306 271 306}

^{315 301 356 348 355 422 465 467 404 347 305 336}

^{340 318 362 348 363 435 491 505 404 359 310 337}

^{360 342 406 396 420 472 548 559 463 407 362 405}

^{417 391 419 461 472 535 622 606 508 461 390 432}

^{This data is a single series. In order to save space, it is arranged so
that the first 12 observations (year 1949) are in the first row, the second 12
observations are in the second row, etc. As will be discussed later, each
program has slightly different way of reading in the data formatted in this
way.}

^{A plot of the data is shown below.}

^{As can be seen from this plot, the series is not stationary ( a
stationary series has a constant mean, a constant variance and no trend). Note
that the variance increases for the recent observations. Therefore, it will be
necessary to transform the series into log scale. Also note that the series
shows a strong trend. Therefore, we will need differencing. Finally, note
that the fluctuation becomes larger around December every year. To get rid of
this seasonal fluctuation, we will also need seasonal differencing. This
observation will be confirmed by formal method for identification in the
following sections.}

^{3. Analysis of time series with RATS}

^{The first command for the time series analysis is to read a data file.
The next four commands show how to read an external data file in RATS.}

^{cal 1949 1 12}

^{allocate 144}

^{open data series-g.dat}

^{data(org=obs) / x}

^{Most software for time series analysis has a command for identifying the
start date. The syntax for date command in RATS is "cal year period per year."
Thus, by "cal 1949 1 12," RATS reads the first number (row1, col1) as the
observation of "January 1, 1949," the second number (row1 col2) as the
observation of "February 1, 1949," so on. The reason why the next data is the
observation of "February 1, 1949" is that "12" indicates that the data is
"monthly" data. "allocate" is for setting the default data series length.
Since the length is 144, the last observation will be "December 1, 1960." The
command "open data series-g.dat" directs RATS to read the data file named
"series-g.dat" stored in the current directory. The data file "series-g.dat"
is the text file containing the data shown in section 2. Finally, the command
"data(org=obs) / x" is to let RATS know the structure of the data. "org=obs"
means that the data is organized by observation. Since we have specified only
one variable (x), RATS will read the first row and the first column and assign
the value to the first observations of x, the first row and the second column
and assign the value to the second observations of x and so on. "/" is for
providing the names of the variables in the data. Although the variable in the
data is named as x, you can name it as you wish. The following two commands
are for plotting the series.}

^{graph(header='Airline Passengers',vlabel='Passengers',$}

^{hlabel='1949 - 1961') 1}

^{# x}

^{"graph" is the main command for that (some options for graph are inserted
in the parenthesis but they are not essential). "$" is a continuation sign.
"1" directs RATS to produce one graph. Finally, "# x" makes RATS produce the
graph of the variable x, which was shown earlier in this document.}

^{After you examine the series graphically, you will realize that the
series shows increasing variance, which is a violation of the stationarity
assumption needed for ARIMA modelling. A typical way to get rid of this
problem is to transform the series into log scale. The following command is
for that purpose.}

^{set xlog = log(x)}

^{"set" is a powerful data transformation command in RATS. we can perform
a variety of data transformation with "set" in RATS. Use the above graphics
command with x replaced by xlog now and you will find that the new series shows
the constant variance.}

^{Once the series shows a constant covariance, the next step is to identify
the series. Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF) will help us to identify the series. While it is possible to
examine the numerical values of ACF and PACF to identify the series, a more
convenient way to identify the series is to use graphics of ACF and PACF. The
following commands are for producing graphs of ACF and PACF where the values
greater than 2 standard deviation are shaded.}

^{correlate(stderrs=stderrs,partial=partd) xlog1 / corrxd1}

^{set signif = abs(corrxd1/stderrs) > 2.0}

^{graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1}

^{# corrxd1}

^{set signif = abs(sqrt(%nobs)*partd) > 2.0}

^{graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1}

^{# partd}

^{The first line of the above set of commands directs RATS to calculate ACF
of xlog1 and name it corrxd1. It also calculates the standard error of ACF and
PACF. It named the results stderrs and partd, respectively. The second line
is to find ACF whose value is greater than two standard errors and name it
signif. The third line is to graph ACF with those values greater than two
standard errors shaded. The fourth and fifth lines graph the PACF (the
asymptotic variance of partial autocorrelation is (1/number of observations).
We already know from the plot of the series in section 2 that it has trend and
seasonal fluctuation. This is evidenced from the plots of ACF and PACF of
log(x): the autocorrelations are substantial and well outside two standard
errors. There are two bulges in the ACF plot at lag=1 and lag=12, suggesting
the nonseasonal (monthly) and seasonal (yearly) dependencies.}

^{Lag 1 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 1}

^{0 | |********************|}

^{1 | *******| . |}

^{2 | . |** . |}

^{3 | ****| . |}

^{4 | . | . |}

^{5 | . |* . |}

^{6 | . |* . |}

^{7 | . *| . |}

^{8 | . | . |}

^{9 | . |**** |}

^{10 | . **| . |}

^{11 | . |* . |}

^{12 | ********| . |}

^{13 | . |*** . |}

^{14 | . *| . |}

^{15 | . |*** . |}

^{16 | . ***| . |}

^{17 | . |* . |}

^{18 | . | . |}

^{xlog is now differenced in order to remove the trend. The last
two commands are for plotting the results.}

^{difference xlog / xlogd1}

^{graph 1}

^{# xlogd1}

^{Since the differenced data still shows seasonal fluctuation (around
December, the number of airline passengers is large due to many holidays in
that period), we difference the series xlogd1 once again (sdiffs=1) with "span
12" and name it xlogd1s1.}

^{difference(span=12,sdiffs=1) xlogd1 / xlogd1s1}

^{The option "span=12" is not necessary because RATS understands that
seasonal difference will be done with lag=12 in this monthly data. In case you
might want other lags for some reason, you can define these explicitly. The
series after all the necessary transformation can be plotted with the following
commands.}

^{graph 1}

^{# xlogd1s1}

^{This series should not show any evidence to violate the assumptions of
stationarity. We can also produce ACF and PACF again using this series
xlogd1s1. The procedure to plot ACF and PACF will be the same as the previous
command for log(x) except that the variable xlog1 is replaced by xlogd1s1.}

^{Since there are ACF spikes at lags 1 and 12, and the PACF shows decay at
lag 1 and 12, it shows the evidence of both MA(1) and SMA(1). Therefore, we
decide that the series is ARIMA(0, 1, 1)x(0, 1, 1). The following two commands
show how to use a RATS module called @bjident, which is an another method of
producing the graphs of ACF and PACF.}

^{source(noecho) bjident.src}

^{@bjident(trans=log, diffs=1, sdiffs=1) x}

^{This is a nice feature in that it produces many combinations of ACF and
PACF with a short command. However, it does not indicate the values greater
than two standard deviation. These commands produce ACF and PACF for the model
with the combination of difference and seasonal difference being 0x0, 0x1, 1x0
and 1x1. You will note that the fourth graph is the same as the one for
xlogd1s1}

^{Once the identification is done, the estimation of ARIMA model in RATS can
be done by using the module "boxjenk."}

^{boxjenk(span=12,diff=1,sdiff=1,MA=1,SMA=1) xlog / resids}

^{The options used are for ARIMA (0,1,1)x(0,1,1) and should be obvious from
the previous discussion. The suboption "resids" after "/" is for saving
residuals to be used for diagnosis, which is shown below.}

^{correlate(stderrs=stderrs,partial=partds) resids / corrresids}

^{set signif = abs(corrresids/stderrs) > 2.0}

^{graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1}

^{# corrresids}

^{set signif = abs(sqrt(%nobs)*partds) > 2.0}

^{graph(style=bargraph,number=0,max=1.0,min=-1.0,shading=signif) 1}

^{# partds}

^{As before, these commands calculate ACF and PACF of residuals (resids) and
plot them to see whether we do not reject the hypothesis that they are white
noise. Note that ACF and PACF show the pattern of "almost" white noise. This
means that the model ARIMA(0,1,1)x(0,1,1) is verified and we are now ready to
apply this model to forecasting.}

^{The module bjfore.src is a convenient tool for forecasting. Like the
module @bjident and bjfore, this module provides a compact way to obtain
forecasting. A nice feature of @bjfore is that it produces high-resolution
plot of the forecasted values. It allows one to visually appreciate how the
chosen model ARIMA(0,1,1)x(0,1,1) performs forecasting compared to any other
arbitrary model (try diffs=2, for example, and see the forecasting!).}

^{source(noecho) bjfore.src}

^{@bjfore(trans=log, diffs=1, sdiffs=1, mas=1, sma=1) x 1961:1 1962:1
xfore}

^{print / x xfore}

^{graph 2}

^{# x 1949:1 1960:12}

^{# xfore 1961:1 1962:1}

^{The following is the picture of the forecasting results. Note that the
scale of the forecasting is consistent with the original data and that the
forecast looks reasonable.}

^{3. Analysis of time series with SAS/ETS}

^{The Econometrics Time Series (ETS) for SAS can produce the identical
results we obtained from the above section. To avoid repetition, the
discussion about the characteristics of the series will be shorter and the
emphasis will be on the different aspects of commands in SAS. The following
commands are for reading in the series into SAS and creating a new variable for
date (The commands for SAS are portable. So you can use this example at any
platform).}

^{data seriesg;}

^{infile 'series-g.dat';}

^{input x @@;}

^xlog=log(x);

^{year=int((_n_-1)/12)+1949;}

^{month=mod(_n_-1,12)+1;}

^{date=mdy(month,1,year);}

^{drop year month;}

^{format date monyy.;}

^run;

^{The first line indicates a data step and it is finished with the command
"RUN;". The second line directs SAS to read the data file "series-g.dat" in
the current directory. The third line indicates the variable in the data file.
As before, we name it x. We have "@@" after x so that SAS read the data
continuosly rather than stopping at the first blank. Without @@, SAS will go
to the second line and read the first number and go to the third line, and so
on. "@@" is needed when we want to create multiple observations form a single
record (12 observations from a single record, in our case). The next five
lines are for creating a new variable "date". SAS, unlike RATS, will not
automatically associate dates with values of x, so we have to create date
variables. As we know, the data we are using is a time series from Jan 1949 to
Dec 1960. To create a date variable, we first use SAS internal variable _n_,
which is the number of observation (1, 2, 3,...etc). The second command we use
is INT, which represents integer. Thus, we create a new variable "year", whose
values for first 12 observations are 0 + 1949. The values of the next 12
observations are 1 + 1949, and the last 12 observations are 11 + 1949. The
third command we use for creating date variable is MOD, which is the remainder
when the integer quotient of the first argument divided by the second argument
is calculated. Thus, the value of month for the first observation is 0 + 1,
1+1 for the second observation and 1+1 again for the thirteenth observation,
and so on. The fourth function we use is MDY, which returns a SAS internal
value for a particular month, day, and year. The first value of date is -4017
and the second value is -3986, to which you need not pay attention too much.
The SAS format command "format date monyy." makes SAS to print these internal
date value in the form of Jan49, Feb49, and so on. Once we have this date
variable, we don't need the variables year and month, so we drop them out. One
final comment: If you don't like this procedure of creating date variable, you
can simply use the command "date=_n_," which creates a variable "date" whose
value is the number of the observation. The resulting date variable will not
be "historical" but you can still plot the series with this and examine the
trend. Since most time series analysis software always explicitly defines date
variable, we took some extra steps. The following commands are for plotting
the series x on VMS or UNIX mainframe platform.}

^{proc plot data=seriesg;}

^{plot x*date; run;}

^{To create a Postscript file, which you can print out on the printers in
PCF or to the printer WCCPS at the Wrubel Computing Center, the following group
of command lines can be used.}

^{filename graph1 'test.graph1';}

^{goptions device=ps gsfmode=replace}

^{gsfname=graph1 gsflen=132 nodisplay;}

^{proc gplot;}

^{plot x*date;}

^{If you are interested in graphical interface, the current version of SAS
will not be as good as RATS. For the purpose of identifying the series,
however, high-resolution graphics is not essential because we mainly use ACF
and PACF for identification and SAS is quite convenient as will be shown below.
The following commands is for transforming x into logarithmic scale before
starting estimation.}

^{data seriesg;}

^{set seriesg;}

^xlog=log(x);

^run;

^{Identification, estimation, diagnosis, and forecasting are done in a
single procedure PROC ARIMA in SAS.}

^{proc arima data=seriesg;}

^{identify var=xlog(1,12);}

^{estimate q=(1)(12) noconstant method=cls;}

^{forecast out=b lead=12 id=date interval=month;}

^run;

^quit;

^{The first line is to invoke SAS procedure "arima" with data "seriesg"
which was created in the previous stage. The second command is for
identification. This command produces ACF, PACF and Ljung and Box's Q
statistics for testing the null-hypothesis that the series is white noise.
Since we know that xlog needs 1 differencing and 1 seasonal differencing, we
indicate that in the parenthesis. Note again that the ACF drops to 0 after lag
1 and 12 while PACF tail off after these lags. The third command is for
estimating the series with the model ARIMA(0,1,1)x(0,1,1). Since the series
was already differenced in the previous command "identify," we only need to
indicate MA(1) and SMA(1) in the estimation stage. SAS offers three algorithm
for estimation; cls, uls and ml. For detail, refer to SAS-ETS manual. In this
estimation, we use cls. Since the series is differenced, noconstant is
specified. Once ETS estimates the series, it will also produce the ACF, PACF,
and Ljung and Box's Q statistics for the residual. Examining these values
should lead us not to reject the null hypothesis that the residual is white
noise. Finally, the fourth line is for forecasting. The result will be stored
in the data called "temp". As before, the forecast period is next 12 month
(lead=12). In order to create date variable for the extra 12 month, the
commands "id=date" and "interval=month" are used but these can be omitted. In
the data set "temp", we will have the forecast value of xlog named "forecast."
To obtain the forecast value in the scale of the original value of x, however,
we need to perform transformation of the variable "forecast."}

^{data temp1;}

^{set temp;}

^{forecast=exp(forecast+std*std/2);}

^run;

^quit;

^{The first line of the above command is to create a new data set "temp1",
which is based on data set "temp" created by forecast. The command
"forecast=exp(forecast+std*std/2)" is to convert the forecast value into the
original scale. The adjustment factor exp(std*std/2) is needed when ARIMA is
applied to log transformed data (See SAS-ETS manual).}

^{4. Analysis of time series with SPSS-TREND}

^{SPSS-Trend is a very flexible package for ARIMA modelling and produces
many automatic variables. So we start with a command saving all these
variables. The command requests that all of the automatically generated
variables produced by Trends procedures be saved to the active file. The
second command is for reading the series shown in section 2. The data is
stored in the text file "series-g.dat" in the current directory. "free"
indicates that each data value is separated by space. SPSS reads the first row
until it reaches the end of the line and start with the second line to create a
variable x. The third command is for setting the size of columns. The screen
you will normally use in the campus will be for vt100 emulation and it does not
show 132 columns, which is the default size for SPSS-Trends. This causes some
visual confusion. To avoid this, we set the default size to 80.
Unfortunately, this will not show the entire series in one screen but still
produces visually more comfortable output. If you do not change the default
size and print the output with a line printer, however, you can still see the
entire series in a single plot.}

^{tset newvar=all.}

^{data list file='series-g.dat' free / x.}

^{set width 80.}

^{The next command generates date variables DATE_, YEAR_, and MONTH_
.}

^{date y 1949 m 1.}

^{The examples of values assigned to these variables are followed.}

^{case DATE_ YEAR_ MONTH_;}

^{1 'Jan 1949' 1949 1}

^{15 'Mar 1950' 1950 3}

^{The following command designates a range of observation to be used with
Trends procedure. Since we are using the entire set of observation, this
command is equivalent to the command "use all".}

^{use y 1949 m 1 thru y 1960 m 12.}

^{Now, we plot the variable x along with the date variable. "tsplot"
produces a time series plot. As can be seen from the second command, "tsplot"
is quite flexible in that we can perform log transformation and take
differences in a single command. The subcommand "period=12" is for the period
of seasonal differencing.}

^{tsplot x.}

^{tsplot x / ln / diff / sdiff / period=12.}

^{The next commands are for identification. The first command is the
simplest command for identification. As we know already, this command will
show ACF which are substantial and well outside two standard errors. In
Trends, PACF is created by a subcommand "pacf." So the first command will show
both ACF and PACF. The second command is for showing ACF and PACF after all
the required transformation is done. "ln" means log transformation of x,
"diff=1" means differencing, "sdiff=1" and "period=12" mean seasonal
differencing with period 12.}

^{acf x /pacf.}

^{acf x / ln / diff=1 / sdiff=1 / period=12 / pacf.}

^{Since the above ACF and PACF show the evidence of MA(1) and SMA(1) after
the differencing and seasonal differencing, we estimate the series x with
ARIMA(0,1,1)x(0,1,1). Again, "12" indicates the period of seasonal
differencing, "noconstant" means noconstant in the equation because of
differencing and finally, "ln" means log transformation of the series
x.}

^{arima x /model=(0,1,1)(0,1,1) 12 noconstant ln.}

^{The estimation in Trends produces many automatic variables. Among those,
err_1 is the variable name for the residual. In order to test the hypothesis
that this is white noise, we use the command "acf" again because this command
produces Ljung-Box Q-statistics.}

^{acf err_1 / pacf.}

^{The next commands are for forecasting. "predict" command is used with
procedures "arima" to produce forecasts for time series. "predict" specifies
the observations that mark the beginning and end of the forecast period. For
these periods, Trends add to the end of existing series "date" variable,
forecast variable (variable fit_n) and confidence interval limits. Recall
that we are using the command "arima" second time. Therefore, the default name
for the fitted value (forecast value for the prediction period) is
fit_2.}

^{predict y 1961 m 1 thru y 1961 m 12.}

^{arima x /model=(0,1,1)(0,1,1) 12 noconstant ln.}

^{The following commands are for plotting and observing the forecast value
fit_2. Note that the period specified in the command "use" is now through Dec,
1961. Although variable x will be missing for the extra 12 months, we will
observe the fitted value fit_2 and date variables during this period. One
comment: since the column of the screen is 80, fit_2 will not show along with
the plot and x (you will get a warning message that SPSS could not print
fit_2). In order to see the value of fit_2 with plot, the third command is
added, which does not print x.}

^{use y 1949 m 1 thru y 1961 m 12.}

^{caseplot x fit_2 / format=join.}

^{caseplot fit_2.}

^{5. Analysis of time series with SYSTAT}

^{SYSTAT software comes with several modules and the module for time series
analysis is called "SERIES". In order to read in the data, however, we start
with "DATA."}

^data

^{save seriesg}

^{input x \}

^{get 'series-g.dat'}

^run

^{The first command in DATA step is to create a SYSTAT data file called
seriesg. The next two commands are to define a variable and indicate a data
file (text file) in the current directory. Recall the structure of the data
file 'series-g.dat'. In order to read more than one case per line, the
backslash (\) is appended to the input statement (Recall "@@" for SAS). To
terminate DATA step, "run" is required. The following three commands are to
invoke the module "SERIES," open a SYSTAT data file "seriesg," and to create a
date variable.}

^Series

^{use seriesg}

^{id 1949,12,1}

^{The syntax of the command "id" is "id n1, n2, [n3]", where n1 is the year
of the first observation, n2 is the periodicity, and n3 is the period of the
first observation. An example of labeling quarterly data is "id 1952, 4,
4".}

^{The following commands are for identification. We plot ACF and PACF for
log(x), differenced log(x) (Dlog(x)) and seasonally differenced Dlog(x).
Transformation in SERIES are "in place." That is, the series is stored in the
active work area and the transformed values are written over the old ones. The
original file is not altered, however, because all the work is done in the
memory of the computer. The command syntax for difference is "difference=var /
lag=n." The default is lag=1. [Note: If you want to get rid of transformed
series, use CLEAR. CLEAR clears the series transformation from memory and
restores the original values of the series.]}

^{log x}

^acf

^pacf

^difference

^acf

^pacf

^{difference / lag=12}

^acf

^pacf

^{The next command is for estimation. The seasonally differenced Dlog(x) is
now estimated with one moving average and one seasonal moving average
parameters. The syntax for ARIMA is "ARIMA var / forecast=n, p=n, ps=n, q=n,
qs=n, season=n."}

^{arima / q=1, qs=1, season=12}

^{After the "arima" command has been executed, the residuals are stored in
the active work area under the variable name "residual." we use this variable
for diagnosis as follows.}

^{acf residual}

^{The next commands will create a variable "forecast," which can be saved by
using "save" command. The newly saved SYSTAT data file "forecast" has two
variables "residual" and "forecast."}

^{save forecast}

^{arima / q=1, qs=1, season=12,forecast=12}

^{The following commands are for using the module "DATA" to create date
variable since SYSTAT does not save date variables. The easiest way is to
create date variable in SYSTAT is to use CASE variable, which is internal
variable indicating number of observation.}

^data

^{use forecast}

^{let time=case}

^{save newfore}

^run

^{After creating a new variable "time," a new file "newfore" is created.
Using this file, we can plot forecast values again time variable. The
following commands are for plotting the original series and forecast in SYGRAPH
and the plot will look like the one in section 2.}

^sygraph

^{use newfore}

^{plot newfore*time / line}