Ming Zhong
Doctoral Student
Faculty of Engineering, University of Regina
Regina, SK, Canada, S4S 0A2
Phone: (902) 496-8152
Fax: (902) 420-5035
Email: Ming.Zhong@stmarys.ca

Pawan Lingras*
Professor
Dept. of Mathematics and Computing Science
Saint Mary's University
Halifax, NS, Canada, B3H 3C3
Phone: (902) 420-5798
Fax: (902) 420-5035
Email: Pawan.Lingras@stmarys.ca

and

Satish Sharma
Professor
Faculty of Engineering, University of Regina
Regina, SK, Canada, S4S 0A2
Phone: (306) 585-4553
Fax: (306) 585-4855
Email: Satish.Sharma@uregina.ca

Ming Zhong, Pawan Lingras, and Satish Sharma
ABSTRACT: The principle of Base Data Integrity, endorsed by both the American Association of State Highway and Transportation Officials (AASHTO) and the American Society for Testing and Materials (ASTM), recommends that missing values not be imputed in the base data. However, updating missing values may be necessary for data analysis and helpful in establishing more cost-effective traffic data programs. The analyses applied to data sets from two highway agencies show that, on average, over 50% of permanent traffic counts (PTCs) have missing values. It is difficult to eliminate such a significant portion of data from the analysis. A literature review indicates that the limited existing research uses factor or autoregressive integrated moving average (ARIMA) models for predicting missing values. Factor-based models tend to be less accurate, and ARIMA models use only the historical data. In this study, genetically designed neural network and regression models, factor models, and ARIMA models were developed to update pseudo-missing values of six PTCs from Alberta, Canada. Both short-term prediction models and models based on data from before and after the failure were developed. Factor models were used as benchmarks. It was found that genetically designed regression models based on data from before and after the failure gave the most accurate results. Average errors for refined models were lower than 1% and the 95th percentile errors were below 2% for counts with stable patterns. Even for counts with relatively unstable patterns, average errors were lower than 3% in most cases, and the 95th percentile errors were consistently below 9%. ARIMA models and genetically designed neural network models also showed superior performance to the benchmark factor models. It is believed that the models proposed in this study would be helpful to highway agencies in their traffic data programs.
Key words: Missing values, Traffic counts, Genetic algorithms, Time delay neural network, Locally weighted
regression, Autoregressive integrated moving average
INTRODUCTION
Highway agencies commit a
significant portion of their resources to data collection, summarization, and
analysis (Sharma et al. 1996). The
data is used in planning, design, control, operation, and management of traffic
and highway facilities. However, the presence of missing values makes the data
analysis difficult. Without proper imputation methods, traffic counts with
missing values are usually discarded and new counts have to be retaken.
This study analyzed missing values in data sets from two highway agencies in North America. The first data set was from Alberta Transportation, and the other was from the Minnesota Department of Transportation (MnDOT). In Alberta, over seven years, more than half of the total counts have missing values; during some years the percentage is as high as 70% to 90%. One year of data from MnDOT shows more than 40% of counts with missing values. Williams et al. (1998) applied seasonal ARIMA and
exponential smoothing models to predict short-term traffic for two study sites
on an urban freeway near Washington, D.C. It was reported that approximately 20
percent of the data in the development and test sets of their study were missing.
Ramsey and Hayden (1994) introduced AutoCounts, a computer program used by the Countywide Traffic and Accident Data Unit (TADU) in England, to process
automatic traffic count data. It was found that infill models had to be used to
estimate average flows for more than 50 percent of months for many years at a
study site.
There
are increasing concerns about data imputation and Base Data Integrity. The
principle of Base Data Integrity is an important theme discussed in both
American Society for Testing and Materials (ASTM) Standard Practice E1442, Highway Traffic Monitoring Standards
(America 1991) and the American Association of State Highway and Transportation
Officials (AASHTO) Guidelines for Traffic
Data Programs (America 1992). The principle says that traffic measurements
must be retained without modification and adjustment. Missing values should not
be imputed in the base data. However, this does not prohibit imputing data at the analysis stage. In some cases, traffic counts with missing values may be the only data available for a certain purpose, and imputation is then necessary for further analysis. In accordance with the principle of Truth-in-Data, the AASHTO Guidelines (America 1992) also recommend that highway agencies document their procedures for editing traffic data.
For
the traffic counts with missing values, highway agencies usually either retake
the counts or estimate the missing values. Estimating missing values is known
as data imputation. Since retaking counts is sometimes impossible due to limited resources and time, imputing the data became a popular alternative (Albright 1991a). For example, it was reported that many highway agencies in the United States estimated missing values for their traffic counts (New Mexico 1990). In Europe, highway authorities in the Netherlands, France, and the United Kingdom all used computer programs for data validation routines. Usually missing or invalid
data was replaced with historical data from the same site during the same
period (FHWA 1997). The experience with data from Alberta Transportation also
indicates that the agency used data imputation before 1995. The replaced values
of missing data were marked with minus signs for some years. Imputing data with reasonable accuracy may help establish a more cost-effective traffic data program. The analysis of Alberta data also shows that a significant percentage (varying from 10% to 44% from year to year) of traffic counts have missing data for several successive days or months. Usually these PTCs cannot be used to calculate AADT or DHV because of the missing data. Such PTCs may be used as
seasonal traffic counts (STCs), short-period traffic counts (SPTCs), or just
discarded by highway agencies. However, the information contained in these PTCs
is certainly more than that from STCs and SPTCs. If missing data from PTCs can
be accurately updated, further analysis could be applied based on AADT or DHV.
A
review of literature indicates that little research has been done on missing
values. Most methods used by transportation practitioners were simple factor
approaches or moving average regression analyses (New Mexico 1990; FHWA 1997).
Two studies (Ahmed and Cook 1979; Nihan and Holmesland 1980) from the United
States used Box-Jenkins techniques to predict short-term traffic for urban freeways.
The models showed reasonable accuracy. These models can be used to update
missing values for traffic counts. Models developed by Nihan and Holmesland
(1980) were able to predict average weekday volumes for two months in which the entire monthly data was missing. A group of scholars at the University of Leeds,
England, tried to model outliers and missing values in traffic count time
series by employing exponentially weighted moving average, autocorrelation
based influence function, and autoregressive integrated moving average (ARIMA)
models (Clark 1992; Redfern et al.
1993; Watson et al. 1993). It was
found that ARIMA models outperformed other models in detecting missing values
and outliers.
In
this study, factor approaches, time series analysis, and genetically designed
neural network and regression models are tested on six permanent traffic counts
PTCs) from Alberta, Canada, to investigate their ability to update missing values. This study also compares models based on historical data with models based on data from before and after the failure. The six PTCs belong to
different groups based on the trip purpose and trip length distributions. The
experiments presented in this paper illustrate how to use proposed techniques
to update missing values of these PTCs. The techniques used in this study could
not only be applied to permanent traffic volume counts, but also to seasonal or
short-term traffic volume counts, vehicle classification counts, weight counts,
and speed counts.
LITERATURE REVIEW
There is a significant amount of research related to missing values (Little and Rubin
1987; Bole et al. 1990; Beveridge
1992; Wright 1993; Gupta and Lam 1996; Singh and Harmancioglu 1996). However,
limited research is available on how missing data are handled by transportation
practitioners. Southworth et al.
(1989) introduced a system called RTMAS for urban population evacuations in
times of threat. One subroutine of this system is AUTOBOX, which applies
Box-Jenkins time series model to the hourly or daily traffic count data. AUTOBOX allows complete autoregressive
integrated moving average (ARIMA) modeling. The example in their study clearly showed that the proposed ARIMA model was good at detecting unusual traffic profiles and at predicting hourly counts. They used the past five days' data to predict the 24 hourly volumes of the same day of the next week. It was found that
22 hourly volumes were within the 95% confidence limits of the observed counts. The other two were detected as outliers caused by an evacuation response to the threat of Hurricane Elena. Such a system can also be used to predict missing values for traffic counts.
In 1990, the New Mexico State Highway and Transportation Department conducted a survey of traffic monitoring practice (New Mexico 1990) in the United States. It was shown that when portable devices
failed, 13 states used some procedure to estimate the missing values and
complete the data set. When permanent devices failed, 23 states employed some
procedure to estimate the missing values (Albright 1991b). Various
methods were used for this purpose. For example, in Alabama, if fewer than 6 hours are missing, the data are estimated using the previous year's data or other data from the same month. If more than 6 hours are missing, the day is voided. In
Delaware, estimates of missing values are based on a straight line using the
data from the months before and after the failure. Most of these methods apply
simple factors to historical data to estimate missing values. In Kentucky, a
computer program was used to estimate and fill in the blanks (New Mexico 1990).
In 1997, the Federal Highway Administration (FHWA) conducted a study of traffic monitoring programs and technologies in Europe (FHWA 1997). It was reported that highway agencies in the Netherlands, France, and the United Kingdom used computer programs for data validation routines. For example, a software system called INTENS was used in the Netherlands for data analysis and validation. The software used a smart linear interpolation between locations with available data to estimate missing traffic volumes. In France, a system called MELODIE
was used for data validation. Data validation was conducted visually by the
system operator. Invalid data were replaced with the previous month's data. Several
data validation schemes were used in the United Kingdom. One of them was used
by the Central Transport Group (CTG) to validate permanent recorder data. Invalid data were replaced with data extracted from the previous week's valid data collected at that site. No research has been found assessing the accuracy
of such imputations.
A series of studies (Clark 1992; Redfern et al. 1993; Watson et al. 1993) was carried out by a group of scholars at the University of Leeds, England, in the early 1990s. Redfern et al. (1993) tested four types of models on four traffic time series supplied by the Department of Transportation (DOT) in London. These models were exponentially
weighted moving average, autocorrelation based influence function, ARIMA model
using large residuals, and ARIMA model using the Tsay likelihood ratio
diagnostics. It was reported that the estimation of replacement values for both extreme and missing values was done most efficiently using the parametric ARIMA(1,0,0)(0,1,1)_7 model. However, it was also reported that the
estimated replacements of the missing values showed considerable variation
(Redfern et al. 1993). The study also
mentioned concerns about the Base Data Integrity.
A survey of practical solutions used by consultancies and local authorities in England (Redfern et al. 1993) reported two broad categories of solutions: the by-eye method and computerized packages. Most automated practical solutions to patching were based upon simple, moving, or exponentially weighted moving averages, or variants thereof. For example, the DOT in London employed an exponentially weighted moving average model
to update missing values. The process involved validating new traffic count
data against old data from the same site collected over the previous weeks at
the same time. The following equation was used to estimate the missing or rejected data, x̂_{t,s}, at time t:

x̂_{t,s} = α·x_{t-1,s} + α(1-α)·x_{t-2,s} + … + α(1-α)^{n-1}·x_{t-n,s}    (1)

where x_{t-1,s}, x_{t-2,s}, …, x_{t-n,s} represent the observations for that particular site and vehicle category at the same times for the weeks 1, 2, …, n before the current observation, and α is a constant such that 0 < α < 1. A value of 0.7 was typically used for the parameter α.
The Countywide Traffic and Accident Data Unit (TADU) used
AutoCounts to validate collected data and infill missing values from automatic
traffic counts (Ramsey and Hayden 1994). The agency needs monthly five- and
seven-day flow averages for trend and yearly analysis within AutoCounts.
Usually these statistics can be obtained directly from the validated data that
have been flagged as typical. However, when there are no typical data the
infill model is applied. The model
estimates weekly flows, and starts with a seasonal profile where all weeks are
considered to be equal. Then, considering the data in ascending order of age by
year, the profile is modified each year. As a starting point, the previous year's profile is calculated as follows:

(2)

The model is applied on a week-to-week basis for w = 1 to 53, each week being related to week w + 1. Here FW_w is the actual weekly flow for week w; f_w is the estimated weekly seasonal factor for week w; f_42 (w = 42, mid-October) is always 1.0. The model is applied iteratively
either a maximum of 50 times or until no improvement in fit is achieved. The
output of the process is a full 53-week flow profile for the year under
consideration. No evaluations were made on the accuracy of such models (Ramsey
and Hayden 1994).
This section provides a brief review of factor
approaches, time series analysis, regression analysis, neural networks, and
genetic algorithms used in the present study.
Factor
approaches may be the most popular data imputation or prediction methods.
Factor approaches usually involve developing a set of factors from a historical data set and then applying them to new data for predictions. For
example, a set of hourly factors (HF), daily factors (DF), and monthly factors
(MF) can be developed based on data from permanent traffic counts. Traffic
parameters, such as AADT and DHV, then could be predicted by applying these
factors to short-period traffic counts (Garber and Hoel 1999). The virtue of
such methods is their simplicity. However, the results are usually less
accurate than more sophisticated models.
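A minimal sketch of the factor idea, using the MF = MADT/AADT convention adopted later in this paper; the function name and all numbers are illustrative, not taken from the cited sources:

```python
def estimate_aadt(daily_volume, daily_factor, monthly_factor):
    """Expand a short-period daily count to an AADT estimate.

    daily_factor   = (average daily traffic for that weekday) / MADT
    monthly_factor = MADT / AADT (the MF defined for the PTC grouping).
    """
    return daily_volume / (daily_factor * monthly_factor)

# A hypothetical July weekday count on a road with a strong summer peak:
print(round(estimate_aadt(daily_volume=8400, daily_factor=0.95, monthly_factor=1.40)))
```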
A time series is a
chronological sequence of observations on a particular variable. Time series
data are often examined in hopes of discovering a historical pattern that can
be exploited in the forecast. Time series modeling is based on the assumption
that the historical values of a variable provide an indication of its value in
the future (Box and Jenkins 1970).
Many techniques are available for modeling univariate
time series, such as exponential smoothing, Holt-Winters procedure, and
Box-Jenkins procedure. Exponential smoothing should only be used for
non-seasonal time series showing little visible trend. Exponential smoothing
may easily be generalized to deal with time series containing trend and
seasonal variation. The resulting procedure is usually referred to as the
Holt-Winters procedure. The Box-Jenkins procedure is the most popular tool for time series analysis; it builds an autoregressive integrated moving average (ARIMA) model. Both autoregressive
and moving average components are considered in these models. Such a model is
called an integrated model because
the stationary model that is fitted to the differenced data has to be summed or
integrated to provide a model for the non-stationary data (Chatfield 1989). The
general autoregressive integrated moving average process is of the form:
Given

W_t = ∇^d X_t    (3)

W_t = a_1·W_{t-1} + … + a_p·W_{t-p} + Z_t + b_1·Z_{t-1} + … + b_q·Z_{t-q}    (4)

where: X_t is a non-stationary process; W_t is a stationary process; ∇ is the differencing operator; Z_t is white noise; the a_i and b_j are constant coefficients; and p, d, and q are the orders of the autoregressive, differencing, and moving average components.
The above ARIMA process, describing the d-th differences of the data, is said to be of order (p, d, q), usually referred to as ARIMA(p, d, q). An ARIMA model considering seasonality in the data is often represented by ARIMA(p, d, q)(P, D, Q)_s, where P, D, and Q are the orders of the seasonal autoregressive, differencing, and moving average components, and s is the seasonal period, i.e., the pattern repeats every s observations.
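The differencing step in Eq. (3) can be illustrated directly. This sketch only shows how repeated differencing removes a trend; it does not fit an ARIMA model:

```python
def difference(series, d=1):
    """Apply the differencing operator (nabla) to a series d times."""
    for _ in range(d):
        series = [series[i] - series[i - 1] for i in range(1, len(series))]
    return series

x = [100, 103, 108, 115, 124]   # quadratic trend: clearly non-stationary
print(difference(x, d=1))       # first differences still trend upward
print(difference(x, d=2))       # second differences are constant
```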
A
variant of regression analysis called locally weighted regression was used in
this study. Locally weighted regression is a form of instance-based (or
memory-based) algorithm for learning continuous mappings from real-valued input
vectors to real-valued output vectors. Local methods assign a weight to each
training observation that regulates its influence on the training process. The
weight depends upon the location of the training point in the input variable
space relative to that of the point to be predicted. Training observations
closer to the prediction point generally receive higher weights (Friedman
1995). The locally weighted regression program used in this study can be downloaded from a web site (Locally 2001).
Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a parameterized model. After training, the model is used for predictions and the data are generally discarded. In contrast, memory-based methods are non-parametric approaches that explicitly retain the training data, and use it each time a prediction needs to be made. Locally weighted regression is a memory-based method that performs regression around a point of interest using only training data that are local to that point. One recent study demonstrated that locally weighted regression was suitable for real-time control by constructing a locally-weighted-regression-based system that learned a difficult juggling task (Schaal and Atkeson 1994).
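A minimal one-dimensional sketch of locally weighted regression, assuming a Gaussian kernel for the weights (the downloadable program used in this study may use a different weighting function):

```python
import math

def lwr_predict(x_train, y_train, x_query, bandwidth=1.0):
    """Gaussian-weighted linear fit around x_query (1-D weighted least squares)."""
    # Training points near the query point receive the highest weights.
    w = [math.exp(-((x - x_query) ** 2) / (2 * bandwidth ** 2)) for x in x_train]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x_train)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y_train)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x_train, y_train))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x_train))
    b = num / den
    return (my - b * mx) + b * x_query

# On exactly linear data the local fit recovers the line at any query point:
print(lwr_predict([0, 1, 2, 3, 4], [1, 3, 5, 7, 9], 2.5))
```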
The neural networks used in this study consist of three layers: input, hidden, and output. The input layer receives data from the outside world and sends it to the hidden layer neurons. The hidden neurons are all the neurons between the input and output layers; they form part of the internal abstract pattern that represents the neural network's solution to the problem. The hidden layer neurons feed their output to the output layer neurons, which provide the neural network's response to the input data.
The variant of neural network used in this study is called a time delay neural network (TDNN) (Hecht-Nielsen 1990). Figure 1 shows an example of a TDNN; such networks are particularly useful for time series analysis. The neurons in a
given layer can receive delayed input from other neurons in the same layer. For
example, the network in Figure 1 receives a single input from the external
environment. The remaining nodes in the input layer get their input from the
neuron on the left delayed by one time interval. The input layer at any time
will hold a part of the time series. Such delays can also be incorporated in
other layers.
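The delayed input layer can be pictured as a sliding window over the time series; a small illustrative sketch (the window length and volumes are hypothetical):

```python
def make_delay_windows(series, window=4):
    """Pair each value with the `window` preceding values that feed the input layer."""
    return [(series[t - window:t], series[t]) for t in range(window, len(series))]

volumes = [310, 295, 330, 360, 410, 450]   # hypothetical hourly volumes
for inputs, target in make_delay_windows(volumes):
    print(inputs, "->", target)
```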
Neurons process input and produce output. Each neuron takes in the output from many other neurons. The actual output from a neuron is calculated using a transfer function. In this study, a sigmoid transfer function is chosen because it produces a continuous value in the range [0, 1]. A neuron n in a given layer is connected to neurons (n_1, n_2, …, n_m) in the previous layer. The connection from n_i to n has the weight w_i. The weights of the connections are initially assigned an arbitrary value between 0 and 1. The appropriate weights are determined during the training phase. Input to the neuron n is obtained using the following equation:

input_n = Σ_{i=1}^{m} w_i · output_{n_i}    (5)

Output from the neuron n is calculated using a sigmoid transfer function as:

output_n = 1 / (1 + e^{-input_n})    (6)
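Equations (5) and (6) amount to a weighted sum followed by a sigmoid; a direct sketch (the weights and inputs are arbitrary):

```python
import math

def neuron_output(inputs, weights):
    """Eq. (5): weighted sum of incoming outputs; Eq. (6): sigmoid transfer."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))

# Arbitrary example: a neuron with three incoming connections.
print(round(neuron_output([1.0, 0.5, -0.25], [0.2, 0.4, 0.6]), 4))
```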
It is necessary to train a neural network model
on a set of examples called the training set so that it adapts to the system it
is trying to simulate. Supervised learning is the most common form of
adaptation. In supervised learning, the correct output for the output layer is
known. Output neurons are told what the ideal response to input signals should
be. In the training phase, the network constructs an internal representation
that captures the regularities of the data in a distributed and generalized
way. The network attempts to adjust the weights of connections between neurons
to produce the desired output. The back-propagation method is used to adjust
the weights, in which errors from the output are fed back through the network,
altering weights as it goes, to prevent the repetition of the error.
The origin of genetic algorithms (GAs) is attributed to Holland's work (Holland 1975) on cellular automata. There has been significant interest in GAs
over the last two decades (Buckles and Petry 1994). The genetic algorithm is a
model of machine learning, which derives its behavior from a metaphor of the
processes of evolution in nature. This is done by the creation within a machine
of a population of individuals represented by chromosomes, in essence a set of
character strings that are analogous to the base-4 chromosomes in human DNA.
The individuals in the population then go through a process of evolution.
In practice, the evolutionary model of computation can be implemented by having arrays of bits or characters represent the chromosomes C = (c_1, c_2, …, c_n), where each c_i is called a gene. Simple bit-manipulation operations allow the implementation of crossover, mutation, and other operations. When genetic algorithms are implemented, they usually follow this cycle: (1) evaluate the fitness of all individuals in the population; (2) create a new population by performing operations such as crossover, fitness-proportionate reproduction, and mutation on the individuals whose fitness has just been measured; (3) discard the old population and iterate using the new population.
The
first generation (generation 0) of this process operates on a population of
randomly generated individuals. From there on, the genetic operations, in
concert with the fitness measure, operate to improve the population.
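The cycle above can be sketched with a toy fitness function. Truncation selection stands in for fitness-proportionate reproduction, and all parameters here are illustrative, not those used in the study:

```python
import random

def evolve(pop_size=20, genes=16, generations=50, seed=1):
    """Toy GA: fitness is the number of 1-bits in the chromosome."""
    rng = random.Random(seed)
    fitness = sum
    # Generation 0: randomly generated individuals.
    pop = [[rng.randint(0, 1) for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # (1) Evaluate fitness; keep the fitter half as parents
        # (truncation selection, a simplification of fitness-proportionate
        # reproduction).
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        # (2) Build the next generation by crossover and mutation.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genes)          # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                 # occasional single-bit mutation
                i = rng.randrange(genes)
                child[i] = 1 - child[i]
            children.append(child)
        # (3) Discard the old population and iterate.
        pop = children
    return max(pop, key=fitness)

print(sum(evolve()))
```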
Genetic Algorithms for Designing Neural
Networks
Many researchers have used GAs to determine neural network architectures. Harp et al. (1989) and Miller et al. (1989) used GAs to determine the best connections among network units. Montana and Davis (1989) used GAs for training neural networks. Chalmers (1991) developed learning rules for neural networks using GAs.
Hansen et al. (1999) used GAs to design time delay neural networks (TDNNs), including the determination of important features such as the number of inputs, the number of hidden layers, and the number of hidden neurons in each hidden layer. They applied their networks to model chemical process concentrations, chemical process temperatures, and Wolfer sunspot numbers. Their results clearly showed the advantages of TDNNs configured by GAs over other techniques, including the conventional autoregressive integrated moving average (ARIMA) methodology described by Box and Jenkins (1970).
Hansen et al.'s (1999) approach consisted of building neural networks based on the architectures indicated by the fittest chromosome. The objective of the evolution was to minimize the training error. Such an approach is computationally expensive. Another possibility, the one used in this study, is to choose the architecture of the input layer using genetic algorithms.
Lingras
and Mountford (2001) proposed the maximization of linear correlation between
input variables and the output variable as the objective for selecting the
connections between input and hidden layers. Since such an optimization is not
computationally feasible for large input layers, GAs were used to search for a
near optimal solution. It should be
noted here that since the input layer has a section of time series, it is not
possible to eliminate intermediate input neurons. They are necessary to
preserve their time delay connections. However, it is possible to eliminate
their feedforward connections. Lingras and Mountford (2001) achieved superior performance using GA-designed neural networks for the prediction of intercity traffic. The present study uses the same objective function for the development of
regression and neural network models. The developed models were used to update
missing values of traffic counts.
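The correlation-driven input selection can be sketched as follows. A random search over lag subsets stands in for the full GA machinery, and the toy series, lag ranges, and fitness definition are illustrative assumptions, not the study's actual configuration:

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def fitness(series, lags):
    """Mean |correlation| between each lagged input and the target hour."""
    start = max(lags)
    y = series[start:]
    return sum(
        abs(pearson([series[t - lag] for t in range(start, len(series))], y))
        for lag in lags
    ) / len(lags)

def select_inputs(series, n_candidates=48, n_select=3, trials=300, seed=2):
    """Random search over lag subsets, keeping the fittest chromosome."""
    rng = random.Random(seed)
    best, best_fit = None, -1.0
    for _ in range(trials):
        candidate = tuple(sorted(rng.sample(range(1, n_candidates + 1), n_select)))
        f = fitness(series, candidate)
        if f > best_fit:
            best, best_fit = candidate, f
    return best, best_fit

# A noise-free daily-periodic toy "hourly volume" series (period 24 hours):
series = [100 + 50 * math.sin(2 * math.pi * t / 24) for t in range(240)]
lags, score = select_inputs(series)
print(lags, round(score, 3))
```

As expected on a periodic series, the selected lags cluster near multiples of the 24-hour period (or half-period), where the absolute correlation with the target hour is highest.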
STUDY DATA
Currently, Alberta Transportation employs about 350 permanent traffic
counters (PTCs) to monitor its highway network. The hierarchical grouping method proposed by Sharma and Werner (1981) was used to classify these PTCs into
groups. The ratios of monthly average daily traffic (MADT) to annual average daily
traffic (AADT) (known as monthly factor MF
= MADT/AADT) were used to represent the highway sections monitored by these
PTCs during the classification. After studying group patterns from 1996 to 2000, five groups seemed appropriate to represent the study data. These groups are
labeled as commuter, regional commuter, rural long-distance, summer
recreational, and winter recreational groups. Figure 2 shows the grouping
results. It can be seen that the commuter group has a flat yearly pattern due to
stable traffic flows across the year. Regional commuter and rural long-distance
groups show higher peaks in the summer and lower troughs in the winter.
The summer recreational group has the sharpest pattern and the highest peak in the summer; its largest monthly factor (in August) is about 6 times its smallest (in January). The winter recreational group shows an interesting yearly pattern: the peak occurs in the winter season (from December to March).
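The monthly factors underlying this grouping follow directly from their definition (MF = MADT/AADT); a small sketch with hypothetical MADT and AADT values:

```python
# Hypothetical MADT values (vehicles/day) for a site with a summer peak:
madt = {"Jan": 4200, "Apr": 5100, "Jul": 7300, "Oct": 5400}
aadt = 5500

# MF = MADT / AADT for each month.
monthly_factors = {month: round(v / aadt, 2) for month, v in madt.items()}
print(monthly_factors)
```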
Six counts were selected from various groups: two from the commuter
group, two from the regional commuter group, one from the rural long-distance,
and one from the recreational group. Due to insufficient data in the winter recreational group, no counts were selected from that group. Table 1 shows the PTCs
selected from different groups, their functional classes, AADT values, and
training and test data used in this study.
Figure
3 shows daily patterns for these counts. For commuter group counts (C011145t
and C002181t) there are two peaks in a day: one is in the morning, and the
other is in the afternoon. Regional commuter count C022161t also has two peaks in a day, but they are smaller than those of the commuter counts. Even though C003061t
was classified into regional commuter group based on its yearly pattern, its
daily pattern is very similar to that of rural long-distance count
C001025t. The daily patterns of both
C003061t and C001025t have two very small peaks. However, the first peak
occurs nearly at noon, instead of in the early morning. Recreational count C093001t has only one peak, occurring nearly at noon. The majority of recreational travel takes place within a few hours in the afternoon.
For each count, four or five years' data was used in the experiments, as shown in Table 1. Five years' data was used for counts from the groups other than recreational. Since there was a large number of missing values in the 1999 data for C093001t, only four years' data was available for that count. There are no missing values in the experimental data. The data are in the form of hourly traffic volumes for both directions.
STUDY MODELS, RESULTS, AND DISCUSSION
STUDY MODELS
The models were trained and tested by assuming that a certain portion
of the data was missing. Various models were applied to estimate missing values
from six PTCs. This section gives a brief description of the models developed
in this study.
Genetically Designed Regression and Neural Network Models
Two types of genetically designed models were developed in this study. The first type consisted of short-term prediction models, which used only the data before the failure as input. For this type of model, the week-long series of hourly volumes (7 × 24 = 168) before the first missing value was used as the candidate inputs. The second type of model used data from before and after the failure as input. For these models, a week-long series of hourly volumes from each side of the occurrence of the missing value(s) was used as the candidate inputs. In total, 168 × 2 = 336 hourly volumes were presented to the GAs for selecting the 24 final inputs.
Genetically
designed regression and neural network models were applied to estimate missing
values from traffic counts. If only one hourly volume was missing, the models were applied once to update that value. If there was more than one successive missing value, the models were applied recursively to estimate them. Figure 4 shows the prototype of the models used in this study.
First, assuming there were one or more successive missing values in a traffic count, the candidate inputs were presented to the GAs for selecting the 24 final input variables. These 24 hourly volumes were chosen because, among all combinations of 24 variables from the candidate inputs, they have the maximum correlation with the traffic volume of the next hour. The next hour here is the hour whose volume will be predicted from the 24 GA-selected inputs. The GA-selected variables were used to train the neural network and regression models for traffic prediction of the next hour. The trained neural network or regression models were used to estimate the missing traffic volume of the first hour, P1. If there was more than one successive missing value, the same techniques were used to predict the second missing value, P2. However, at this stage, the candidate pattern presented to the GAs for selecting final inputs included the estimated volume of the first hour, P1, as shown in Figure 4. P1 may or may not be chosen as a final input, because different hourly models have different input selection schemes. Figure 5 shows a TDNN model with inputs selected from a week-long hourly-volume time series. The corresponding regression model also used the same inputs for prediction.
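The recursive scheme can be sketched as follows. The one-step predictor here is a deliberately simple stand-in (the same-hour average of the previous three days), not the trained TDNN or regression model; the point is only that each estimate re-enters the candidate-input pool for the next missing hour:

```python
def impute_recursively(history, n_missing, period=24):
    """Fill n_missing successive hours; each estimate re-enters the input pool."""
    filled = list(history)
    for _ in range(n_missing):
        # Stand-in one-step predictor: mean of the same hour on the last 3 days.
        same_hour = [filled[-period * k] for k in (1, 2, 3)]
        filled.append(sum(same_hour) / len(same_hour))
    return filled[len(history):]

# With a perfectly repeating daily profile the estimates are exact:
history = [hour % 24 for hour in range(72)]   # three identical days
print(impute_recursively(history, 3))
```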
A top-down model design (Zhong et al. 2002) was used to search for models with reasonable accuracy. First, 24-hour universal models were established to test their ability; they were then further split into 12-hour universal models, single-hour models, seasonal single-hour models, day-hour models, and seasonal day-hour models. The characteristics of these models are as follows:
1. 24-hour universal models: This approach involved a single TDNN and a single regression model for all hours across the year. The advantage of such models is their simplicity of implementation during the operational phase.
2. 12-hour universal models: To improve model accuracy, 12-hour universal models were built on 12-hour (8:00 a.m. to 8:00 p.m.) observations. In other words, the observations from late evenings and early mornings were eliminated from the models.
3. Single-hour models: The 12-hour universal models were further split into 12 single-hour models; that is, every hour had a separate model.
4. Seasonal single-hour models: Seasons have a definite impact on travel, so further dividing single-hour models into seasonal models may improve model accuracy. The yearly single-hour models were further split into May-August single-hour models and July-August single-hour models.
5. Day-hour models: Travel patterns also vary by the day of the week. Further classifying the observations into groups for different days (e.g., Wednesday or Saturday) may improve model accuracy.
6. Seasonal day-hour models: The day-hour models were further split into seasonal day-hour models (e.g., July-August day-hour models) to explore models with higher accuracy.
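The grouping at the finest level of this hierarchy (seasonal day-hour models) can be sketched with the standard library alone; the timestamped records below are hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def seasonal_day_hour_groups(records, months=(7, 8)):
    """Partition (timestamp, volume) records into seasonal day-hour
    groups: one group per (day-of-week, hour-of-day) pair, keeping
    only observations from the given months (July-August here)."""
    groups = defaultdict(list)
    for ts, volume in records:
        if ts.month in months:
            groups[(ts.weekday(), ts.hour)].append(volume)
    return groups

# hypothetical year (2000) of hourly observations, volume = hour of day
start = datetime(2000, 1, 1)
records = [(start + timedelta(hours=h), float(h % 24)) for h in range(8784)]
groups = seasonal_day_hour_groups(records)
```

Each of the 7 × 24 = 168 groups then supplies the training observations for one seasonal day-hour model (e.g., all July-August Wednesday 7:00-8:00 a.m. volumes).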
The model refinement studied by Zhong et al. (2002) was further extended in this study. Genetic algorithms were used to identify the 24 variables from the input time series that have the highest linear correlation with the output variable. Depending on their position in the input time series, the candidate input variables were labeled 1 to 168 for models that used historical data, or 1 to 336 for models based on data from before and after the failure. Each gene was allowed to take a value from 1 to 168 (or 336), and each chromosome had 24 genes. Chromosomes with higher values of linear correlation were selected for creating the next generation. The population size was set to 110, and the genetic algorithms were allowed to evolve for 1000 generations with a crossover rate of 90% and a mutation probability of 1%. The best chromosome found over the 1000 generations was used as the final solution of the search. The connections selected by the genetic algorithms were used to design and implement the neural network and regression models. That is, the coefficients or weights of the 24 selected input variables were nonzero, while the coefficients or weights of all other input variables in the time series were set to zero in the regression or neural network models.
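A sketch of this GA search follows. The fitness of a chromosome is taken here as the sum of the absolute linear correlations between its selected lags and the output, which is an illustrative simplification (the exact fitness function is not spelled out above), and the demo uses far fewer generations and a smaller population than the 110-chromosome, 1000-generation search described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(chromosome, X, y):
    # sum of |linear correlation| between each selected lag and the output
    return sum(abs(np.corrcoef(X[:, g - 1], y)[0, 1]) for g in chromosome)

def ga_select(X, y, n_genes=24, pop_size=110, generations=1000,
              p_cross=0.9, p_mut=0.01):
    """Evolve chromosomes of `n_genes` lag indices in 1..n_lags; the
    best chromosome seen over all generations is the final solution."""
    n_lags = X.shape[1]
    pop = rng.integers(1, n_lags + 1, size=(pop_size, n_genes))
    best, best_fit = pop[0].copy(), -np.inf
    for _ in range(generations):
        fits = np.array([fitness(c, X, y) for c in pop])
        if fits.max() > best_fit:
            best_fit, best = fits.max(), pop[fits.argmax()].copy()
        # fitness-proportional selection of the next generation's parents
        parents = pop[rng.choice(pop_size, size=pop_size, p=fits / fits.sum())]
        # single-point crossover on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = rng.integers(1, n_genes)
                parents[i, cut:], parents[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        # mutation: replace a gene with a random lag index
        mask = rng.random(parents.shape) < p_mut
        parents[mask] = rng.integers(1, n_lags + 1, size=int(mask.sum()))
        pop = parents
    return best

# demo on synthetic data in which lag 5 drives the output
X = rng.normal(size=(200, 40))
y = X[:, 4] + 0.1 * rng.normal(size=200)
selected = ga_select(X, y, n_genes=4, pop_size=30, generations=20)
```

The selected lag indices would then be the only inputs given nonzero weights or coefficients in the downstream TDNN or regression model.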
Factor Models
Various factor models were developed for the six experimental counts. These models were:
1. Average-history model: Historical data for the same hour from the training set were averaged to update missing values in the test set, assuming no change in traffic volume from year to year. The replacement values of missing data were calculated as:

Value = (1/N) Σ_{i=1..N} Value_i                                        (7)

Where: Value_i is the hourly volume of the same hour in year i of the training set;
N is the number of years in the training set.
2. Factor-history model: This model adds a growth factor to equation (7). The replacement values were calculated as a weighted average of the historical data, with growth factors used as the weights. The growth factors (GF) were calculated as the ratio of the AADT of the test year to the AADTs of the years in the training set. The replacement values of missing data were calculated as:

GF_i = AADT_test / AADT_i,training                                      (8)

Value = (1/N) Σ_{i=1..N} GF_i × Value_i                                 (9)

Where: AADT_i,training is the AADT of year i in the training set;
AADT_test is the AADT of the test year;
N is the number of years in the training set.
3. Single-hour monthly factor models: Monthly factors from the training set were used to calculate the replacement values in the test set. The monthly factors from the months before and after the failure month were used in the calculations as:

Value = (1/2) [Value_{i-1} × (mf_i / mf_{i-1}) + Value_{i+1} × (mf_i / mf_{i+1})]    (10)

Where: mf_i is the average monthly factor of the failure month, calculated from the training set;
mf_{i-1} and mf_{i+1} are the average monthly factors of the months before and after the failure month, respectively, also calculated from the training set;
Value_{i-1} and Value_{i+1} are the hourly volumes of the same hour in the months before and after the failure month, respectively.
4. Day-hour monthly factor models: The previous model does not incorporate the day-of-week effect. The day-hour monthly factor models used the same equation as the single-hour monthly factor models, but Value_{i-1} and Value_{i+1} were hourly volumes from the same hour and day of the week.
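The first two factor models can be sketched as follows; this is a minimal illustration in which the replacement value is the plain (or growth-factor-weighted) average of the same hour's historical volumes, as the text describes. The volumes and AADTs below are made-up numbers, not data from the study.

```python
def average_history(values):
    """Average-history model: mean of the same hour's volume over the
    N training years, assuming no year-to-year traffic growth."""
    return sum(values) / len(values)

def factor_history(values, aadt_train, aadt_test):
    """Factor-history model: historical volumes are weighted by growth
    factors GF_i = AADT_test / AADT_i before averaging."""
    growth_factors = [aadt_test / a for a in aadt_train]
    return sum(gf * v for gf, v in zip(growth_factors, values)) / len(values)

# hypothetical same-hour volumes from four training years
volumes = [950.0, 1000.0, 1050.0, 1100.0]
aadts = [38000.0, 39500.0, 40500.0, 41000.0]   # AADTs of those years
plain = average_history(volumes)               # 1025.0
grown = factor_history(volumes, aadts, 41575.0)
```

Because the test-year AADT exceeds every training-year AADT here, all growth factors are above 1 and the factor-history estimate comes out above the plain historical average, which is exactly the correction the growth factors are meant to supply.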
Autoregressive Integrated Moving Average (ARIMA) Models
Two types of ARIMA models were developed in this study. Their characteristics were as follows:
1. Seasonal ARIMA models for updating one-week-long missing values: The previous 8 weeks' data were used to estimate the 168 missing hourly volumes of the 9th week.
2. July-August day-hour ARIMA models for updating 12 successive missing hourly values: The previous 8 days' (same day of the week) data were used to predict the 12 missing hourly volumes of the 9th day (from 8:00 a.m. to 8:00 p.m.).
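As an aside, the one-step forecasts of an ARIMA(0,1,1) model coincide with simple exponential smoothing, which gives a compact way to sketch the day-hour scheme: smooth the same hour's volumes over the previous 8 same-weekday observations and use the final level as the estimate. The smoothing constant and the volumes below are illustrative assumptions, not fitted values from the study.

```python
def ses_forecast(history, alpha=0.3):
    """Simple exponential smoothing: recursively update a level and
    return it as the one-step-ahead forecast.  Its forecasts coincide
    with ARIMA(0,1,1) with MA coefficient theta = 1 - alpha."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# hypothetical volumes of the same hour on the previous 8 Wednesdays
prev_wednesdays = [820.0, 835.0, 810.0, 845.0, 830.0, 850.0, 840.0, 855.0]
estimate = ses_forecast(prev_wednesdays)
```

A fitted seasonal ARIMA model would estimate the MA coefficient from the data rather than fix alpha, but the recursion above shows the basic mechanics of forecasting the 9th same-weekday hour from the previous 8.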
All the models were trained and tested. Depending on the model, the number of patterns or observations varied. The absolute percentage error was calculated as:

Absolute percentage error = |actual volume - estimated volume| / actual volume × 100    (11)

The key evaluation parameters consisted of the average and the 50th, 85th, and 95th absolute percentile errors. These statistics give a clear profile of the model error distribution.
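These evaluation statistics can be computed directly; a small sketch assuming the absolute percentage error is |actual - estimated| / actual × 100, with made-up sample volumes:

```python
import numpy as np

def error_profile(actual, estimated):
    """Absolute percentage errors summarized by the average and the
    50th, 85th and 95th percentiles, as used to evaluate the models."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ape = np.abs(actual - estimated) / actual * 100.0
    return {"average": float(ape.mean()),
            "50th": float(np.percentile(ape, 50)),
            "85th": float(np.percentile(ape, 85)),
            "95th": float(np.percentile(ape, 95))}

# hypothetical hourly volumes: actual vs. model estimates
profile = error_profile([100, 200, 400, 500], [98, 204, 392, 510])
```

Reporting the upper percentiles alongside the average is what exposes models whose typical error is small but whose worst-case errors are large.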
RESULTS AND DISCUSSION
Various models described in the previous section were tested on the data from six PTCs. This paper presents only some of the key results for illustration. For example, the results from the initial models with low accuracy are summarized for only one count, while the results from the refined models are presented for all six counts.
The present study uses the best refinement suggested by Zhong et al. (2002) to develop genetically designed models based on data from before and after the failure. Factor models were used as benchmark models, and ARIMA models were also developed for comparison purposes. The results for the commuter count C002181t are used here for illustration.
Factor models were used as benchmarks since they typically represent current practice. These models included average-history models, factor-history models, single-hour monthly factor models, and day-hour monthly factor models. Both yearly and seasonal factor models were developed. Table 2 shows the errors of the four factor models for updating missing values in July and August 2000 for C002181t. Average errors for the average-history models are usually more than 10%. For the factor-history and single-hour monthly factor models, the average errors are usually between 5% and 10%. Most average errors for the day-hour monthly factor models are less than 7%, and nearly half of them are below 5%. Most 95th percentile errors for the average-history models are more than 20%. For the factor-history models, half of the 95th percentile errors are more than 20%. Most 95th percentile errors for the single-hour monthly factor models are less than 20%. The day-hour monthly factor models have the most accurate results, with most 95th percentile errors less than 12%.
Time series analysis models were also developed for comparison. Two types of autoregressive integrated moving average (ARIMA) models were developed. The first type was a seasonal ARIMA model, used to update one-week-long missing values with the previous two months' data. Figure 6 shows how the Winters' multiplicative ARIMA model updates one-week-long missing values for C002181t in July 2000. Average errors for the ARIMA model are usually between 10% and 15%. The 95th percentile errors are less than 30% for 7 out of 24 hours.
July-August day-hour ARIMA models for updating 12-hour missing values on the 9th day using the previous 8 days' data were also developed. Figure 7 shows how ARIMA(0,1,1)(0,1,1) updates missing values on the 9th Wednesday in July and August 2000 for C002181t. Average errors are usually between 3% and 5%, and the 95th percentile errors are usually between 7% and 10%.
A set of genetically designed models was developed. For the 24-hour and 12-hour universal models, the GAs used the same input selection schema for all hours, and there was only one set of weights or coefficients for the neural network and regression models. Universal neural network and regression models respond to all input patterns with the same computation strategy, which led to high prediction errors. For example, for the 24-hour universal neural network model, the highest 95th percentile error was up to 171.58%, and even the lowest was around 56%. In most cases, the regression model performed better than the neural network model. However, nearly all the 95th percentile prediction errors were higher than 25% for the 24-hour universal regression model.
Training
patterns were further classified into more homogeneous groups for individual
hours. Most average errors for single-hour regression and neural network models
ranged from 5% to 8%. The 95th percentile errors for regression and
neural network models were usually between 15% and 20%. The maximum errors
usually occurred in the early morning and the minimum errors usually occurred
in the afternoon.
The experience with single-hour models indicated that the observations for the same hour vary substantially over a year. Based on this observation, single-hour models were further split into seasonal single-hour models: only the observations from the same hours in a certain season (e.g., July and August) were used to develop the models. As expected, July-August single-hour models outperformed yearly single-hour models. Average errors for the July-August single-hour models were usually between 3% and 6%. The 50th percentile errors for the regression and neural network models were usually below 5%. All the 95th percentile errors for the July-August single-hour models were lower than those for the yearly single-hour models.
Since commuter travel occurs mostly on weekdays and recreational travel on weekends, the day of the week may also have a significant impact on trip patterns for certain roads. To further improve model performance, day-hour models were developed for each type of road. The observations from the same hours on the same day of the week (e.g., 7:00-8:00 a.m. on all Wednesdays) in a year were used to develop the models. Average errors for the regression day-hour models were below 6%, and most average errors for the neural network day-hour models were below 10%. The 95th percentile errors for the day-hour models were lower than those for the July-August single-hour models for only a few hours. However, the dramatic error decrease for the first single-hour model (7:00-8:00 a.m.) indicated that considering the day of the week in developing models would improve accuracy.
Previous experiments indicated that both seasons and weekdays have a large impact on travel patterns. Seasonal day-hour models resulted in even better performance. Table 3 shows the errors of the July-August day-hour models from Zhong et al.'s (2002) study. The models were used to update 12 successive missing values on Wednesdays for C002181t. The average and the 50th percentile errors for the regression models were usually below 1%. Even the 95th percentile errors were lower than 4%, and most of them were less than 2%. Genetically designed day-hour neural network models produced less accurate results than the regression models: the average and 50th percentile errors ranged from 2% to 8%, and the 95th percentile errors for most hours were lower than 11%.
It is interesting to compare the performance of Zhong et al.'s (2002) short-term prediction models with models based on data from before and after the failure (hereafter referred to as both-side models). The candidate input domain of the both-side models was extended to two weeks of hourly volumes: one week of hourly volumes before the failure and one week after the failure were used as the candidate input set. In total, there were 2 × 7 × 24 = 336 variables in the candidate input set. Both-side July-August day-hour models were developed for updating 12 successive missing values between 8:00 a.m. and 8:00 p.m.
As expected, the both-side seasonal day-hour models performed better than the short-term prediction models. Tables 4-9 show the errors of the both-side July-August day-hour models for updating 12 successive missing values on Wednesdays for the six traffic counts. For C002181t, average errors for the regression models are less than 1%, and those for the neural network models are between 3% and 7%, as shown in Table 4. All the 95th percentile errors for the regression models are less than 2%, and those for the neural network models are less than 10%.
Table 5 shows the errors of the GA-designed both-side July-August day-hour models for updating 12 successive missing values for C011145t. The average and the 50th percentile errors for the regression models are usually below 3%. The 95th percentile errors are lower than 8%, and most of them are less than 5%. The GA-designed day-hour neural network models produced less accurate results: the average errors range from 4% to 15%, and the 95th percentile errors for most hours are more than 10%. For the two regional commuter counts C003061t and C022161t, the regression models outperformed the neural network models for all hours, as shown in Tables 6 and 7. The average errors for the regression models usually range from 1% to 3%, and most 95th percentile errors are lower than 4%. For the neural network models, most average errors are between 3% and 8%, and most 95th percentile errors range from 6% to 14%.
Model performance deteriorates as traffic patterns become unstable. For example, the models for the commuter count tend to have lower errors than those for the recreational count. However, the both-side July-August day-hour regression models still performed reasonably well for the rural long-distance count C001025t, as shown in Table 8. The average and 50th percentile errors for the regression models are all below 2%, and all the 95th percentile errors are lower than 4%. The average errors for the both-side July-August day-hour neural network models range from 3% to 6%, and most of their 95th percentile errors are below 12%.
The GA-designed both-side July-August day-hour regression models also performed well for updating missing values from the recreational count C093001t. As shown in Table 9, the average errors for the regression models are below 5%, and the 95th percentile errors are lower than 9% for all hours. The neural network models were less accurate: their average errors are usually below 13%, and their 95th percentile errors usually range from 10% to 30%.
The
results clearly show that data from before and after failure are better
predictors than historical data alone. For example, the 95th
percentile error for estimating 6:00-7:00 p.m. traffic volume in July and
August for C011145t was 14.18% based on historical data. Using data from before
and after failure, the 95th percentile error was reduced to 4.10%.
Table 10 compares the performance of the July-August day-hour models based on data from before and after the failure with the corresponding factor and ARIMA models for C002181t. The genetically designed regression models have the best performance. The ARIMA models perform slightly better than the neural network models, and the neural network models are superior to the factor models. The benchmark factor models have average errors of 3-7%, with 95th percentile errors around 8-12%. The ARIMA model has average errors between 3% and 5%, and most of its 95th percentile errors range from 6% to 10%. The genetically designed neural network models performed reasonably well: average errors range from 4% to 7%, and all 95th percentile errors are lower than 10%. The average errors for the regression models are less than 1%, and even their 95th percentile errors are below 2%.
CONCLUDING REMARKS
The principle of Base Data Integrity suggests that imputed data should not be mixed with base data (AASHTO 1992). However, this does not imply that traffic measurements cannot be imputed during the analysis stage. Analysis of data from the Canadian province of Alberta and the State of Minnesota shows that a large number of traffic counts have missing values. It would be difficult to eliminate these counts from traffic analysis, and under some circumstances imputations may have to be used for further analysis.
The literature review indicated that highway agencies used simple factor or moving-average regression models for estimating missing values (New Mexico 1990; FHWA 1997). Previous research on transport-related time series (Clark 1992; Redfern et al. 1993; Watson et al. 1993) mainly focused on detecting missing values or outliers. Predictions of missing values were tested only on small pieces of time series, and comprehensive statistical evaluation is not available.
In this study, factor models, autoregressive integrated moving average (ARIMA) models, and genetically designed regression and neural network models were used to estimate missing values from six PTCs in Alberta, Canada. The performance of the models was evaluated with absolute percentile errors, and factor models were used as benchmarks. It was found that seasonal day-hour models had the best performance within each type of model. The July-August day-hour ARIMA models perform slightly better than the both-side July-August day-hour neural network models, and both the seasonal day-hour ARIMA and neural network models are superior to the corresponding benchmark factor models. For instance, when seasonal day-hour neural network, ARIMA, and factor models were used to update missing values for C002181t (Table 10), the average errors for the ARIMA and neural network models are around 4-5%, whereas those for the factor models are usually between 4% and 7%. Most 95th percentile errors for the ARIMA and neural network models are less than 10%, whereas 6 of the 12 95th percentile errors for the factor models are more than 10%. The GA-designed both-side seasonal day-hour regression models give the most accurate predictions among the different types of models. For example, when July-August day-hour regression models based on data from before and after the failure were used to update missing values for C002181t, the average errors are below 0.8% and the 95th percentile errors are lower than 1.5%.
Road type and functional class clearly influence model performance. The analysis indicates that roads belonging to different pattern groups and functional classes have different short-term traffic patterns; roads from higher functional classes usually have more stable short-term traffic patterns. The models usually performed better for counts belonging to road groups with stable patterns and higher functional classes. For example, when both-side July-August day-hour regression models were used to update missing values for counts with stable patterns, the average errors are usually below 2% and the 95th percentile errors are lower than 4%. For counts with relatively unstable patterns (e.g., the recreational count), most average errors are below 4% and the 95th percentile errors are lower than 9%.
The small estimation errors of the sophisticated models developed in this study reflect the models' stability and suitability for updating missing values. It is believed that these models would be helpful to highway agencies in their traffic data programs.
The results presented in this study may have additional applications. Watson et al. (1993) described methods for detecting outliers; the models presented in this study can be used to impute more reasonable values for such outliers in traffic data.
The missing-value estimation models proposed in this study provide highly accurate results. It would be interesting to find out the effects of these imputations on estimates of important traffic parameters, such as AADT and DHV.
ACKNOWLEDGMENTS
The authors are grateful to NSERC, Canada, for its financial support. The authors would also like to thank Alberta Transportation and the Minnesota Department of Transportation for the data used in this study.
REFERENCES
Ahmed, M.S. and Cook, A.R., 1979. Analysis
of Freeway Traffic Time-Series Data by Using Box-Jenkins Techniques. Transportation Research Record 722,
Transportation Research Board, Washington D.C., pp. 1-9.
Albright, D., 1991a. History of Estimating and Evaluating
Annual Traffic Volume Statistics. Transportation Research Record 1305, Transportation Research Board,
Washington, D.C, pp. 103-107.
Albright, D., 1991b. An
Imperative for, and Current Progress toward, National Traffic Monitoring
Standards. ITE Journal, Vol. 61, No. 6, pp.
23-26.
American Association of State Highway and Transportation Officials (AASHTO), 1992. Guidelines for Traffic Data Programs. Washington, D.C.
American Society for Testing and Materials (ASTM), 1991. Standard Practice E1442, Highway Traffic Monitoring Standards. Philadelphia, PA.
Beveridge, S., 1992. Least Squares Estimation of Missing Values in Time
Series. Communications in Statistics: Theory and Methods, 21, no. 12.
Bole, V., Cepar, D. and Radalj, Z., 1990. Estimating Missing Values in
Time Series. Methods of Operations Research, no. 62.
Box, G. and Jenkins, J., 1970. Time Series Analysis: Forecasting
and Control. Holden-Day, San
Francisco.
Buckles, B.P. and Petry, F.E., 1994. Genetic Algorithms. IEEE Computer
Press, Los
Alamitos, California.
Chalmers, D., 1991. The evolution of learning: An
experiment in genetic connectionism. In: D. Touretzky et al. (Eds.), Connectionist Models: Proceedings of
the 1990 Summer School. San Mateo, Morgan
Kaufmann.
Chatfield, C., 1989. The Analysis of Time Series: An Introduction.
Fourth Edition, Chapman and Hall, New York.
Clark, S.D., 1992. Application of Outlier Detection and Missing Value
Replacement Techniques to Various Forms of Traffic Count Data. ITS Working
Paper 384, University of Leeds, the United Kingdom.
FHWA's Scanning Program, 1997. FHWA Study Tour for European Traffic
Monitoring Programs and Technologies.
Federal Highway Administration, U.S. Department of Transportation, Washington,
D.C.
Friedman, J.H., 1995. Intelligent Local Learning For Prediction in High
Dimensions. International Conference on Artificial Neural Networks, Paris,
France.
Garber, N.J. and Hoel, L.A., 1999. Traffic and Highway Engineering.
Second Edition, Brooks/Cole Publishing Company.
Gupta, A. and Lam, M. S., 1996. Estimating Missing Values using Neural
Networks. Journal of the Operational Research Society, 47, no. 2, pp.229-239.
Hansen, J.V., McDonald, J.B. and Nelson, R.D., 1999. Time Series Prediction with
Genetic Algorithm Designed Neural Networks: An Experimental Comparison with
Modern Statistical Models. Computational Intelligence,
15(3), pp. 171-184.
Harp,
S., Samad, T. and Guha, A., 1989. Towards the Genetic Synthesis of neural networks.
In: D. Shaffer (Ed.), Proceedings of the Third International Conference on
Genetic Algorithms. San Mateo, Morgan Kaufmann.
Hecht-Nielsen, R., 1990.
Neurocomputing. Addison-Wesley Pub. Co, Don Mills, Ontario.
Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan
Press.
Lingras, P.J. and Mountford, P., 2001. Time Delay
Neural Networks Designed Using Genetic Algorithms for Short Term Inter-City
Traffic Forecasting. Proceedings of the
Fourteenth International Conference on Industrial and Engineering Applications
of Artificial Intelligence and Expert Systems. Budapest, Hungary, pp.
292-299.
Little, R.A. and Rubin, D.B., 1987. Statistical Analysis with Missing
Data. Wiley, New York.
Locally Weighted Polynomial Regression, 2001. http://www.autonlab.org/lwpr_src/lwpr.html#1.
Mendenhall, W. and Sincich, T., 1995. Statistics for Engineering and
Science. Fourth Edition,
Prentice-Hall, Inc.
Miller, G., Todd, P. and
Hedge, S., 1989. Designing neural
networks using genetic algorithms. In: D.
Shaffer (Ed.), Proceedings of the Third International Conference on Genetic
Algorithms. San Mateo, Morgan Kaufmann.
Montana, D. and Davis, L.,
1989. Training feedforward networks using genetic algorithms. In: N. Sridhara (Ed.), Proceedings of
Eleventh International Joint Conference on Artificial Intelligence, pp.
762-767.
New Mexico State Highway and Transportation
Department, 1990. 1990 Survey of Traffic Monitoring Practices among State
Transportation Agencies of the United States. Report No. FHWA-HRP-NM-90-05.
Santa Fe, New Mexico.
Nihan, N.L. and Holmesland, K.O., 1980. Use of the Box and Jenkins Time
Series Technique in Traffic Forecasting. Transportation 9, pp. 125-143.
Ramsey, B. and Hayden, G., 1994. AutoCounts: A Way to Analyse automatic
traffic count data. Traffic Engineering and Control.
Ratner, B., 2000. CHAID as a Method for Filling in Missing Values. http://www.dmstat1.com/missing.html (to appear in Journal of Targeting, Measurement and Analysis for Marketing).
Redfern, E.J., Watson, S.M., Tight, M.R., and Clark, S.D., 1993. A
Comparative Assessment of Current and New Techniques for Detecting Outliers and
Estimating Missing Values in Transport Related Time Series Data. Proceedings of
Highways and Planning Summer Annual Meeting, Institute of Science and
Technology, University of Manchester, England.
Schaal, S. and Atkeson, C., 1994. Robot juggling: An implementation of
memory-based learning. Control Systems, 14, pp. 57-71.
Sharma, S.C., 1983. Minimizing Cost of Manual Traffic Counts: Canadian
Example. Transportation Research Record 905,
Transportation Research Board, Washington, D.C, pp. 1-7.
Sharma, S.C., Kilburn, P. and Wu, Y.Q., 1996. The
precision of AADT Volumes Estimates from Seasonal Traffic Counts: Alberta
Example. Canadian Journal of Civil
Engineering, Vol. 23, No. 1.
Sharma, S.C. and Werner, A., 1981. Improved
Method of Grouping Provincewide Permanent Traffic Counters. Transportation
Research Record 815, Transportation Research Board, Washington D.C., pp. 12-18.
Singh, V.P. and Harmancioglu, N.B., 1996. Estimation of Missing Values
with Use of Entropy. NATO Advanced Research Workshop, Izmir, Turkey, pp.
267-274.
Southworth, F., Chin, S. M. and Cheng, P. D., 1989. A Telemetric
Monitoring And Analysis System for Use during Large Scale Population
Evacuations. Proceedings of IEEE 2nd International Conference on
Road Traffic Monitoring, London, U.K.
Watson, S.M., Clark, S.D., Redfern, E.J. and Tight, M.R., 1993. Outlier
Detection and Missing Value Estimation in Time Series Traffic Count Data.
Proceedings of the Sixth World Conference on Transportation Research, Vol. II,
pp. 1151-1162.
Williams, B.M., Durvasula, P.K. and Brown, D.E., 1998. Urban Freeway
Traffic Flow Prediction: Application of Seasonal Autoregressive Integrated
Moving Average And Exponential Smoothing Models. Transportation Research Record
1644, Transportation Research Board, Washington, D.C., pp.132-141.
Wright, P.M., 1993. Filling in the Blanks: Multiple Imputation for
Replacing Missing Values in Survey Data. SAS 18th Annual Conference,
New York, NY.
Zhong, M., Lingras, P.J., and Sharma, S.C., 2002. Applying Short-term
Traffic Prediction Models for Updating Missing Values of Traffic Counts,
submitted to the Journal of Transportation Engineering, American Society of
Civil Engineers.
Figure 1. Time Delay Neural Network Design
Figure 6. Updating One-week Long Missing Values in July for C002181t with the Winters' Multiplicative ARIMA Model
Figure 7. Using the Prior 8 Wednesdays' Data to Update 12-hour Missing Values for the 9th Day in July and August for C002181t with the ARIMA(0,1,1)(0,1,1) Model
Table 2. Comparing Four Factor Models for Updating Missing Values in July and August for C002181t
Table 3. Errors for Updating 12 Successive Missing Values with July-August Day-hour Models for C002181t
Table 4. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C002181t
Table 5. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C011145t
Table 6. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C003061t
Table 7. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C022161t
Table 8. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C001025t
Table 9. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C093001t
Road Class          | Counter Name | AADT  | Functional Class   | Training Set | Testing Set
Commuter            | C011145t     | 4042  | Minor Collector    | 1996-1999    | 2000
Commuter            | C002181t     | 41575 | Principal Arterial | 1996-1999    | 2000
Regional Commuter   | C003061t     | 3580  | Minor Collector    | 1996-1999    | 2000
Regional Commuter   | C022161t     | 3905  | Major Collector    | 1996-1999    | 2000
Rural Long-distance | C001025t     | 13627 | Minor Arterial     | 1996-1999    | 2000
Recreation          | C093001t     | 2002  | Major Collector    | 1996-1998    | 2000
Table 2. Comparing Four Factor Models for Updating Missing Values in July and August for C002181t

Hour  | Average (A-H  F-H  S-H  D-H) | 50th % (A-H  F-H  S-H  D-H) | 85th % (A-H  F-H  S-H  D-H) | 95th % (A-H  F-H  S-H  D-H)
07-08 | 17.41 13.26 12.86  6.99 |  9.91  2.79  7.81  8.01 | 13.28  7.70 12.91  9.45 | 45.28 42.06 24.47 11.58
08-09 | 12.84  8.62 10.04  7.30 |  9.42  2.95  6.66  6.54 | 13.31  9.56 13.53 11.79 | 36.04 30.75 19.91 15.39
09-10 | 11.77  6.27  6.26  5.44 | 10.63  3.62  4.81  5.12 | 15.93  8.52 10.11  8.73 | 26.32 23.16 21.96  9.18
10-11 | 10.97  4.43  5.19  4.18 | 11.27  3.57  4.44  3.97 | 15.88  8.49 10.21  6.67 | 18.88 11.72 13.96  9.17
11-12 | 10.88  4.65  5.68  7.00 | 10.62  3.45  4.83  6.71 | 16.27  8.11 10.58 13.01 | 19.34 13.04 13.84 13.75
12-13 | 11.66  5.32  5.58  6.62 | 11.73  3.18  4.84  7.29 | 16.02  9.30  9.39 10.35 | 20.99 14.38 13.97 11.01
13-14 | 10.97  5.97  5.44  5.74 | 10.45  3.21  3.74  5.18 | 16.02 11.33 10.70  9.64 | 20.95 20.25 14.71 11.09
14-15 | 11.68  6.66  5.22  2.48 | 11.10  3.95  3.71  1.96 | 17.06  9.79  8.89  4.38 | 22.91 25.92 15.67  5.96
15-16 | 12.20  6.12  5.35  3.65 | 11.24  3.81  3.96  2.90 | 18.05 10.58 10.44  5.38 | 20.35 21.93 13.21  8.75
16-17 | 10.64  4.43  4.95  3.45 | 10.61  3.21  3.81  2.10 | 14.92  8.29  8.38  6.91 | 19.23 10.51 11.38  7.48
17-18 |  9.79  4.25  4.65  3.04 |  9.95  2.41  3.66  2.58 | 12.70  8.96  8.67  5.21 | 16.12 11.62 12.58  8.02
18-19 | 10.23  6.65  5.99  6.10 |  8.84  4.40  4.70  5.96 | 15.15 10.94 11.78  9.68 | 20.49 19.09 17.08 11.17

A-H: Average-History Model; F-H: Factor-History Model; S-H: Single-Hour Monthly Factor Model; D-H: Day-hour Monthly Factor Model
Table 3. Errors for Updating 12 Successive Missing Values with July-August Day-hour Models for C002181t

Hour  | Average (Reg.  ANN) | 50th % (Reg.  ANN) | 85th % (Reg.  ANN) | 95th % (Reg.  ANN)
07-08 | 0.25  2.14 | 0.21  1.90 | 0.39  3.79 | 0.57  4.43
08-09 | 0.59  2.19 | 0.48  2.38 | 0.75  3.16 | 1.19  4.28
09-10 | 1.66  3.03 | 1.14  3.58 | 3.19  4.25 | 3.47  6.06
10-11 | 0.38  6.71 | 0.31  6.50 | 0.43  7.33 | 1.04 10.48
11-12 | 0.56  2.13 | 0.50  1.90 | 0.74  3.18 | 1.03  4.17
12-13 | 0.81  2.90 | 0.98  3.26 | 1.25  4.54 | 1.73  5.67
13-14 | 0.59  2.23 | 0.56  2.39 | 1.02  3.38 | 1.07  4.86
14-15 | 0.89  4.69 | 0.50  5.54 | 1.35  6.63 | 2.55  6.98
15-16 | 0.68  6.63 | 0.88  6.99 | 0.97 10.12 | 1.09 10.74
16-17 | 0.44  7.82 | 0.29  7.93 | 0.64  8.97 | 1.25 11.82
17-18 | 0.31  6.45 | 0.31  6.68 | 0.42  8.01 | 0.68 10.34
18-19 | 0.84  7.10 | 0.75  6.67 | 1.39 11.93 | 1.73 14.00
Table 4. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C002181t

Hour  | Average (Reg.  ANN) | 50th % (Reg.  ANN) | 85th % (Reg.  ANN) | 95th % (Reg.  ANN)
07-08 | 0.40  6.25 | 0.44  5.54 | 0.52  8.23 | 0.76  9.11
08-09 | 0.20  4.03 | 0.11  3.45 | 0.36  6.75 | 0.54  8.14
09-10 | 0.70  5.39 | 0.36  5.46 | 1.53  8.69 | 1.54  9.35
10-11 | 0.47  3.38 | 0.31  3.19 | 0.81  4.58 | 1.06  7.89
11-12 | 0.42  3.66 | 0.41  2.90 | 0.77  6.86 | 0.85  7.06
12-13 | 0.47  5.56 | 0.28  5.66 | 0.92  8.52 | 1.21  9.29
13-14 | 0.40  4.73 | 0.44  4.70 | 0.64  6.96 | 0.81  7.86
14-15 | 0.64  6.20 | 0.53  5.77 | 1.01  7.80 | 1.14  9.00
15-16 | 0.45  4.95 | 0.35  5.50 | 0.82  6.82 | 1.04  7.11
16-17 | 0.77  6.78 | 0.84  7.24 | 1.20  9.55 | 1.25  9.62
17-18 | 0.78  4.23 | 0.75  3.83 | 1.41  5.29 | 1.52  8.13
18-19 | 0.54  4.49 | 0.43  4.04 | 1.09  7.73 | 1.18  9.00
Table 5. Errors for Updating 12 Successive Missing Values with Both-side
July-August Day-hour Models for C011145t
Hour (1) | Average | | 50th % | | 85th % | | 95th % | |
| Reg. (2) | ANN (3) | Reg. (4) | ANN (5) | Reg. (6) | ANN (7) | Reg. (8) | ANN (9) |
07-08 | 0.68 | 6.56 | 0.53 | 6.41 | 1.10 | 9.67 | 1.38 | 13.68 |
08-09 | 2.88 | 15.30 | 2.77 | 18.55 | 4.62 | 22.96 | 5.08 | 23.76 |
09-10 | 5.06 | 11.80 | 5.23 | 6.93 | 6.48 | 20.45 | 7.21 | 24.90 |
10-11 | 1.51 | 9.01 | 1.73 | 4.70 | 2.62 | 17.72 | 2.86 | 20.50 |
11-12 | 2.50 | 7.44 | 3.03 | 5.06 | 3.56 | 12.40 | 4.46 | 15.39 |
12-13 | 2.31 | 7.97 | 2.34 | 3.99 | 2.87 | 18.46 | 3.14 | 22.92 |
13-14 | 0.75 | 4.25 | 0.74 | 2.60 | 1.27 | 8.73 | 1.70 | 9.93 |
14-15 | 1.08 | 9.14 | 1.06 | 9.73 | 1.54 | 12.39 | 2.04 | 14.67 |
15-16 | 1.10 | 6.55 | 1.05 | 3.47 | 1.96 | 10.29 | 2.48 | 17.80 |
16-17 | 3.94 | 7.81 | 3.84 | 7.38 | 5.44 | 11.96 | 6.33 | 14.58 |
17-18 | 2.13 | 6.03 | 1.93 | 4.12 | 2.21 | 9.41 | 2.91 | 12.62 |
18-19 | 2.85 | 9.47 | 3.22 | 12.02 | 3.82 | 15.70 | 4.10 | 17.98 |
Table 6. Errors for Updating 12 Successive Missing Values with Both-side
July-August Day-hour Models for C003061t
Hour (1) | Average | | 50th % | | 85th % | | 95th % | |
| Reg. (2) | ANN (3) | Reg. (4) | ANN (5) | Reg. (6) | ANN (7) | Reg. (8) | ANN (9) |
07-08 | 0.66 | 6.14 | 0.58 | 4.56 | 1.09 | 11.14 | 1.15 | 13.98 |
08-09 | 1.07 | 5.37 | 0.68 | 4.43 | 1.89 | 9.46 | 2.05 | 12.12 |
09-10 | 2.66 | 2.89 | 2.09 | 2.46 | 4.76 | 4.34 | 5.50 | 6.46 |
10-11 | 0.86 | 3.93 | 0.65 | 2.96 | 1.80 | 6.48 | 1.87 | 8.43 |
11-12 | 1.12 | 4.60 | 0.76 | 3.11 | 1.63 | 6.65 | 2.84 | 13.73 |
12-13 | 1.06 | 4.27 | 1.31 | 3.70 | 1.98 | 7.76 | 2.16 | 11.21 |
13-14 | 2.20 | 8.07 | 2.59 | 8.41 | 3.40 | 11.54 | 3.64 | 13.04 |
14-15 | 1.24 | 5.31 | 1.09 | 4.50 | 2.01 | 9.09 | 2.46 | 12.26 |
15-16 | 0.47 | 4.88 | 0.37 | 3.01 | 0.77 | 7.91 | 0.88 | 11.04 |
16-17 | 1.20 | 3.75 | 1.35 | 4.41 | 1.79 | 5.67 | 2.17 | 8.12 |
17-18 | 1.25 | 2.96 | 1.41 | 3.18 | 1.74 | 4.47 | 3.02 | 6.67 |
18-19 | 1.85 | 5.22 | 0.65 | 4.23 | 5.15 | 1.03 | 5.85 | 14.19 |
Table 7. Errors for Updating 12 Successive Missing Values with Both-side
July-August Day-hour Models for C022161t
Hour (1) | Average | | 50th % | | 85th % | | 95th % | |
| Reg. (2) | ANN (3) | Reg. (4) | ANN (5) | Reg. (6) | ANN (7) | Reg. (8) | ANN (9) |
07-08 | 1.20 | 8.03 | 0.86 | 7.49 | 2.00 | 13.16 | 2.51 | 17.73 |
08-09 | 1.49 | 3.55 | 1.32 | 4.01 | 2.88 | 5.95 | 3.05 | 6.42 |
09-10 | 1.24 | 8.56 | 1.23 | 5.95 | 1.91 | 15.17 | 2.23 | 22.50 |
10-11 | 1.36 | 4.62 | 1.40 | 5.45 | 1.96 | 6.73 | 2.83 | 6.91 |
11-12 | 0.82 | 5.80 | 0.37 | 4.71 | 1.79 | 7.21 | 2.24 | 11.86 |
12-13 | 2.11 | 4.77 | 1.74 | 2.69 | 3.08 | 9.44 | 5.66 | 10.84 |
13-14 | 0.83 | 6.06 | 0.75 | 4.80 | 1.44 | 8.92 | 1.79 | 13.30 |
14-15 | 1.54 | 5.98 | 1.66 | 5.53 | 1.81 | 9.60 | 3.06 | 14.06 |
15-16 | 1.15 | 5.12 | 0.66 | 5.55 | 1.78 | 8.25 | 3.00 | 8.96 |
16-17 | 1.91 | 5.64 | 1.55 | 5.29 | 2.79 | 7.16 | 3.69 | 8.40 |
17-18 | 0.82 | 5.25 | 0.80 | 3.86 | 1.14 | 9.71 | 1.49 | 12.22 |
18-19 | 1.00 | 3.35 | 1.14 | 2.83 | 1.71 | 5.73 | 1.82 | 6.96 |
Table 8. Errors for Updating 12 Successive Missing Values with Both-side
July-August Day-hour Models for C001025t
Hour (1) | Average | | 50th % | | 85th % | | 95th % | |
| Reg. (2) | ANN (3) | Reg. (4) | ANN (5) | Reg. (6) | ANN (7) | Reg. (8) | ANN (9) |
07-08 | 0.99 | 4.78 | 0.70 | 3.79 | 2.33 | 7.18 | 2.51 | 11.47 |
08-09 | 1.19 | 3.96 | 1.27 | 4.17 | 1.75 | 5.57 | 1.91 | 7.72 |
09-10 | 0.52 | 3.74 | 0.41 | 3.66 | 0.78 | 7.74 | 1.18 | 9.18 |
10-11 | 1.25 | 3.80 | 0.79 | 3.78 | 2.57 | 5.77 | 3.49 | 7.23 |
11-12 | 0.50 | 5.40 | 0.22 | 3.05 | 0.90 | 8.80 | 1.45 | 16.08 |
12-13 | 0.31 | 3.74 | 0.30 | 3.37 | 0.46 | 6.00 | 0.58 | 7.31 |
13-14 | 0.64 | 4.09 | 0.54 | 4.76 | 1.06 | 6.13 | 1.31 | 7.74 |
14-15 | 0.32 | 5.99 | 0.18 | 6.61 | 0.49 | 9.41 | 0.91 | 11.36 |
15-16 | 1.26 | 3.90 | 1.12 | 4.20 | 1.71 | 5.82 | 3.31 | 6.10 |
16-17 | 0.74 | 3.77 | 0.55 | 3.86 | 1.28 | 6.09 | 1.54 | 6.57 |
17-18 | 0.90 | 4.99 | 0.79 | 4.38 | 1.21 | 7.54 | 1.68 | 8.25 |
18-19 | 0.51 | 4.24 | 0.51 | 3.49 | 0.64 | 7.39 | 0.85 | 9.54 |
Table 9. Errors for Updating 12 Successive Missing Values with Both-side
July-August Day-hour Models for C093001t
Hour (1) | Average | | 50th % | | 85th % | | 95th % | |
| Reg. (2) | ANN (3) | Reg. (4) | ANN (5) | Reg. (6) | ANN (7) | Reg. (8) | ANN (9) |
07-08 | 2.59 | 41.61 | 1.57 | 25.99 | 5.50 | 69.14 | 6.21 | 117.41 |
08-09 | 2.01 | 12.17 | 1.48 | 10.47 | 3.86 | 15.87 | 4.21 | 26.94 |
09-10 | 3.37 | 11.18 | 1.60 | 6.08 | 5.28 | 17.36 | 8.79 | 27.68 |
10-11 | 1.98 | 12.76 | 1.82 | 11.96 | 3.52 | 25.65 | 4.28 | 30.78 |
11-12 | 0.40 | 6.66 | 0.32 | 5.92 | 0.78 | 9.95 | 0.80 | 14.46 |
12-13 | 0.76 | 6.56 | 0.45 | 7.06 | 1.43 | 10.54 | 2.04 | 11.04 |
13-14 | 0.93 | 4.06 | 0.95 | 3.55 | 1.29 | 6.48 | 1.90 | 7.72 |
14-15 | 1.11 | 5.39 | 0.89 | 4.21 | 2.06 | 8.05 | 2.42 | 11.79 |
15-16 | 2.22 | 10.65 | 2.32 | 12.82 | 3.21 | 14.32 | 3.69 | 15.40 |
16-17 | 0.54 | 7.41 | 0.25 | 6.66 | 1.19 | 11.64 | 1.23 | 15.73 |
17-18 | 4.35 | 8.39 | 4.78 | 7.88 | 6.38 | 13.83 | 8.35 | 15.76 |
18-19 | 0.98 | 9.95 | 0.97 | 12.40 | 1.05 | 15.93 | 1.77 | 17.61 |
with Corresponding Factor and ARIMA Models for C002181t
Hour (1) | Average | | | | 50th % | | | | 85th % | | | | 95th % | | | |
| FACT (2) | ARIMA (3) | REG (4) | ANN (5) | FACT (6) | ARIMA (7) | REG (8) | ANN (9) | FACT (10) | ARIMA (11) | REG (12) | ANN (13) | FACT (14) | ARIMA (15) | REG (16) | ANN (17) |
07-08 | 6.99 | 4.86 | 0.40 | 6.25 | 8.01 | 4.93 | 0.44 | 5.54 | 9.45 | 7.72 | 0.52 | 8.23 | 11.58 | 9.13 | 0.76 | 9.11 |
08-09 | 7.30 | 4.67 | 0.20 | 4.03 | 6.54 | 4.87 | 0.11 | 3.45 | 11.79 | 6.71 | 0.36 | 6.75 | 15.39 | 9.73 | 0.54 | 8.14 |
09-10 | 5.44 | 3.03 | 0.70 | 5.39 | 5.12 | 2.08 | 0.36 | 5.46 | 8.73 | 4.11 | 1.53 | 8.69 | 9.18 | 7.93 | 1.54 | 9.35 |
10-11 | 4.18 | 2.87 | 0.47 | 3.38 | 3.97 | 1.91 | 0.31 | 3.19 | 6.67 | 3.83 | 0.81 | 4.58 | 9.17 | 8.43 | 1.06 | 7.89 |
11-12 | 7.00 | 3.71 | 0.42 | 3.66 | 6.71 | 2.26 | 0.41 | 2.90 | 13.01 | 9.16 | 0.77 | 6.86 | 13.75 | 9.43 | 0.85 | 7.06 |
12-13 | 6.62 | 3.43 | 0.47 | 5.56 | 7.29 | 3.19 | 0.28 | 5.66 | 10.35 | 5.25 | 0.92 | 8.52 | 11.01 | 6.03 | 1.21 | 9.29 |
13-14 | 5.74 | 3.31 | 0.40 | 4.73 | 5.18 | 2.43 | 0.44 | 4.70 | 9.64 | 6.73 | 0.64 | 6.96 | 11.09 | 7.60 | 0.81 | 7.86 |
14-15 | 2.48 | 4.47 | 0.64 | 6.20 | 1.96 | 3.54 | 0.53 | 5.77 | 4.38 | 7.58 | 1.01 | 7.80 | 5.96 | 9.83 | 1.14 | 9.00 |
15-16 | 3.65 | 2.38 | 0.45 | 4.95 | 2.90 | 1.47 | 0.35 | 5.50 | 5.38 | 4.49 | 0.82 | 6.82 | 8.75 | 5.20 | 1.04 | 7.11 |
16-17 | 3.45 | 4.46 | 0.77 | 6.78 | 2.10 | 4.14 | 0.84 | 7.24 | 6.91 | 6.48 | 1.20 | 9.55 | 7.48 | 7.73 | 1.25 | 9.62 |
17-18 | 3.04 | 4.03 | 0.78 | 4.23 | 2.58 | 2.84 | 0.75 | 3.83 | 5.21 | 8.37 | 1.41 | 5.29 | 8.02 | 8.83 | 1.52 | 8.13 |
18-19 | 6.10 | 5.74 | 0.54 | 4.49 | 5.96 | 5.95 | 0.43 | 4.04 | 9.68 | 8.10 | 1.09 | 7.73 | 11.17 | 13.47 | 1.18 | 9.00 |