Updating Missing Values of Traffic Counts: Factor Approaches, Time Series Analysis versus Genetically Designed Regression

and Neural Network Models

 

Ming Zhong

Doctoral Student

Faculty of Engineering, University of Regina

Regina, SK, Canada, S4S 0A2

   

     Phone: (902) 496-8152

      Fax:    (902) 420-5035

Email: Ming.Zhong@stmarys.ca

 

Pawan Lingras*

Professor

            Dept. of Mathematics and Computing Science

Saint Mary’s University

Halifax, NS, Canada, B3H 3C3

 

Phone: (902) 420-5798

Fax:     (902) 420-5035

Email: Pawan.Lingras@stmarys.ca

 

And Satish Sharma

Professor

Faculty of Engineering, University of Regina

Regina, SK, Canada, S4S 0A2

 

Phone: (306) 585-4553

Fax:     (306) 585-4855

Email: Satish.Sharma@uregina.ca

 

*Corresponding Author: Pawan Lingras

 

 


Updating Missing Values of Traffic Counts: Factor Approaches, Time Series Analysis versus Genetically Designed Regression

and Neural Network Models

Ming Zhong, Pawan Lingras, and Satish Sharma

ABSTRACT: The principle of Base Data Integrity, addressed by both the American Association of State Highway and Transportation Officials (AASHTO) and the American Society for Testing and Materials (ASTM), recommends that missing values should not be imputed in the base data. However, updating missing values may be necessary in data analysis and helpful in establishing more cost-effective traffic data programs. Analyses applied to data sets from two highway agencies show that, on average, over 50% of permanent traffic counts (PTCs) have missing values. It would be difficult to eliminate such a significant portion of data from the analysis. A literature review indicates that the limited existing research uses factor or autoregressive integrated moving average (ARIMA) models for predicting missing values. Factor-based models tend to be less accurate, and ARIMA models use only historical data. In this study, genetically designed neural network and regression models, factor models, and ARIMA models were developed to update pseudo-missing values of six PTCs from Alberta, Canada. Both short-term prediction models and models based on data from before and after the failure were developed. Factor models were used as benchmark models. It was found that genetically designed regression models based on data from before and after the failure produced the most accurate results. Average errors for refined models were lower than 1% and the 95th percentile errors were below 2% for counts with stable patterns. Even for counts with relatively unstable patterns, average errors were lower than 3% in most cases, and the 95th percentile errors were consistently below 9%. ARIMA models and genetically designed neural network models also performed better than the benchmark factor models. It is believed that the models proposed in this study would be helpful to highway agencies in their traffic data programs.

Key words: Missing values, Traffic counts, Genetic algorithms, Time delay neural network, Locally weighted regression, Autoregressive integrated moving average


INTRODUCTION

Highway agencies commit a significant portion of their resources to data collection, summarization, and analysis (Sharma et al. 1996). The data is used in planning, design, control, operation, and management of traffic and highway facilities. However, the presence of missing values makes the data analysis difficult. Without proper imputation methods, traffic counts with missing values are usually discarded and new counts have to be retaken. 

This study analyzed missing values in the data sets from two highway agencies in North America. The first data set was from Alberta Transportation, and the other was from the Minnesota Department of Transportation (MnDOT). In Alberta, over seven years, more than half of the total counts have missing values; during some years the percentage is as high as 70% to 90%. One year of data from MnDOT shows more than 40% of counts having missing values. Williams et al. (1998) applied seasonal ARIMA and exponential smoothing models to predict short-term traffic for two study sites on an urban freeway near Washington, D.C. It was reported that approximately 20 percent of the data in the development and test sets of their study were missing. Ramsey and Hayden (1994) introduced a computer program, AutoCounts, used by the Countywide Traffic and Accident Data Unit (TADU) in England to process automatic traffic count data. It was found that infill models had to be used to estimate average flows for more than 50 percent of months in many years at a study site.

There are increasing concerns about data imputation and Base Data Integrity. The principle of Base Data Integrity is an important theme discussed in both the American Society for Testing and Materials (ASTM) Standard Practice E1442, Highway Traffic Monitoring Standards (America 1991), and the American Association of State Highway and Transportation Officials (AASHTO) Guidelines for Traffic Data Programs (America 1992). The principle says that traffic measurements must be retained without modification or adjustment, and that missing values should not be imputed in the base data. However, this does not prohibit imputing data at the analysis stage. In some cases, traffic counts with missing values may be the only data available for a certain purpose, and data imputation is necessary for further analysis. In accordance with the principle of Truth-in-Data, the AASHTO Guidelines (America 1992) also recommend that highway agencies document the procedures used for editing traffic data.

For traffic counts with missing values, highway agencies usually either retake the counts or estimate the missing values. Estimating missing values is known as data imputation. Since retaking counts is sometimes impossible due to limited resources and time, imputing the data became a popular method (Albright 1991a). For example, it was reported that many highway agencies in the United States estimated missing values for their traffic counts (New Mexico 1990). In Europe, highway authorities in the Netherlands, France, and the United Kingdom all used computer programs for data validation routines; usually, missing or invalid data were replaced with historical data from the same site during the same period (FHWA 1997). The experience with data from Alberta Transportation also indicates that the agency used data imputation before 1995; the replacement values for missing data were marked with minus signs in some years. Imputing data with reasonable accuracy may help establish more cost-effective traffic data programs. The analysis of the Alberta data also shows that a significant percentage (varying from 10% to 44% from year to year) of traffic counts have missing data for several successive days or months. Usually these PTCs cannot be used to calculate AADT or DHV because of the missing data. Such PTCs may be used as seasonal traffic counts (STCs) or short-period traffic counts (SPTCs), or simply discarded by highway agencies. However, the information contained in these PTCs is certainly greater than that from STCs and SPTCs. If the missing data from such PTCs can be accurately updated, further analysis based on AADT or DHV becomes possible.

A review of the literature indicates that little research has been done on missing values in traffic counts. Most methods used by transportation practitioners were simple factor approaches or moving average regression analyses (New Mexico 1990; FHWA 1997). Two studies (Ahmed and Cook 1979; Nihan and Holmesland 1980) from the United States used Box-Jenkins techniques to predict short-term traffic for urban freeways, and the models showed reasonable accuracy. Such models can be used to update missing values for traffic counts. The models developed by Nihan and Holmesland (1980) were able to predict average weekday volumes for two months in which the entire monthly data were missing. A group of scholars at the University of Leeds, England, modeled outliers and missing values in traffic count time series by employing exponentially weighted moving average, autocorrelation based influence function, and autoregressive integrated moving average (ARIMA) models (Clark 1992; Redfern et al. 1993; Watson et al. 1993). It was found that ARIMA models outperformed the other models in detecting missing values and outliers.

In this study, factor approaches, time series analysis, and genetically designed neural network and regression models are tested on six permanent traffic counts (PTCs) from Alberta, Canada to investigate their ability to update missing values. The study also compares models based on historical data with models based on data from before and after the failure. The six PTCs belong to different groups based on their trip purpose and trip length distributions. The experiments presented in this paper illustrate how the proposed techniques can be used to update missing values of these PTCs. The techniques used in this study could be applied not only to permanent traffic volume counts, but also to seasonal or short-term traffic volume counts, vehicle classification counts, weight counts, and speed counts.

LITERATURE REVIEW

There is a significant amount of research related to missing values (Little and Rubin 1987; Bole et al. 1990; Beveridge 1992; Wright 1993; Gupta and Lam 1996; Singh and Harmancioglu 1996). However, limited research is available on how missing data are handled by transportation practitioners. Southworth et al. (1989) introduced a system called RTMAS for urban population evacuations in times of threat. One subroutine of this system is AUTOBOX, which applies Box-Jenkins time series models to hourly or daily traffic count data and allows complete autoregressive integrated moving average (ARIMA) modeling. The example in their study clearly showed that the proposed ARIMA model was good at detecting unusual traffic profiles and at predicting hourly counts. They used the past five days of data to predict the 24 hourly volumes of the same day of the next week. It was found that 22 of the hourly volumes were within the 95% confidence limits of the observed counts; the other two were detected as outliers caused by an evacuation response to the threat of Hurricane Elena. Such a system can also be used to predict missing values for traffic counts.

In 1990, the New Mexico State Highway and Transportation Department conducted a survey of traffic monitoring practice (New Mexico 1990) in the United States. It showed that when portable devices failed, 13 states used some procedure to estimate the missing values and complete the data set; when permanent devices failed, 23 states employed some procedure to estimate the missing values (Albright 1991b). Various methods were used for this purpose. For example, in Alabama, if fewer than 6 hours are missing, the data are estimated using the previous year or other data from the month; if more than 6 hours are missing, the day is voided. In Delaware, estimates of missing values are based on a straight line using the data from the months before and after the failure. Most of these methods apply simple factors to historical data to estimate missing values. In Kentucky, a computer program was used to estimate and fill in the blanks (New Mexico 1990). In 1997, the Federal Highway Administration (FHWA) conducted a study of traffic monitoring programs and technologies in Europe (FHWA 1997). It was reported that highway agencies in the Netherlands, France, and the United Kingdom used computer programs for data validation routines. For example, a software system, INTENS, was used in the Netherlands for data analysis and validation. The software used a "smart" linear interpolation process between locations from which data were available to estimate missing traffic volumes. In France, a system called MELODIE was used for data validation; validation was conducted visually by the system operator, and invalid data were replaced with the previous month's data. Several data validation schemes were used in the United Kingdom. One of them was used by the Central Transport Group (CTG) to validate permanent recorder data; invalid data were replaced with data extracted from the valid data of the previous week collected at that site. No research has been found that assesses the accuracy of such imputations.

A series of studies (Clark 1992; Redfern et al. 1993; Watson et al. 1993) was carried out by a group of scholars at the University of Leeds, England, in the early 1990s. Redfern et al. (1993) tested four types of models on four traffic time series supplied by the Department of Transportation (DOT) in London. These models were an exponentially weighted moving average, an autocorrelation based influence function, an ARIMA model using large residuals, and an ARIMA model using the Tsay likelihood ratio diagnostics. It was reported that the estimation of replacement values for both extreme and missing values was most efficiently done using the parametric ARIMA(1,0,0)(0,1,1)7 model. However, it was also reported that the estimated replacements of the missing values showed considerable variation (Redfern et al. 1993). The study also mentioned concerns about Base Data Integrity.

A survey of practical solutions used by consultancies and local authorities in England (Redfern et al. 1993) reported that there were two broad categories of solutions: "by-eye" methods and computerized packages (Redfern et al. 1993). Most automated practical solutions to patching were based upon simple, moving, or exponentially weighted moving averages, or their variants. For example, the DOT in London employed an exponentially weighted moving average model to update missing values. The process involved validating new traffic count data against old data from the same site collected at the same time over the previous weeks. The following equation was used to estimate the missing or rejected value, x̂_t,s, at time t:

    x̂_t,s = Σ_{i=1}^{n} α (1 - α)^(i-1) x_{t-i,s}                                   (1)

where x_{t-1,s}, x_{t-2,s}, …, x_{t-n,s} represent the observations for that particular site and vehicle category at the same times for the 1, 2, …, n weeks before the current observation; α is a constant such that 0 < α < 1. A value of 0.7 was typically used for the parameter α.
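The calculation in equation (1) is straightforward to script. Below is a minimal Python sketch of the weighted-average infill estimate, assuming the same-hour volumes from the n preceding weeks are available, newest first; the function name, the sample volumes, and the default α = 0.7 are illustrative only.

```python
def ewma_estimate(prev_weeks, alpha=0.7):
    """Equation (1): weight the same-hour observation from i weeks ago
    by alpha * (1 - alpha) ** (i - 1), then sum the weighted terms."""
    return sum(alpha * (1 - alpha) ** i * x for i, x in enumerate(prev_weeks))

# same-hour volumes for the four preceding weeks, newest first (hypothetical)
print(ewma_estimate([520, 540, 510, 495]))
```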

            The Countywide Traffic and Accident Data Unit (TADU) used AutoCounts to validate collected data and infill missing values from automatic traffic counts (Ramsey and Hayden 1994). The agency needs monthly five- and seven-day flow averages for trend and yearly analysis within AutoCounts. Usually these statistics can be obtained directly from the validated data that have been flagged as typical. However, when there are no typical data the infill model is applied.  The model estimates weekly flows, and starts with a seasonal profile where all weeks are considered to be equal. Then, considering the data in ascending order of age by year, the profile is modified each year. As a starting point the previous year’s profile is calculated as follows:

    f_{w+1} = f_w × (FW_{w+1} / FW_w)                                                (2)

The model is applied on a week-to-week basis for w = 1 to 53, relating week w to week w + 1. Here FW_w is the actual weekly flow for week w; f_w is the estimated weekly seasonal factor for week w; f_42 (w = 42, mid-October) is always 1.0. The model is applied iteratively, either a maximum of 50 times or until no improvement in fit is achieved. The output of the process is a full 53-week flow profile for the year under consideration. No evaluations were made of the accuracy of such models (Ramsey and Hayden 1994).

REVIEW OF TECHNIQUES

 

This section provides a brief review of factor approaches, time series analysis, regression analysis, neural networks, and genetic algorithms used in the present study.

Factor Approaches

Factor approaches may be the most popular data imputation or prediction methods. They usually involve developing a set of factors from a historical data set and then applying these factors to new data to make predictions. For example, a set of hourly factors (HF), daily factors (DF), and monthly factors (MF) can be developed from permanent traffic count data. Traffic parameters, such as AADT and DHV, can then be predicted by applying these factors to short-period traffic counts (Garber and Hoel 1999). The virtue of such methods is their simplicity; however, the results are usually less accurate than those of more sophisticated models.
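As a simple illustration of the factor approach (not the procedure of any particular agency), a short count can be expanded to an AADT estimate by dividing out the relevant factors. The sketch below assumes the convention MF = MADT/AADT used later in this paper; the factor values and volume are hypothetical.

```python
# hypothetical factors developed from permanent traffic count data
monthly_factor = 1.15   # MF = MADT / AADT for the month of the count
daily_factor = 1.05     # DF = day-of-week ADT / MADT for the day of the count

observed_daily_volume = 8400   # a 24-hour short-period count
aadt_estimate = observed_daily_volume / (monthly_factor * daily_factor)
print(round(aadt_estimate))    # rough AADT implied by the factors above
```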

Time Series Analysis using ARIMA

A time series is a chronological sequence of observations on a particular variable. Time series data are often examined in hopes of discovering a historical pattern that can be exploited in the forecast. Time series modeling is based on the assumption that the historical values of a variable provide an indication of its value in the future (Box and Jenkins 1970).

            Many techniques are available for modeling a univariate time series, such as exponential smoothing, the Holt-Winters procedure, and the Box-Jenkins procedure. Basic exponential smoothing should only be used for non-seasonal time series showing little visible trend, but it may easily be generalized to deal with time series containing trend and seasonal variation; the resulting procedure is usually referred to as the Holt-Winters procedure. The Box-Jenkins procedure is the most popular tool for time series analysis. It builds an autoregressive integrated moving average (ARIMA) model using the Box-Jenkins methodology, in which both autoregressive and moving average components are considered. Such a model is called an integrated model because the stationary model that is fitted to the differenced data has to be summed, or integrated, to provide a model for the non-stationary data (Chatfield 1989). The general autoregressive integrated moving average process is of the form:

Given

    W_t = ∇^d X_t                                                                     (3)

    W_t = a_1 W_{t-1} + … + a_p W_{t-p} + Z_t + b_1 Z_{t-1} + … + b_q Z_{t-q}          (4)

where: X_t is a non-stationary process;

W_t is a stationary process;

∇ is the differencing operator;

Z_t is white noise;

a_1, …, a_p and b_1, …, b_q are constant coefficients; and

p, d, q are the orders of the autoregressive, differencing, and moving average components.

The above ARIMA process describing the dth differences of the data is said to be of order (p, d, q), usually referred to as ARIMA(p, d, q). An ARIMA model considering seasonality in the data is often represented by ARIMA(p, d, q)(P, D, Q)s,  where P, D, and Q are the order of seasonal autoregressive, differencing, and moving average components; s is a seasonal periodic component that repeats every s observations. 
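For readers who wish to experiment with such models, the following is a minimal sketch of fitting a seasonal ARIMA model and forecasting a block of missing hours with the statsmodels package. The synthetic data, the ARIMA(1,0,0)(0,1,1) specification with s = 24, and the one-week forecast horizon are illustrative assumptions, not the specifications identified later in this study.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# synthetic hourly volumes with a daily (s = 24) cycle, for illustration only
rng = np.random.default_rng(0)
hours = np.arange(8 * 7 * 24)                      # eight weeks of hourly data
y = 300 + 100 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 20, hours.size)

# an illustrative seasonal specification: ARIMA(1,0,0)(0,1,1) with s = 24
model = SARIMAX(y, order=(1, 0, 0), seasonal_order=(0, 1, 1, 24))
fit = model.fit(disp=False)

# estimate a one-week (168-hour) block of missing values following the observed data
missing_week = fit.forecast(steps=168)
print(missing_week[:5])
```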

Locally Weighted Regression Analysis

A variant of regression analysis called locally weighted regression was used in this study. Locally weighted regression is a form of instance-based (or memory-based) algorithm for learning continuous mappings from real-valued input vectors to real-valued output vectors. Local methods assign a weight to each training observation that regulates its influence on the training process. The weight depends upon the location of the training point in the input variable space relative to that of the point to be predicted; training observations closer to the prediction point generally receive higher weights (Friedman 1995). The locally weighted regression program used in this study can be downloaded from a web site (Locally 2001).

Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a parameterized model. After training, the model is used for predictions and the data are generally discarded. In contrast, “memory-based” methods are non-parametric approaches that explicitly retain the training data, and use it each time a prediction needs to be made. Locally weighted regression is a memory-based method that performs regression around a point of interest using only training data that are “local” to that point. One recent study demonstrated that locally weighted regression was suitable for real-time control by constructing a locally-weighted-regression-based system that learned a difficult juggling task (Schaal and Atkeson 1994).
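The following is a minimal sketch of the idea, assuming a Gaussian kernel over Euclidean distance and a weighted least-squares fit; it is illustrative only and is not the downloaded program used in this study.

```python
import numpy as np

def lwr_predict(X, y, x_query, bandwidth=1.0):
    """Locally weighted linear regression: fit a weighted least-squares
    line around x_query, giving nearby training points larger weights."""
    Xb = np.hstack([np.ones((len(X), 1)), X])            # add intercept column
    xq = np.hstack([1.0, x_query])
    d2 = np.sum((X - x_query) ** 2, axis=1)              # squared distances to the query
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))             # Gaussian kernel weights
    W = np.diag(w)
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y  # weighted least squares
    return xq @ beta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.0, 6.2, 7.9])
print(lwr_predict(X, y, np.array([2.5])))
```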

Time Delay Neural Networks

The neural networks used in this study consist of three layers: input, hidden, and output. The input layer receives data from the outside world. The input layer neurons send information to the hidden layer neurons. The hidden neurons are all the neurons between the input and output layers. They are part of the internal abstract pattern, which represents the neural network’s solution to the problem. The hidden layer neurons feed their output to the output layer neurons, which provide the neural network’s response to the input data.

The variant of neural network used in this study is called a time delay neural network (TDNN) (Hecht-Nielsen 1990). Figure 1 shows an example of a TDNN; such networks are particularly useful for time series analysis. The neurons in a given layer can receive delayed input from other neurons in the same layer. For example, the network in Figure 1 receives a single input from the external environment. The remaining nodes in the input layer get their input from the neuron on their left, delayed by one time interval. The input layer at any time will therefore hold a portion of the time series. Such delays can also be incorporated in other layers.

Neurons process input and produce output. Each neuron takes in the output from many other neurons, and its actual output is calculated using a transfer function. In this study, a sigmoid transfer function is chosen because it produces a continuous value in the range [0,1]. A neuron n_j in a given layer is connected to neurons (n_1, n_2, …, n_m) in the previous layer. The connection from n_i to n_j has the weight w_ij. The weights of the connections are initially assigned arbitrary values between 0 and 1; the appropriate weights are determined during the training phase. The input to n_j is obtained using the following equation:

    input_j = Σ_{i=1}^{m} w_ij × output_i                                             (5)

The output from n_j is calculated using the sigmoid transfer function as:

    output_j = 1 / (1 + e^(-input_j))                                                 (6)

It is necessary to train a neural network model on a set of examples called the training set so that it adapts to the system it is trying to simulate. Supervised learning is the most common form of adaptation. In supervised learning, the correct output for the output layer is known. Output neurons are told what the ideal response to input signals should be. In the training phase, the network constructs an internal representation that captures the regularities of the data in a distributed and generalized way. The network attempts to adjust the weights of connections between neurons to produce the desired output. The back-propagation method is used to adjust the weights, in which errors from the output are fed back through the network, altering weights as it goes, to prevent the repetition of the error.
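Below is a minimal sketch of the neuron computations in equations (5) and (6): a weighted sum of the previous layer's outputs followed by the sigmoid transfer function. The weight values are arbitrary illustrations, not trained weights.

```python
import numpy as np

def neuron_output(prev_outputs, weights):
    """Equations (5) and (6): weighted sum of the previous layer's
    outputs, passed through the sigmoid transfer function."""
    net_input = np.dot(weights, prev_outputs)   # equation (5)
    return 1.0 / (1.0 + np.exp(-net_input))     # equation (6)

prev_outputs = np.array([0.2, 0.7, 0.5])        # outputs of neurons n_1, n_2, n_3
weights = np.array([0.4, 0.3, 0.8])             # illustrative connection weights w_ij
print(neuron_output(prev_outputs, weights))
```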

Genetic Algorithms

The origin of genetic algorithms (GAs) is attributed to Holland’s work (Holland 1975) on cellular automata. There has been significant interest in GAs over the last two decades (Buckles and Petry 1994). The genetic algorithm is a model of machine learning, which derives its behavior from a metaphor of the processes of evolution in nature. This is done by the creation within a machine of a population of individuals represented by chromosomes, in essence a set of character strings that are analogous to the base-4 chromosomes in human DNA. The individuals in the population then go through a process of evolution.

In practice, the evolutionary model of computation can be implemented by having arrays of bits or characters represent the chromosomes (c_1, c_2, …, c_n), where c_i is called a gene. Simple bit manipulation operations allow the implementation of crossover, mutation, and other operations. When genetic algorithms are implemented, they are usually done in a manner that involves the following cycle: "Evaluate the fitness of all of the individuals in the population; create a new population by performing operations such as crossover, fitness-proportionate reproduction and mutation on the individuals whose fitness has just been measured; discard the old population and iterate using the new population."

The first generation (generation 0) of this process operates on a population of randomly generated individuals. From there on, the genetic operations, in concert with the fitness measure, operate to improve the population.
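A compact sketch of this evaluate/reproduce/replace cycle, using bit-string chromosomes, fitness-proportionate selection, single-point crossover, and bit-flip mutation, is given below. The population size, number of generations, and toy fitness function are illustrative; the actual parameter settings used in this study are described later.

```python
import random

def ga_cycle(fitness, chrom_len, pop_size=50, generations=100,
             crossover_rate=0.9, mutation_rate=0.01):
    """Plain GA cycle: evaluate fitness, reproduce in proportion to fitness
    with crossover and mutation, discard the old population, and iterate.
    The fitness function must return non-negative values."""
    pop = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = random.choices(pop, weights=scores, k=2)   # fitness-proportionate selection
            if random.random() < crossover_rate:                # single-point crossover
                cut = random.randrange(1, chrom_len)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                            # bit-flip mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# toy fitness: count of ones in the chromosome (plus one to keep weights positive)
print(ga_cycle(lambda c: sum(c) + 1, chrom_len=16))
```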

Genetic Algorithms for Designing Neural Networks

Many researchers have used GAs to determine neural network architectures. Harp et al. (1989) and Miller et al. (1989) used GAs to determine the best connections among network units. Montana and Davis (1989) used GAs for training neural networks. Chalmers (1991) developed learning rules for neural networks using GAs.

Hansen et al. (1999) used GAs to design time delay neural networks (TDNNs), including the determination of important features such as the number of inputs, the number of hidden layers, and the number of hidden neurons in each hidden layer. They applied their networks to model chemical process concentrations, chemical process temperatures, and Wolfer sunspot numbers. Their results clearly showed the advantages of TDNNs configured by GAs over other techniques, including the conventional autoregressive integrated moving average (ARIMA) methodology described by Box and Jenkins (1970).

Hansen et al.’s approach (1999) consisted of building neural networks based on the architectures indicated by the fittest chromosome. The objective of the evolution was to minimize the training error. Such an approach is computationally expensive. Another possibility that is used in this study is to choose the architecture of the input layer using genetic algorithms.

Lingras and Mountford (2001) proposed the maximization of linear correlation between the input variables and the output variable as the objective for selecting the connections between the input and hidden layers. Since such an optimization is not computationally feasible for large input layers, GAs were used to search for a near-optimal solution. It should be noted that, since the input layer holds a section of the time series, it is not possible to eliminate intermediate input neurons; they are necessary to preserve their time delay connections. However, it is possible to eliminate their feedforward connections. Lingras and Mountford (2001) achieved superior performance using GA-designed neural networks for the prediction of inter-city traffic. The present study uses the same objective function for the development of regression and neural network models, and the developed models were used to update missing values of traffic counts.

STUDY DATA

Currently, Alberta Transportation employs about 350 permanent traffic counters (PTCs) to monitor its highway network. The hierarchical grouping method proposed by Sharma and Werner (1981) was used to classify these PTCs into groups. The ratios of monthly average daily traffic (MADT) to annual average daily traffic (AADT) (known as the monthly factor, MF = MADT/AADT) were used to represent the highway sections monitored by these PTCs during the classification. After studying group patterns from 1996 to 2000, five groups seemed appropriate to represent the study data. These groups are labeled the commuter, regional commuter, rural long-distance, summer recreational, and winter recreational groups. Figure 2 shows the grouping results. It can be seen that the commuter group has a flat yearly pattern due to stable traffic flows across the year. The regional commuter and rural long-distance groups show higher peaks in the summer and lower troughs in the winter. The summer recreational group has the sharpest pattern and the highest peak in the summer; its largest monthly factor (in August) is about 6 times its smallest monthly factor (in January). The winter recreational group shows an interesting yearly pattern, with the peak occurring in the winter season (from December to March).

Six counts were selected from the various groups: two from the commuter group, two from the regional commuter group, one from the rural long-distance group, and one from the summer recreational group. Due to insufficient data in the winter recreational group, no counts were selected from that group. Table 1 shows the PTCs selected from the different groups, their functional classes, their AADT values, and the training and test data used in this study.

Figure 3 shows the daily patterns for these counts. The commuter group counts (C011145t and C002181t) have two peaks in a day: one in the morning and the other in the afternoon. The regional commuter count C022161t also has two peaks in a day, but they are smaller than those of the commuter counts. Even though C003061t was classified into the regional commuter group based on its yearly pattern, its daily pattern is very similar to that of the rural long-distance count C001025t. The daily patterns of both C003061t and C001025t have two very small peaks; however, the first peak occurs near noon instead of in the early morning. The recreational count C093001t has only one peak, occurring near noon; the majority of recreational travel took place within a few hours in the afternoon.

For each count, four or five years of data were used in the experiments, as shown in Table 1. Five years of data were used for the counts from the groups other than recreational. Since there were a large number of missing values in the 1999 data for C093001t, only four years of data were available for that count. There are no missing values in the experimental data. The data are in the form of hourly traffic volumes for both directions.

STUDY MODELS, RESULTS, AND DISCUSSION

STUDY MODELS

The models were trained and tested by assuming that a certain portion of the data was missing. Various models were applied to estimate missing values from six PTCs. This section gives a brief description of the models developed in this study.

Genetically Designed Regression and Neural Network Models

Two types of genetically designed models were developed in this study. The first type consisted of short-term prediction models, which used only the data before the failure as input. For this type of model, the one-week-long series of hourly volumes (7 × 24 = 168 values) before the first missing value was used as the candidate input set. The second type of model used data from both before and after the failure as input. For these models, a week of hourly volumes from each side of the occurrence of the missing value(s) was used as the candidate input set; in total, 168 × 2 = 336 hourly volumes were presented to the GAs for selecting the 24 final inputs.

Genetically designed regression and neural network models were applied to estimate missing values from traffic counts. If only one hourly volume was missing, the models were applied once to update that missing value. If more than one successive hourly volume was missing, the models were applied recursively to estimate the missing values. Figure 4 shows the prototype of the models used in this study.

First, assuming there were one or more successive missing values in a traffic count, the candidate inputs were presented to the GAs for selecting the 24 final input variables. These 24 hourly volumes were chosen because they had the maximum correlation with the traffic volume of the next hour among the combinations of 24 variables examined from the candidate inputs; the next hour here is the hour whose volume is to be predicted from the GA-selected inputs. The GA-selected variables were used to train the neural network and regression models for traffic prediction of the next hour. The trained neural network or regression model was then used to estimate the missing traffic volume of the first hour, P1. If there was more than one successive missing value, the same technique was used to predict the second missing value, P2. However, at this stage, the candidate pattern presented to the GAs for selecting the final inputs included the estimated volume of the first hour, P1, as shown in Figure 4. P1 may or may not be chosen as a final input, because different hourly models have different input selection schemes. Figure 5 shows a TDNN model with inputs selected from a week-long hourly-volume time series; the corresponding regression model used the same inputs for prediction.
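A minimal sketch of this recursive updating scheme is given below, assuming short-term (history-only) models: a trained one-step model and its GA-selected set of lags for each hour of the day. The helper name, the per-hour model table, and the scikit-learn-style predict call are assumptions for illustration.

```python
def update_missing_block(series, first_missing, n_missing, hourly_models):
    """Recursively estimate successive missing hourly volumes (Figure 4).
    Each estimate is written back into the series so that later
    predictions can draw on it as part of their candidate inputs."""
    filled = list(series)
    for k in range(n_missing):
        t = first_missing + k
        model, selected_lags = hourly_models[t % 24]          # per-hour model and its GA-selected lags
        inputs = [filled[t - lag] for lag in selected_lags]   # lags are hours before t, within one week
        filled[t] = model.predict([inputs])[0]                # scikit-learn-style one-step prediction
    return filled
```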

A top-down model design (Zhong et al. 2002) was used to search for models with reasonable accuracy. First, 24-hour universal models were established to test their ability; they were then further split into 12-hour universal models, single-hour models, seasonal single-hour models, day-hour models, and seasonal day-hour models. The characteristics of these models are as follows:

1.      24-hour universal models: This approach involved a single TDNN and a single regression model for all the hours across the year. The advantage of such models is the simplicity of implementation during the operational phase.

2.      12-hour universal models: In order to improve models’ accuracy, 12-hour universal models were built based on 12-hour (from 8:00 a.m. to 8:00 p.m.) observations. In other words, the observations from late evenings and early mornings were eliminated from the models.

3.      Single-hour models: 12-hour universal models were further split into 12 single-hour models. That is, every hour had a separate model.

4.      Seasonal single-hour models: Seasons have a definite impact on travel, so further dividing the single-hour models into seasonal models may improve their accuracy. Yearly single-hour models were further split into May-August single-hour models and July-August single-hour models.

5.      Day-hour models: Travel patterns also vary by the day of the week. Further classification of observations into groups for different days (e.g., Wednesday or Saturday) may improve models’ accuracy.

6.      Seasonal day-hour models: Day-hour models were further split into seasonal day-hour models (e.g., July-August day-hour models) to explore models with higher accuracy.

The model refinement studied by Zhong et al. (2002) was further extended in this study. Genetic algorithms were used to identify the 24 variables from the input time series that have the highest linear correlation with the output variable. Depending upon their position in the input time series, the candidate input variables were labeled from 1 to 168 for models that used historical data, or from 1 to 336 for models based on data from before and after the failure. Each gene was allowed to take a value from 1 to 168 (or 1 to 336), and each chromosome had 24 genes. The chromosomes with higher values of linear correlation were selected for creating the next generation. The population size was set to 110, and the genetic algorithms were allowed to evolve for 1000 generations. The crossover rate was set at 90%, and the probability of mutation was set to 1%. The best chromosome found over the 1000 generations was used as the final solution of the search. The connections selected by the genetic algorithms were used to design and implement the neural network and regression models. That is, the coefficients or weights of the 24 selected input variables were nonzero, and the coefficients or weights of all the other input variables in the time series were set to zero in the regression or neural network models.
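One plausible form of the fitness function for this search is sketched below: each gene indexes a column of the candidate input matrix, and the chromosome's fitness is the summed absolute Pearson correlation between the selected inputs and the hour to be predicted. This is an illustrative reading of the correlation objective (the exact form used by Lingras and Mountford (2001) may differ), and it would plug into a GA cycle like the earlier sketch.

```python
import numpy as np

def correlation_fitness(chromosome, candidates, target):
    """Fitness of a chromosome whose genes are 1-based positions in the
    candidate input series: the summed absolute linear correlation
    between each selected input column and the output variable."""
    total = 0.0
    for gene in chromosome:
        x = candidates[:, gene - 1]                    # training observations for this candidate input
        total += abs(np.corrcoef(x, target)[0, 1])     # Pearson correlation with the next-hour volume
    return total
```

The genes of the fittest chromosome then identify the 24 inputs whose coefficients or weights are left nonzero, as described above.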

Factor Models

Various factor models were developed for the six experimental counts. These models were:

1.      Average-history model: Historical data for the same hour from the training set were averaged to update missing values in the test set. It is assumed that there is no change in traffic volume from year to year. The replacement values for missing data were calculated as:

    Replacement value = (1 / N) × Σ_{i=1}^{N} Value_i                                 (7)

where: Value_i is the hourly volume of the same hour from year i of the training set, and N is the number of years in the training set.

2.      Factor-history models: This model adds a growth factor to equation (7). The replacement values were calculated as a weighted average of the historical data, with growth factors used as the weights. The growth factors (GF) were calculated as the ratio of the AADT of the test year to the AADTs of the years in the training set. The replacement values for missing data were calculated as:

    GF_i = AADT_test / AADT_i_training                                                (8)

    Replacement value = (1 / N) × Σ_{i=1}^{N} GF_i × Value_i                          (9)

where: AADT_i_training is the AADT of year i in the training set;

AADT_test is the AADT of the test year;

N is the number of years in the training set.

3.      Single-hour monthly factor models: Monthly factors from the training set were used in the test set to calculate the replacement values. The monthly factors of the failure month and of the months before and after it were used in the calculations as:

    Replacement value = (1/2) [ (mf_i / mf_{i-1}) × Value_{i-1} + (mf_i / mf_{i+1}) × Value_{i+1} ]        (10)

where: mf_i is the average monthly factor of the failure month calculated from the training set;

mf_{i-1} and mf_{i+1} are the average monthly factors of the months before and after the failure month, respectively, also calculated from the training set;

Value_{i-1} and Value_{i+1} are the hourly volumes of the same hour in the months before and after the failure month, respectively.

4.      Day-hour monthly factor models: The previous model does not incorporate the day-of-week effect. The day-hour monthly factor models used the same equation as the single-hour monthly factor models; however, Value_{i-1} and Value_{i+1} were hourly volumes from the same hour and the same day of the week (see the sketch following this list).
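Below is a minimal sketch of the monthly factor estimate, assuming equation (10) takes the averaged-ratio form reconstructed above; the factor and volume values are hypothetical.

```python
def monthly_factor_estimate(mf_fail, mf_before, mf_after, value_before, value_after):
    """Equation (10): scale the same-hour volumes from the adjacent months
    by the ratio of monthly factors, then average the two estimates.  For
    the day-hour variant, the volumes also come from the same day of week."""
    return 0.5 * (mf_fail / mf_before * value_before +
                  mf_fail / mf_after * value_after)

# hypothetical monthly factors (MF = MADT / AADT) and same-hour volumes
print(monthly_factor_estimate(1.20, 1.10, 1.25, 640, 705))
```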

Autoregressive Integrated Moving Average (ARIMA) Models

Two types of ARIMA models were developed in this study. The characteristics of these models were as follows:

1.      Seasonal ARIMA models for updating one-week-long blocks of missing values: the previous 8 weeks of data were used to estimate the 168 missing hourly volumes of the 9th week.

2.      July-August day-hour ARIMA models for updating 12 successive missing hourly values: data from the previous 8 occurrences of the same day of the week were used to predict the 12 missing hourly volumes (from 8:00 a.m. to 8:00 p.m.) of the 9th such day.

All the models were trained and tested. Depending on the model, the number of patterns or observations varied. The absolute percentage error was calculated as:

    Absolute percentage error = |Estimated value - Actual value| / Actual value × 100%            (11)

The key evaluation parameters were the average absolute percentage error and the 50th, 85th, and 95th percentile absolute errors. These statistics give a clear profile of a model's error distribution.
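These summary statistics are easy to compute once the estimated and actual volumes are paired up, as the short sketch below illustrates; the sample values are made up.

```python
import numpy as np

def error_profile(estimated, actual):
    """Absolute percentage errors (equation 11) and the summary statistics
    used to evaluate the models in this study."""
    est, act = np.asarray(estimated, float), np.asarray(actual, float)
    ape = np.abs(est - act) / act * 100.0
    return {"average": ape.mean(),
            "50th": np.percentile(ape, 50),
            "85th": np.percentile(ape, 85),
            "95th": np.percentile(ape, 95)}

print(error_profile([610, 595, 640], [600, 600, 630]))   # hypothetical volumes
```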

RESULTS AND DISCUSSION

The various models described in the previous section were tested on the data from the six PTCs. This paper presents only some of the key results for illustration. For example, the results from the initial models with low accuracy are summarized for only one count, while the results from the refined models are presented for all six counts.

            The present study uses the best refinement suggested by Zhong et al. (2002) to develop genetically designed models based on data from before and after failure. Factor models were used as benchmark models. ARIMA models were also developed for comparison purposes. The results for the commuter count – C002181t are used here for the illustration.

Factor models were used as benchmark models, since they typically represent current practice. These models included average-history models, factor-history models, single-hour monthly factor models, and day-hour monthly factor models. Both yearly and seasonal factor models were developed. Table 2 shows the errors of the four factor models for updating missing values in July and August 2000 for C002181t. Average errors for the average-history models are usually more than 10%. For the factor-history and single-hour monthly factor models, the average errors are usually between 5% and 10%. Most average errors for the day-hour monthly factor models are less than 7%, and nearly half of them are below 5%. Most 95th percentile errors for the average-history models are more than 20%; for the factor-history models, half of the 95th percentile errors are more than 20%; and most 95th percentile errors for the single-hour monthly factor models are less than 20%. The day-hour monthly factor models have the most accurate results, with most 95th percentile errors less than 12%.

Time series analysis models were also developed for comparison. Two types of autoregressive integrated moving average (ARIMA) models were developed. The first type was a seasonal ARIMA model, which was used to update one-week-long blocks of missing values using the previous two months of data. Figure 6 shows how the Winters' multiplicative ARIMA model updates a one-week-long block of missing values for C002181t in July 2000. Average errors for this ARIMA model are usually between 10% and 15%, and the 95th percentile errors are less than 30% for 7 out of 24 hours.

July-August day-hour ARIMA models, which update 12-hour blocks of missing values on the 9th day using the previous 8 days of data, were also developed. Figure 7 shows how an ARIMA(0,1,1)(0,1,1) model updates missing values on the 9th Wednesday in July and August 2000 for C002181t. Average errors are usually between 3% and 5%, and the 95th percentile errors are usually between 7% and 10%.

A set of genetically designed models was developed. For the 24-hour universal models and 12-hour universal models, the GAs used the same input selection scheme for all hours, so there was only one set of weights or coefficients for the neural network and regression models. Universal neural network and regression models respond to all input patterns with the same computation strategy, which led to high prediction errors. For example, for the 24-hour universal neural network model, the highest 95th percentile error was 171.58%, and even the lowest 95th percentile error was around 56%. In most cases, the regression model performed better than the neural network model; however, nearly all of the 95th percentile prediction errors were higher than 25% for the 24-hour universal regression model.

Training patterns were further classified into more homogeneous groups for individual hours. Most average errors for single-hour regression and neural network models ranged from 5% to 8%. The 95th percentile errors for regression and neural network models were usually between 15% and 20%. The maximum errors usually occurred in the early morning and the minimum errors usually occurred in the afternoon.

The experience with single-hour models indicated that the observations for the same hour vary substantially over a year. Based on this observation, single-hour models were further split into seasonal single-hour models, in which only the observations from the same hours in a given season (e.g., July and August) were used to develop the models. As expected, the July-August single-hour models outperformed the yearly single-hour models. Average errors for the July-August single-hour models were usually between 3% and 6%. The 50th percentile errors for the regression and neural network models were usually below 5%, and all of the 95th percentile errors for the July-August single-hour models were lower than those of the yearly single-hour models.

Since commuter travel occurs mostly on weekdays and recreational travel mostly on weekends, the day of the week may also have a significant impact on trip patterns for certain roads. In order to further improve the models' performance, day-hour models were developed for each type of road. The observations from the same hour on the same day of the week (e.g., 7:00-8:00 a.m. on all Wednesdays) in a year were used to develop the models. Average errors for the regression day-hour models were below 6%, and most average errors for the neural network day-hour models were below 10%. The 95th percentile errors for the day-hour models were lower than those of the July-August single-hour models for only a few hours. However, the dramatic error decrease for the first single-hour model (7:00-8:00 a.m.) indicated that considering the day of the week in developing models would improve accuracy.

The previous experiments indicated that both seasons and days of the week have a large impact on travel patterns, and seasonal day-hour models resulted in even better performance. Table 3 shows the errors of the July-August day-hour models from Zhong et al.'s study (2002). The models were used to update 12 successive missing values on Wednesdays for C002181t. The average and 50th percentile errors for the regression models were usually below 1%; even the 95th percentile errors were lower than 4%, and most of them were less than 2%. The genetically designed day-hour neural network models produced less accurate results than the regression models: the average and 50th percentile errors ranged from 2% to 8%, and the 95th percentile errors for most hours were lower than 11%.

It would be interesting to compare the performance of Zhong et al.'s (2002) short-term prediction models with models based on data from before and after the failure (hereafter referred to as both-side models). The candidate input domain of the both-side models was extended to two weeks of hourly volumes; that is, one week of hourly volumes before the failure and one week of hourly volumes after the failure were used as the candidate input set. In total, there were 2 × 7 × 24 = 336 variables in the candidate input set. Both-side July-August day-hour models were developed for updating 12 successive missing values between 8:00 a.m. and 8:00 p.m.

As expected, both-side seasonal day-hour models had better performance than short-term prediction models. Tables 4-9 show the errors of both-side July-August day-hour models for updating 12 successive missing values on the Wednesdays for six traffic counts. For C002181t, average errors for regression models are less than 1%, and those for neural network models are between 3% and 7%, as shown in Table 4. All the 95th percentile errors for regression models are less than 2%. The 95th percentile errors for neural network models are less than 10%.

Table 5 shows the errors of the GA-designed both-side July-August day-hour models for updating 12 successive missing values for C011145t. The average and 50th percentile errors for the regression models are usually below 3%; the 95th percentile errors are lower than 8%, and most of them are less than 5%. The GA-designed day-hour neural network models produced less accurate results: the average errors range from 4% to 15%, and the 95th percentile errors for most hours are more than 10%. For the two regional commuter counts, C003061t and C022161t, the regression models outperformed the neural network models for all hours, as shown in Tables 6 and 7. The average errors for the regression models usually range from 1% to 3%, and most 95th percentile errors are lower than 4%. For the neural network models, most average errors are between 3% and 8%, and most 95th percentile errors range from 6% to 14%.

Model’s performance deteriorates as traffic patterns become unstable. For example, the models for the commuter count tend to have lower errors than those for the recreational count. However, both-side July-August day-hour regression models still performed reasonably well for rural long-distance count C001025t, as shown in Table 8. The average errors and the 50th percentile errors for regression models are all below 2%. All the 95th percentile errors for regression models are lower than 4%. The average errors for both-side July-August day-hour neural network models range from 3% to 6%. Most 95th percentile errors for neural network models are below 12%.

The GA-designed both-side July-August day-hour regression models also performed well for updating missing values from the recreational count, C093001t. As shown in Table 9, the average errors for the regression models are below 5%, and the 95th percentile errors are lower than 9% for all hours. The neural network models were less accurate: their average errors are usually below 13%, and their 95th percentile errors usually range from 10% to 30%.

The results clearly show that data from before and after failure are better predictors than historical data alone. For example, the 95th percentile error for estimating 6:00-7:00 p.m. traffic volume in July and August for C011145t was 14.18% based on historical data. Using data from before and after failure, the 95th percentile error was reduced to 4.10%.

Table 10 compares the performance of the July-August day-hour models based on data from before and after the failure with the corresponding factor and ARIMA models for C002181t. It can be seen that the genetically designed regression models have the best performance, the ARIMA models perform slightly better than the neural network models, and the neural network models are superior to the factor models. The benchmark factor models have average errors of 3% to 7%, with 95th percentile errors around 8% to 12%. The ARIMA models have average errors between 3% and 5%, and most of their 95th percentile errors range from 6% to 10%. The genetically designed neural network models performed reasonably well, with average errors from 4% to 7% and all 95th percentile errors lower than 10%. The average errors for the regression models are less than 1%, and even their 95th percentile errors are below 2%.

CONCLUDING REMARKS

The principle of Base Data Integrity suggests that imputed data should not be mixed with base data (America 1992). However, this does not imply that traffic measurements cannot be imputed at the analysis stage. Analysis of data from the Canadian province of Alberta and the state of Minnesota shows that a large number of traffic counts have missing values. It would be difficult to eliminate these counts from traffic analysis, and under some circumstances imputation may have to be used for further analysis.

The literature review indicated that simple factor or moving average regression models were used by highway agencies for estimating missing values (New Mexico 1990; FHWA 1997). Previous research (Clark 1992; Redfern et al. 1993; Watson et al. 1993) on transport-related time series mainly focused on detecting missing values or outliers. Predictions of missing values were tested only on small pieces of time series, and a comprehensive statistical evaluation is not available.

In this study, factor models, autoregressive integrated moving average (ARIMA) models, and genetically designed regression and neural network models were used to estimate missing values from six PTCs in Alberta, Canada. The performance of the models was evaluated with absolute percentage errors. Factor models were used as benchmark models. It was found that seasonal day-hour models had the best performance for each type of model. July-August day-hour ARIMA models performed slightly better than both-side July-August day-hour neural network models, and both seasonal day-hour ARIMA and seasonal day-hour neural network models performed better than the corresponding benchmark factor models. For instance, when seasonal day-hour neural network, ARIMA, and factor models were used to update missing values for C002181t (Table 10), average errors for the ARIMA and neural network models are around 4% to 5%, whereas average errors for the factor models are usually between 4% and 7%. Most 95th percentile errors for the ARIMA and neural network models are less than 10%, whereas 6 out of 12 of the 95th percentile errors for the factor models are more than 10%. The GA-designed both-side seasonal day-hour regression models give the most accurate predictions among the different types of models. For example, when July-August day-hour regression models based on data from before and after the failure were used to update missing values for C002181t, the average errors are below 0.8%, and the 95th percentile errors are lower than 1.5%.

Road’s type and functional class clearly have the influences on model’s performances. Analysis indicates that roads belonging to different pattern groups and functional classes have different short-term traffic patterns. Roads from higher functional classes usually have more stable short-term traffic patterns. Models usually performed better for the counts belonging to road groups with stable patterns and higher functional classes. For example, when using both-side July-August day-hour regression models to update missing values for counts with stable patterns, the average errors are usually below 2%, and the 95th percentile errors are lower than 4%. However, for counts with relatively unstable patterns (e.g., recreational count), most average errors are below 4%, and the 95th percentile errors are lower than 9%.

The small estimation errors from sophisticated models developed in this study reflect the models’ stability and suitability for updating missing values. It is believed that these models would be helpful for highway agencies in their traffic data programs.

The results presented in this study may have additional applications. Watson et al. (1993) described methods for detecting outliers; the models presented in this study could be used to impute more reasonable values for such outliers in traffic data.

The missing value estimation models proposed in this study provide highly accurate results. It would be interesting to find out the effects of these imputations on the estimation of important traffic parameters, such as AADT and DHV.

ACKNOWLEDGMENTS

The authors are grateful to NSERC, Canada, for its financial support. The authors would also like to thank Alberta Transportation and the Minnesota Department of Transportation for the data used in this study.


REFERENCES

Ahmed, M.S. and Cook, A.R., 1979. Analysis of Freeway Traffic Time-Series Data by Using Box-Jenkins Techniques. Transportation Research Record 722, Transportation Research Board, Washington D.C., pp. 1-9.

Albright, D., 1991a. History of Estimating and Evaluating Annual Traffic Volume Statistics. Transportation Research Record 1305, Transportation Research Board, Washington, D.C, pp. 103-107.

Albright, D., 1991b. An Imperative for, and Current Progress toward, National Traffic Monitoring Standards. ITE Journal, Vol. 61, No. 6, pp. 23-26.

American Association of State Highway and Transportation Officials, 1992. Guidelines for Traffic Data Programs. Washington, D.C.

American Society for Testing and Materials, 1991. Standard Practice E1442, Highway Traffic Monitoring Standards. Philadelphia, PA.

Beveridge, S., 1992. Least Squares Estimation of Missing Values in Time Series. Communications in Statistics: Theory and Methods, 21, no. 12.

Bole, V., Cepar, D. and Radalj, Z., 1990. Estimating Missing Values in Time Series. Methods of Operations Research, no. 62.

Box, G. and Jenkins, J., 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.

Buckles, B.P. and Petry, F.E., 1994. Genetic Algorithms. IEEE Computer Press, Los
Alamitos, California.

Chalmers, D., 1991. The evolution of learning: An experiment in genetic connectionism. In: D. Touretzky et al. (Eds.), Connectionist Models: Proceedings of the 1990 Summer School. San Mateo, Morgan Kaufmann.

Chatfield, C., 1989. The Analysis of Time Series: An Introduction. Fourth Edition, Chapman and Hall, New York.

Clark, S.D., 1992. Application of Outlier Detection and Missing Value Replacement Techniques to Various Forms of Traffic Count Data. ITS Working Paper 384, University of Leeds, the United Kingdom.

FHWA’s Scanning Program, 1997. FHWA Study Tour for European Traffic Monitoring Programs and Technologies. Federal Highway Administration, U.S. Department of Transportation, Washington, D.C.

Friedman, J.H., 1995. Intelligent Local Learning For Prediction in High Dimensions. International Conference on Artificial Neural Networks, Paris, France.

Garber, N.J. and Hoel, L.A., 1999. Traffic and Highway Engineering. Second Edition, Brooks/Cole Publishing Company.

Gupta, A. and Lam, M. S., 1996. Estimating Missing Values using Neural Networks. Journal of the Operational Research Society, 47, no. 2, pp.229-239.

Hansen, J.V., McDonald, J.B. and Nelson, R.D., 1999. Time Series Prediction with Genetic Algorithm Designed Neural Networks: An Experimental Comparison with Modern Statistical Models. Computational Intelligence, 15(3), pp. 171-184.

Harp, S., Samad, T. and Guha, A., 1989. Towards the Genetic Synthesis of neural networks. In: D. Shaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms. San Mateo, Morgan Kaufmann.

Hecht-Nielsen, R., 1990. Neurocomputing. Addison-Wesley Pub. Co, Don Mills, Ontario.

Holland, J.H., 1975. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.

Lingras, P.J. and Mountford, P., 2001. Time Delay Neural Networks Designed Using Genetic Algorithms for Short Term Inter-City Traffic Forecasting. Proceedings of the Fourteenth International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems. Budapest, Hungary, pp. 292-299.

Little, R.A. and Rubin, D.B., 1987. Statistical Analysis with Missing Data. Wiley, New York.

Locally Weighted Polynomial Regression, 2001. http://www.autonlab.org/lwpr_src/lwpr.html#1.

Mendenhall, W. and Sincich, T., 1995. Statistics for Engineering and Science. Fourth Edition, Prentice-Hall, Inc.

Miller, G., Todd, P. and Hedge, S., 1989. Designing neural networks using genetic algorithms. In: D. Shaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms. San Mateo, Morgan Kaufmann.

Montana, D. and Davis, L., 1989. Training feedforward networks using genetic algorithms. In: N. Sridhara (Ed.), Proceedings of Eleventh International Joint Conference on Artificial Intelligence, pp. 762-767.

New Mexico State Highway and Transportation Department, 1990. 1990 Survey of Traffic Monitoring Practices among State Transportation Agencies of the United States. Report No. FHWA-HRP-NM-90-05. Santa Fe, New Mexico.

Nihan, N.L. and Holmesland, K.O., 1980. Use of the Box and Jenkins Time Series Technique in Traffic Forecasting. Transportation 9, pp. 125-143.

Ramsey, B. and Hayden, G., 1994. AutoCounts: A Way to Analyse Automatic Traffic Count Data. Traffic Engineering and Control.

Ratner, B., 2000. CHAID as a Method for Filling in Missing Values. http://www.dmstat1.com/missing.html (to appear in Journal of Targeting, Measurement and Analysis for Marketing).

Redfern, E.J., Watson, S.M., Tight, M.R., and Clark, S.D., 1993. A Comparative Assessment of Current and New Techniques for Detecting Outliers and Estimating Missing Values in Transport Related Time Series Data. Proceedings of Highways and Planning Summer Annual Meeting, Institute of Science and Technology, University of Manchester, England.

Schaal, S. and Atkeson, C., 1994. Robot juggling: An implementation of memory-based learning. Control Systems, 14, pp. 57-71.

Sharma, S.C., 1983. Minimizing Cost of Manual Traffic Counts: Canadian Example. Transportation Research Record 905, Transportation Research Board, Washington, D.C., pp. 1-7.

Sharma, S.C., Kilburn, P. and Wu, Y.Q., 1996. The precision of AADT Volumes Estimates from Seasonal Traffic Counts: Alberta Example. Canadian Journal of Civil Engineering, Vol. 23, No. 1.

Sharma, S.C. and Werner, A., 1981. Improved Method of Grouping Provincewide Permanent Traffic Counters. Transportation Research Record 815, Transportation Research Board, Washington, D.C., pp. 12-18.

Singh, V.P. and Harmancioglu, N.B., 1996. Estimation of Missing Values with Use of Entropy. NATO Advanced Research Workshop, Izmir, Turkey, pp. 267-274.

Southworth, F., Chin, S. M. and Cheng, P. D., 1989. A Telemetric Monitoring And Analysis System for Use during Large Scale Population Evacuations. Proceedings of IEEE 2nd International Conference on Road Traffic Monitoring, London, U.K.

Watson, S.M., Clark, S.D., Redfern, E.J. and Tight, M.R., 1993. Outlier Detection and Missing Value Estimation in Time Series Traffic Count Data. Proceedings of the Sixth World Conference on Transportation Research, Vol. II, pp. 1151-1162.

Williams, B.M., Durvasula, P.K. and Brown, D.E., 1998. Urban Freeway Traffic Flow Prediction: Application of Seasonal Autoregressive Integrated Moving Average And Exponential Smoothing Models. Transportation Research Record 1644, Transportation Research Board, Washington, D.C., pp.132-141.

Wright, P.M., 1993. Filling in the Blanks: Multiple Imputation for Replacing Missing Values in Survey Data. SAS 18th Annual Conference, New York, NY.

Zhong, M., Lingras, P.J., and Sharma, S.C., 2002. Applying Short-term Traffic Prediction Models for Updating Missing Values of Traffic Counts, submitted to the Journal of Transportation Engineering, American Society of Civil Engineers.

List of Figures

Figure 1. Time Delay Neural Network Design

Figure 2. Hierarchical Grouping of Alberta Highway Sections

Figure 3. Daily Patterns of Six Study Counts

Figure 4. The Prototype of Updating Missing Values Model

Figure 5. TDNN Model Used for Prediction

Figure 6. Updating One-week Long Missing Values in July for C002181t with Winter’s Multiplicative ARIMA Model

Figure 7. Using Prior 8 Wednesdays Data to Update 12-hour Missing Values for 9th Day in July and August for C002181t with ARIMA(0, 1, 1)(0, 1, 1) Model

Figure 1. Time Delay Neural Network Design

Figure 2. Hierarchical Grouping of Alberta Highway Sections

Figure 3. Daily Patterns of Six Study Counts

Figure 4. The Prototype of Updating Missing Values Model

Figure 5. TDNN Model Used for Prediction

Figure 6. Updating One-week Long Missing Values in July for C002181t with Winter’s Multiplicative ARIMA Model

Figure 7. Using Prior 8 Wednesdays Data to Update 12-hour Missing Values for 9th Day in July and August for C002181t with ARIMA(0, 1, 1)(0, 1, 1) Model

List of Tables

 

Table 1. Experimental Data from Different Groups

Table 2. Comparing Four Factor Models for Updating Missing Values in July and August for C002181t

Table 3. Errors for Updating 12 Successive Missing Values with July-August Day-hour Models for C002181t

Table 4. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C002181t

Table 5. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C011145t

Table 6. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C003061t

Table 7. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C022161t

Table 8. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C001025t

Table 9. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C093001t

Table 10. Comparing July-August Day-hour Models based on Data from before and after Failure with Corresponding Factor and ARIMA Models for C002181t

Table 1. Experimental Data from Different Groups

Road Class            Counter Name   AADT    Functional Class     Training Set   Testing Set
Commuter              C011145t        4042   Minor Collector      1996 – 1999    2000
                      C002181t       41575   Principal Arterial   1996 – 1999    2000
Regional Commuter     C003061t        3580   Minor Collector      1996 – 1999    2000
                      C022161t        3905   Major Collector      1996 – 1999    2000
Rural Long-distance   C001025t       13627   Minor Arterial       1996 – 1999    2000
Recreation            C093001t        2002   Major Collector      1996 – 1998    2000

Table 2. Comparing Four Factor Models for Updating Missing Values in July and August for C002181t

Hour  |                                                  Prediction Errors
(1)   |           Average            |            50th %            |            85th %            |            95th %
      | A-H1   F-H2   S-H3   D-H4    | A-H    F-H    S-H    D-H     | A-H    F-H    S-H    D-H     | A-H    F-H    S-H    D-H
      | (2)    (3)    (4)    (5)     | (6)    (7)    (8)    (9)     | (10)   (11)   (12)   (13)    | (14)   (15)   (16)   (17)
07-08 | 17.41  13.26  12.86  6.99    | 9.91   2.79   7.81   8.01    | 13.28  7.70   12.91  9.45    | 45.28  42.06  24.47  11.58
08-09 | 12.84  8.62   10.04  7.30    | 9.42   2.95   6.66   6.54    | 13.31  9.56   13.53  11.79   | 36.04  30.75  19.91  15.39
09-10 | 11.77  6.27   6.26   5.44    | 10.63  3.62   4.81   5.12    | 15.93  8.52   10.11  8.73    | 26.32  23.16  21.96  9.18
10-11 | 10.97  4.43   5.19   4.18    | 11.27  3.57   4.44   3.97    | 15.88  8.49   10.21  6.67    | 18.88  11.72  13.96  9.17
11-12 | 10.88  4.65   5.68   7.00    | 10.62  3.45   4.83   6.71    | 16.27  8.11   10.58  13.01   | 19.34  13.04  13.84  13.75
12-13 | 11.66  5.32   5.58   6.62    | 11.73  3.18   4.84   7.29    | 16.02  9.30   9.39   10.35   | 20.99  14.38  13.97  11.01
13-14 | 10.97  5.97   5.44   5.74    | 10.45  3.21   3.74   5.18    | 16.02  11.33  10.70  9.64    | 20.95  20.25  14.71  11.09
14-15 | 11.68  6.66   5.22   2.48    | 11.10  3.95   3.71   1.96    | 17.06  9.79   8.89   4.38    | 22.91  25.92  15.67  5.96
15-16 | 12.20  6.12   5.35   3.65    | 11.24  3.81   3.96   2.90    | 18.05  10.58  10.44  5.38    | 20.35  21.93  13.21  8.75
16-17 | 10.64  4.43   4.95   3.45    | 10.61  3.21   3.81   2.10    | 14.92  8.29   8.38   6.91    | 19.23  10.51  11.38  7.48
17-18 | 9.79   4.25   4.65   3.04    | 9.95   2.41   3.66   2.58    | 12.70  8.96   8.67   5.21    | 16.12  11.62  12.58  8.02
18-19 | 10.23  6.65   5.99   6.10    | 8.84   4.40   4.70   5.96    | 15.15  10.94  11.78  9.68    | 20.49  19.09  17.08  11.17

1 A-H: Average-History Model   2 F-H: Factor-History Model   3 S-H: Single-Hour Monthly Factor Model   4 D-H: Day-hour Monthly Factor Model

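Note: Tables 2 through 10 summarize prediction errors (in percent) by their average and their 50th, 85th, and 95th percentiles. Assuming these are absolute percentage errors between the observed hourly volumes and the model estimates, the minimal sketch below (Python with NumPy) shows how such a summary can be computed. It is an illustration under that assumption, not the code used in the study, and the sample volumes are hypothetical.

    import numpy as np

    def error_summary(observed, imputed):
        """Summarize absolute percentage errors between observed and imputed
        hourly volumes with the four statistics reported in the tables."""
        observed = np.asarray(observed, dtype=float)
        imputed = np.asarray(imputed, dtype=float)
        errors = np.abs(imputed - observed) / observed * 100.0
        return {
            "Average": errors.mean(),
            "50th %": np.percentile(errors, 50),
            "85th %": np.percentile(errors, 85),
            "95th %": np.percentile(errors, 95),
        }

    # Hypothetical observed and imputed volumes for one hour of the day
    # across several test days (not data from the study counts).
    observed = [1520, 1478, 1503, 1490, 1512]
    imputed = [1498, 1481, 1525, 1470, 1530]
    print(error_summary(observed, imputed))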

Table 3. Errors for Updating 12 Successive Missing Values with July-August Day-hour Models for C002181t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 0.25    2.14  | 0.21    1.90  | 0.39    3.79  | 0.57    4.43
08-09 | 0.59    2.19  | 0.48    2.38  | 0.75    3.16  | 1.19    4.28
09-10 | 1.66    3.03  | 1.14    3.58  | 3.19    4.25  | 3.47    6.06
10-11 | 0.38    6.71  | 0.31    6.50  | 0.43    7.33  | 1.04    10.48
11-12 | 0.56    2.13  | 0.50    1.90  | 0.74    3.18  | 1.03    4.17
12-13 | 0.81    2.90  | 0.98    3.26  | 1.25    4.54  | 1.73    5.67
13-14 | 0.59    2.23  | 0.56    2.39  | 1.02    3.38  | 1.07    4.86
14-15 | 0.89    4.69  | 0.50    5.54  | 1.35    6.63  | 2.55    6.98
15-16 | 0.68    6.63  | 0.88    6.99  | 0.97    10.12 | 1.09    10.74
16-17 | 0.44    7.82  | 0.29    7.93  | 0.64    8.97  | 1.25    11.82
17-18 | 0.31    6.45  | 0.31    6.68  | 0.42    8.01  | 0.68    10.34
18-19 | 0.84    7.10  | 0.75    6.67  | 1.39    11.93 | 1.73    14.00

Table 4. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C002181t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 0.40    6.25  | 0.44    5.54  | 0.52    8.23  | 0.76    9.11
08-09 | 0.20    4.03  | 0.11    3.45  | 0.36    6.75  | 0.54    8.14
09-10 | 0.70    5.39  | 0.36    5.46  | 1.53    8.69  | 1.54    9.35
10-11 | 0.47    3.38  | 0.31    3.19  | 0.81    4.58  | 1.06    7.89
11-12 | 0.42    3.66  | 0.41    2.90  | 0.77    6.86  | 0.85    7.06
12-13 | 0.47    5.56  | 0.28    5.66  | 0.92    8.52  | 1.21    9.29
13-14 | 0.40    4.73  | 0.44    4.70  | 0.64    6.96  | 0.81    7.86
14-15 | 0.64    6.20  | 0.53    5.77  | 1.01    7.80  | 1.14    9.00
15-16 | 0.45    4.95  | 0.35    5.50  | 0.82    6.82  | 1.04    7.11
16-17 | 0.77    6.78  | 0.84    7.24  | 1.20    9.55  | 1.25    9.62
17-18 | 0.78    4.23  | 0.75    3.83  | 1.41    5.29  | 1.52    8.13
18-19 | 0.54    4.49  | 0.43    4.04  | 1.09    7.73  | 1.18    9.00

Table 5. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C011145t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 0.68    6.56  | 0.53    6.41  | 1.10    9.67  | 1.38    13.68
08-09 | 2.88    15.30 | 2.77    18.55 | 4.62    22.96 | 5.08    23.76
09-10 | 5.06    11.80 | 5.23    6.93  | 6.48    20.45 | 7.21    24.90
10-11 | 1.51    9.01  | 1.73    4.70  | 2.62    17.72 | 2.86    20.50
11-12 | 2.50    7.44  | 3.03    5.06  | 3.56    12.40 | 4.46    15.39
12-13 | 2.31    7.97  | 2.34    3.99  | 2.87    18.46 | 3.14    22.92
13-14 | 0.75    4.25  | 0.74    2.60  | 1.27    8.73  | 1.70    9.93
14-15 | 1.08    9.14  | 1.06    9.73  | 1.54    12.39 | 2.04    14.67
15-16 | 1.10    6.55  | 1.05    3.47  | 1.96    10.29 | 2.48    17.80
16-17 | 3.94    7.81  | 3.84    7.38  | 5.44    11.96 | 6.33    14.58
17-18 | 2.13    6.03  | 1.93    4.12  | 2.21    9.41  | 2.91    12.62
18-19 | 2.85    9.47  | 3.22    12.02 | 3.82    15.70 | 4.10    17.98

Table 6. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C003061t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 0.66    6.14  | 0.58    4.56  | 1.09    11.14 | 1.15    13.98
08-09 | 1.07    5.37  | 0.68    4.43  | 1.89    9.46  | 2.05    12.12
09-10 | 2.66    2.89  | 2.09    2.46  | 4.76    4.34  | 5.50    6.46
10-11 | 0.86    3.93  | 0.65    2.96  | 1.80    6.48  | 1.87    8.43
11-12 | 1.12    4.60  | 0.76    3.11  | 1.63    6.65  | 2.84    13.73
12-13 | 1.06    4.27  | 1.31    3.70  | 1.98    7.76  | 2.16    11.21
13-14 | 2.20    8.07  | 2.59    8.41  | 3.40    11.54 | 3.64    13.04
14-15 | 1.24    5.31  | 1.09    4.50  | 2.01    9.09  | 2.46    12.26
15-16 | 0.47    4.88  | 0.37    3.01  | 0.77    7.91  | 0.88    11.04
16-17 | 1.20    3.75  | 1.35    4.41  | 1.79    5.67  | 2.17    8.12
17-18 | 1.25    2.96  | 1.41    3.18  | 1.74    4.47  | 3.02    6.67
18-19 | 1.85    5.22  | 0.65    4.23  | 5.15    1.03  | 5.85    14.19

Table 7. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C022161t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 1.20    8.03  | 0.86    7.49  | 2.00    13.16 | 2.51    17.73
08-09 | 1.49    3.55  | 1.32    4.01  | 2.88    5.95  | 3.05    6.42
09-10 | 1.24    8.56  | 1.23    5.95  | 1.91    15.17 | 2.23    22.50
10-11 | 1.36    4.62  | 1.40    5.45  | 1.96    6.73  | 2.83    6.91
11-12 | 0.82    5.80  | 0.37    4.71  | 1.79    7.21  | 2.24    11.86
12-13 | 2.11    4.77  | 1.74    2.69  | 3.08    9.44  | 5.66    10.84
13-14 | 0.83    6.06  | 0.75    4.80  | 1.44    8.92  | 1.79    13.30
14-15 | 1.54    5.98  | 1.66    5.53  | 1.81    9.60  | 3.06    14.06
15-16 | 1.15    5.12  | 0.66    5.55  | 1.78    8.25  | 3.00    8.96
16-17 | 1.91    5.64  | 1.55    5.29  | 2.79    7.16  | 3.69    8.40
17-18 | 0.82    5.25  | 0.80    3.86  | 1.14    9.71  | 1.49    12.22
18-19 | 1.00    3.35  | 1.14    2.83  | 1.71    5.73  | 1.82    6.96

Table 8. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C001025t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 0.99    4.78  | 0.70    3.79  | 2.33    7.18  | 2.51    11.47
08-09 | 1.19    3.96  | 1.27    4.17  | 1.75    5.57  | 1.91    7.72
09-10 | 0.52    3.74  | 0.41    3.66  | 0.78    7.74  | 1.18    9.18
10-11 | 1.25    3.80  | 0.79    3.78  | 2.57    5.77  | 3.49    7.23
11-12 | 0.50    5.40  | 0.22    3.05  | 0.90    8.80  | 1.45    16.08
12-13 | 0.31    3.74  | 0.30    3.37  | 0.46    6.00  | 0.58    7.31
13-14 | 0.64    4.09  | 0.54    4.76  | 1.06    6.13  | 1.31    7.74
14-15 | 0.32    5.99  | 0.18    6.61  | 0.49    9.41  | 0.91    11.36
15-16 | 1.26    3.90  | 1.12    4.20  | 1.71    5.82  | 3.31    6.10
16-17 | 0.74    3.77  | 0.55    3.86  | 1.28    6.09  | 1.54    6.57
17-18 | 0.90    4.99  | 0.79    4.38  | 1.21    7.54  | 1.68    8.25
18-19 | 0.51    4.24  | 0.51    3.49  | 0.64    7.39  | 0.85    9.54

Table 9. Errors for Updating 12 Successive Missing Values with Both-side July-August Day-hour Models for C093001t

Hour  |                    Prediction Errors
(1)   |    Average    |    50th %     |    85th %     |    95th %
      | Reg.    ANN   | Reg.    ANN   | Reg.    ANN   | Reg.    ANN
      | (2)     (3)   | (4)     (5)   | (6)     (7)   | (8)     (9)
07-08 | 2.59    41.61 | 1.57    25.99 | 5.50    69.14 | 6.21    117.41
08-09 | 2.01    12.17 | 1.48    10.47 | 3.86    15.87 | 4.21    26.94
09-10 | 3.37    11.18 | 1.60    6.08  | 5.28    17.36 | 8.79    27.68
10-11 | 1.98    12.76 | 1.82    11.96 | 3.52    25.65 | 4.28    30.78
11-12 | 0.40    6.66  | 0.32    5.92  | 0.78    9.95  | 0.80    14.46
12-13 | 0.76    6.56  | 0.45    7.06  | 1.43    10.54 | 2.04    11.04
13-14 | 0.93    4.06  | 0.95    3.55  | 1.29    6.48  | 1.90    7.72
14-15 | 1.11    5.39  | 0.89    4.21  | 2.06    8.05  | 2.42    11.79
15-16 | 2.22    10.65 | 2.32    12.82 | 3.21    14.32 | 3.69    15.40
16-17 | 0.54    7.41  | 0.25    6.66  | 1.19    11.64 | 1.23    15.73
17-18 | 4.35    8.39  | 4.78    7.88  | 6.38    13.83 | 8.35    15.76
18-19 | 0.98    9.95  | 0.97    12.40 | 1.05    15.93 | 1.77    17.61

Table 10. Comparing July-August Day-hour Models based on Data from before and after Failure with Corresponding Factor and ARIMA Models for C002181t

Hour  |                                                      Prediction Errors
(1)   |             Average              |              50th %              |              85th %              |              95th %
      | FACT   ARIMA   REG    ANN        | FACT   ARIMA   REG    ANN        | FACT   ARIMA   REG    ANN        | FACT   ARIMA   REG    ANN
      | (2)    (3)     (4)    (5)        | (6)    (7)     (8)    (9)        | (10)   (11)    (12)   (13)       | (14)   (15)    (16)   (17)
07-08 | 6.99   4.86    0.40   6.25       | 8.01   4.93    0.44   5.54       | 9.45   7.72    0.52   8.23       | 11.58  9.13    0.76   9.11
08-09 | 7.30   4.67    0.20   4.03       | 6.54   4.87    0.11   3.45       | 11.79  6.71    0.36   6.75       | 15.39  9.73    0.54   8.14
09-10 | 5.44   3.03    0.70   5.39       | 5.12   2.08    0.36   5.46       | 8.73   4.11    1.53   8.69       | 9.18   7.93    1.54   9.35
10-11 | 4.18   2.87    0.47   3.38       | 3.97   1.91    0.31   3.19       | 6.67   3.83    0.81   4.58       | 9.17   8.43    1.06   7.89
11-12 | 7.00   3.71    0.42   3.66       | 6.71   2.26    0.41   2.90       | 13.01  9.16    0.77   6.86       | 13.75  9.43    0.85   7.06
12-13 | 6.62   3.43    0.47   5.56       | 7.29   3.19    0.28   5.66       | 10.35  5.25    0.92   8.52       | 11.01  6.03    1.21   9.29
13-14 | 5.74   3.31    0.40   4.73       | 5.18   2.43    0.44   4.70       | 9.64   6.73    0.64   6.96       | 11.09  7.60    0.81   7.86
14-15 | 2.48   4.47    0.64   6.20       | 1.96   3.54    0.53   5.77       | 4.38   7.58    1.01   7.80       | 5.96   9.83    1.14   9.00
15-16 | 3.65   2.38    0.45   4.95       | 2.90   1.47    0.35   5.50       | 5.38   4.49    0.82   6.82       | 8.75   5.20    1.04   7.11
16-17 | 3.45   4.46    0.77   6.78       | 2.10   4.14    0.84   7.24       | 6.91   6.48    1.20   9.55       | 7.48   7.73    1.25   9.62
17-18 | 3.04   4.03    0.78   4.23       | 2.58   2.84    0.75   3.83       | 5.21   8.37    1.41   5.29       | 8.02   8.83    1.52   8.13
18-19 | 6.10   5.74    0.54   4.49       | 5.96   5.95    0.43   4.04       | 9.68   8.10    1.09   7.73       | 11.17  13.47   1.18   9.00