Using backtesting for model selection and evaluation

Although developed to generate forecasts, the Demand Modeler workflow produces more than predictions about the future. While determining the best model to use for generating forecasts and assessing how well that model’s forecasts are expected to perform, the app also generates predictions about the past, via backtesting.

In backtesting, a set of models is assessed to determine each model’s forecasting accuracy. The backtesting procedure splits the historical data into training and test periods and evaluates each model’s performance on the held-out periods. This process differs from the typical cross-validation methodology used in machine learning: because time-series observations are temporally dependent, the traditional random splitting of samples into a train and a test set can introduce data leakage.
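To make the leakage point concrete, here is a minimal sketch in plain Python/NumPy (not Demand Modeler code) contrasting the two splitting strategies:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # 100 time-ordered observations, indexed 0..99

# Random split: some training indices fall AFTER some test indices,
# so a model fit on the training set has effectively seen the future.
shuffled = rng.permutation(n)
train_random, test_random = shuffled[:80], shuffled[80:]

# Chronological split: every training observation strictly precedes
# every test observation, mirroring how forecasts are made in practice.
train_chrono, test_chrono = np.arange(80), np.arange(80, 100)
```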

Demand Modeler uses the Expanding Window Fold Generator as its backtesting framework. As shown in the following graph, this framework creates a number of equally sized test sets, known as folds, and evaluates the accuracy of predictions generated by models trained on the data preceding each fold. Across the folds, the training window expands from a starting size to a maximum size. This method creates an adequate number of training-test pairs while also maximizing the amount of data the algorithms receive, providing generalizable test results to guide future forecasts.
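As a rough sketch of this idea (the function name, signature, and fold placement below are assumptions for illustration, not Demand Modeler’s API), an expanding-window fold generator can be written as a short Python generator that keeps every test fold the same size while letting the training window grow:

```python
from typing import Iterator, Tuple

import numpy as np


def expanding_window_folds(
    n_samples: int, n_folds: int, window_size: int, jump: int, gap: int = 0
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_indices, test_indices) pairs, most recent fold first.

    Illustrative sketch: each earlier fold is shifted back by `jump`
    steps, and every training window runs from the start of the series
    up to `gap` steps before its fold begins.
    """
    for k in range(n_folds):
        test_end = n_samples - k * jump
        test_start = test_end - window_size
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("not enough history for this configuration")
        yield np.arange(train_end), np.arange(test_start, test_end)
```

Because every training window starts at the first observation, reading the folds in chronological order shows the training set expanding toward its maximum size, as in the graph above.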

The following example illustrates how a backtesting framework can be configured using some of the options available in Demand Modeler:

The first scenario consists of two folds (Number of Folds=2) of five time steps each (Window Size=5), offset by two time steps (Jump=2). No gap exists between the training time steps and the fold time steps (Gap=0). The fold time steps are highlighted in red, and the training time steps are highlighted in yellow.

The second scenario also consists of two folds (Number of Folds=2) of five time steps each (Window Size=5), offset by two time steps (Jump=2). However, a one-period gap exists between the training time steps and the fold time steps (Gap=1). Therefore, the unhighlighted time step preceding each fold is used for neither training nor evaluation.
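Assuming a historical series of, say, 20 time steps (the series length is arbitrary and not part of the scenarios above), the boundaries of both configurations can be computed directly. The helper below is hypothetical and simply mirrors the option names:

```python
def fold_bounds(n_samples, n_folds, window_size, jump, gap):
    """Return (train_end, test_start, test_end) per fold, newest first."""
    bounds = []
    for k in range(n_folds):
        test_end = n_samples - k * jump
        test_start = test_end - window_size
        bounds.append((test_start - gap, test_start, test_end))
    return bounds

# Scenario 1: no gap, so each training window runs right up to its fold.
print(fold_bounds(20, n_folds=2, window_size=5, jump=2, gap=0))
# [(15, 15, 20), (13, 13, 18)]

# Scenario 2: one unused step sits between each training window and its fold.
print(fold_bounds(20, n_folds=2, window_size=5, jump=2, gap=1))
# [(14, 15, 20), (12, 13, 18)]
```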

This example shows how a set of folds of a single type can be configured. In reality, however, multiple decisions are required for an end-to-end production-grade forecasting system, such as:

  • hyper-parameter tuning

  • ensembling strategy

  • estimator selection

  • assessment of the forecast generalization error

Demand Modeler repeats the backtesting procedure multiple times, on different splits of the historical data, to achieve these objectives. This process is referred to as nested cross-validation, in which the time-series data is split into the following sets:

  • a training set, used to fit each candidate model

  • a validation set, used for model selection and hyper-parameter tuning

  • a holdout (test) set, used to assess the forecast generalization error
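As a minimal sketch of that nested arrangement (assuming a hypothetical model interface whose fit(series) returns the fitted model and whose predict(horizon) returns a forecast array, not Demand Modeler’s internals), an inner backtest on each outer training window selects a model, and the outer folds estimate its generalization error:

```python
import numpy as np


def chronological_folds(n_samples, n_folds, window_size):
    """Contiguous expanding-window folds, oldest first (illustrative)."""
    for k in range(n_folds):
        test_end = n_samples - (n_folds - 1 - k) * window_size
        test_start = test_end - window_size
        yield np.arange(test_start), np.arange(test_start, test_end)


def nested_backtest(y, candidates, outer_folds=3, inner_folds=2, window=5):
    """candidates: dict mapping a name to a zero-argument model factory."""
    outer_errors = []
    for train_idx, test_idx in chronological_folds(len(y), outer_folds, window):
        inner_y = y[train_idx]
        # Inner loop: pick the candidate with the lowest mean squared
        # error, using only data inside the outer training window.
        best_name = min(
            candidates,
            key=lambda name: np.mean([
                np.mean((candidates[name]().fit(inner_y[tr]).predict(len(te))
                         - inner_y[te]) ** 2)
                for tr, te in chronological_folds(len(inner_y), inner_folds, window)
            ]),
        )
        # Outer loop: score the selected model on data it never saw.
        forecast = candidates[best_name]().fit(inner_y).predict(len(test_idx))
        outer_errors.append(np.mean((forecast - y[test_idx]) ** 2))
    return float(np.mean(outer_errors))  # estimated generalization error
```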
