Availability at prediction time

The biggest challenge with related time series (numerical features, or causals) is whether the data will be available at prediction time. Generalization is a model's ability to perform well on new, previously unseen data drawn from the same distribution as the data used to train it. Because the goal is to find models that generalize, every input the model uses must also be available over the forecast horizon.

In many cases, forecasting the features is even harder than forecasting the target of interest. As a result, the future values of most features are not known at prediction time.

A common mistake in time series forecasting is to assume that “availability at prediction time” is not a serious problem and that forward-looking features can always be used. It can be tempting to simply impute the future values by forecasting the related time series (features). However, the model is then trained on actual feature values but fed forecasted ones in production, so it learns patterns that cannot be reproduced accurately in the forecast horizon. A production environment cannot be expected to reproduce the conditions under which predictions were generated on the model selection or evaluation sets. Using forecasted features at prediction time adds the feature forecast errors to the errors the target model already makes, which degrades the target forecasts. In the end, out-of-sample accuracy metrics cannot be trusted, because the model has learned relationships that it cannot exploit in the future.
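The gap between evaluation and production can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the data is synthetic, the target is an exact linear function of one feature, the model is a hand-written least-squares line, and the future feature values are "forecast" naively by repeating the last observed value. The helper names `fit_line` and `mae` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic data (assumption): the target depends exactly on the feature.
feature = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
target = [2.0 * v + 1.0 for v in feature]

# The first six points are history; the last two are the forecast horizon.
train_x, test_x = feature[:6], feature[6:]
train_y, test_y = target[:6], target[6:]

slope, intercept = fit_line(train_x, train_y)

# Evaluation that uses the ACTUAL future feature values.
# These values would not be known in production.
pred_with_actuals = [slope * v + intercept for v in test_x]
mae_with_actuals = mae(test_y, pred_with_actuals)

# Production-like setting: the feature itself must be forecast.
# Naive forecast (assumption): repeat the last observed feature value.
forecast_x = [train_x[-1]] * len(test_x)
pred_with_forecasts = [slope * v + intercept for v in forecast_x]
mae_with_forecasts = mae(test_y, pred_with_forecasts)

print(f"MAE with actual features:     {mae_with_actuals:.2f}")    # 0.00
print(f"MAE with forecasted features: {mae_with_forecasts:.2f}")  # 3.00
```

In this toy setting the evaluation with actual features reports essentially no error, while the same model fed forecasted features is off by 3 units on average. The size of the gap depends entirely on the made-up data; the point is only that the first number says nothing about the second.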

For this reason, when representative and accurate forecasts are the main objective, related time series are mainly used as lags and/or window statistics, which are computed only from values already observed when the forecast is made.
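As a minimal sketch of this idea, the functions below build a lag feature and a rolling-mean feature from a related series using only the Python standard library. The function names `lag_feature` and `rolling_mean_feature`, the `promo_spend` series, and the choice of a two-step horizon are hypothetical and chosen for illustration. The sketch assumes that the lag is at least as long as the forecast horizon, so that every feature value is already known when the forecast is made.

```python
from statistics import mean


def lag_feature(series, lag):
    """Value observed `lag` steps before each time step (None if unavailable)."""
    if lag < 1:
        raise ValueError("lag must be >= 1 so that only past values are used")
    return [series[t - lag] if t - lag >= 0 else None
            for t in range(len(series))]


def rolling_mean_feature(series, window, lag):
    """Mean of the `window` values ending `lag` steps before each time step."""
    if lag < 1 or window < 1:
        raise ValueError("lag and window must both be >= 1")
    out = []
    for t in range(len(series)):
        end = t - lag + 1        # exclusive end; last value used is series[t - lag]
        start = end - window
        out.append(mean(series[start:end]) if start >= 0 else None)
    return out


# Hypothetical related time series, invented for illustration.
promo_spend = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0]

horizon = 2  # assumed forecast horizon, in time steps

lagged = lag_feature(promo_spend, lag=horizon)
rolled = rolling_mean_feature(promo_spend, window=3, lag=horizon)

print(lagged)  # [None, None, 10.0, 12.0, 11.0, 13.0]
print(rolled)  # [None, None, None, None, 11.0, 12.0]
```

Because both features look back at least `horizon` steps, the values needed to forecast a given time step are the same in training and in production. The `None` entries mark time steps without enough history, which would typically be dropped before training.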

Last modified: Friday May 12, 2023
