Extend Chronos evaluation to all 28 datasets from the paper
- Extend the benchmark suite to all 28 datasets from Benchmark II in the Chronos paper.
- Provide an option to run experiments in parallel on AWS Batch using Metaflow. This change is necessary because the runtime of `StatisticalEnsemble` exceeds 24 hours on some datasets, which makes sequential evaluation infeasible.
- Instead of using relative imports, wrap the code in `src/` into a package `src/eval_utils` that can be installed via `pip` using `pyproject.toml`.
- Add the files necessary to build the Docker container used by Metaflow (`Dockerfile`, `build_docker.sh`, `.dockerignore`).
- Add a fallback model for `StatsForecast` to fix crashes caused by the AutoARIMA model on some datasets (e.g., `car_parts`).
- Cap the context length of statistical models to the last 5000 observations of each series to avoid extremely high runtimes.
- Update the README with full results and instructions on how to run the code with Metaflow.
Extended comparison of Chronos against the statistical ensemble
We present an extension of the original comparison by Nixtla of Chronos [1] against the SCUM ensemble [2]. In this analysis of over 200K unique time series across 28 datasets from Benchmark II in the Chronos paper [1], we show that zero-shot Chronos models perform comparably to this strong ensemble of 4 statistical models while being significantly faster on average. We follow the original study as closely as possible, including loading task definitions from GluonTS and computing metrics using utilsforecast.
Empirical Evaluation
This study considers over 200K unique time series from Benchmark II in the Chronos paper, spanning various time series domains, frequencies, history lengths, and prediction horizons. Chronos did not use these datasets during training, so this is a zero-shot evaluation of Chronos against the statistical ensemble fitted on these datasets. We report results for two sizes of Chronos, Large and Mini, to highlight the trade-off between forecast quality and inference speed. As in the original benchmark, we include comparisons to the seasonal naive baseline. For each model, we also report the aggregated relative score, which is the geometric mean across datasets of the relative improvement over seasonal naive (see Sec. 5.4 of [1] for details).
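The aggregated relative score can be computed as a short function. A minimal sketch with made-up numbers, assuming per-dataset metric values for a model and for the seasonal naive baseline (the scores below are illustrative, not results from this benchmark):

```python
import math

# Aggregate per-dataset scores into a single number: take the ratio of the
# model's metric to the baseline's metric on each dataset, then the geometric
# mean of those ratios. Values below 1 mean the model beats the baseline.

def aggregated_relative_score(model_scores, baseline_scores):
    ratios = [m / b for m, b in zip(model_scores, baseline_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical CRPS values on two datasets.
model = [0.8, 0.5]
baseline = [1.0, 1.0]
print(round(aggregated_relative_score(model, baseline), 3))  # → 0.632
```

The geometric mean is used instead of the arithmetic mean so that the aggregate is symmetric in improvements and degradations and is not dominated by datasets where the baseline's metric is large.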
Results
The CRPS, MASE, sMAPE, and inference time (in seconds) for each model across the 28 datasets are tabulated below. The best and second-best results are highlighted in bold and underlined, respectively. Note that the use of sMAPE is discouraged by forecasting experts; we report it here only for completeness and parity with the previous benchmark.
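For reference, two of the reported metrics can be written down directly from their standard definitions. The benchmark itself computes metrics with utilsforecast; the plain-Python versions below are sketches of the textbook formulas, not the benchmark's implementation:

```python
# sMAPE: mean of 2*|y - yhat| / (|y| + |yhat|) over the forecast horizon.
def smape(y, yhat):
    return sum(2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(y, yhat)) / len(y)

# MASE: forecast MAE scaled by the in-sample MAE of the seasonal naive
# forecast on the training series (`season` is the seasonal period).
def mase(y, yhat, y_train, season=1):
    mae = sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)
    scale = sum(abs(y_train[i] - y_train[i - season])
                for i in range(season, len(y_train))) / (len(y_train) - season)
    return mae / scale

y_train = [10, 12, 14, 16]
y, yhat = [18, 20], [17, 21]
print(mase(y, yhat, y_train))  # → 0.5
```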
Notes
- The original study by Nixtla used `batch_size=8` for all Chronos models. However, on the `g5.2xlarge` instance used in the benchmark, we can safely use a batch size of 16 for Chronos (large) and a batch size of 64 for Chronos (mini).
- The original Nixtla benchmark re-used compiled Numba code across experiments, which is not feasible in the current setup because of the distributed compute environment. Therefore, the reported runtime for `StatisticalEnsemble` is on average ~45 seconds higher than in the original benchmark. This does not affect the overall conclusions or the runtime ranking of `StatisticalEnsemble` relative to the Chronos models.
- Due to differences in task definitions and metric implementations, the numbers in the above table are not directly comparable with the results reported in the Chronos paper.
References
[1] Chronos: Learning the Language of Time Series
[2] A Simple Combination of Univariate Models
@shchur, any comments on our update?
Hi @mergenthaler, can you please clarify which update you are referring to?
@shchur Closing this PR since it depends on https://github.com/shchur/nixtla/pull/1. Feel free to reopen once we have comments on the dependent PR.