Extend Chronos evaluation to all 28 datasets from the paper
- Extend the benchmark suite to all 28 datasets from Benchmark II in the Chronos paper.
- Provide an option to run experiments in parallel on AWS Batch using Metaflow. This change is necessary because the runtime of `StatisticalEnsemble` exceeds 24 hours on some datasets, which makes sequential evaluation infeasible.
- Instead of using relative imports, wrap the code in `src/` into a package `src/eval_utils` that can be installed via `pip` using `pyproject.toml`.
- Add the files necessary to build the Docker container used by Metaflow (`Dockerfile`, `build_docker.sh`, `.dockerignore`).
- Add a fallback model for `StatsForecast` to fix crashes caused by the AutoARIMA model on some datasets (e.g., `car_parts`).
- Cap the context length of statistical models to the last 5000 observations of each series to avoid extremely high runtimes.
- Update the README with full results and instructions on how to run the code with Metaflow.
Extended comparison of Chronos against the statistical ensemble
We present an extension of the original comparison by Nixtla of Chronos [1] against the SCUM ensemble [2]. In this analysis of over 200K unique time series across 28 datasets from Benchmark II in the Chronos paper [1], we show that zero-shot Chronos models perform comparably to this strong ensemble of 4 statistical models while being significantly faster on average. We follow the original study as closely as possible, including loading task definitions from GluonTS and computing metrics using utilsforecast.
Empirical Evaluation
This study considers over 200K unique time series from Benchmark II in the Chronos paper, spanning various time series domains, frequencies, history lengths, and prediction horizons. Chronos did not use these datasets during training, so this is a zero-shot evaluation of Chronos against the statistical ensemble fitted on these datasets. We report results for two sizes of Chronos, Large and Mini, to highlight the trade-off between forecast quality and inference speed. As in the original benchmark, we include comparisons to the seasonal naive baseline. For each model, we also report the aggregated relative score, which is the geometric mean across datasets of the relative improvement over seasonal naive (see Sec. 5.4 of [1] for details).
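The aggregated relative score can be computed as a short function. A minimal sketch with made-up numbers, assuming per-dataset metric values for a model and for the seasonal naive baseline (the scores below are illustrative, not results from this benchmark):

```python
import math

# Aggregate per-dataset scores into a single number: take the ratio of the
# model's metric to the baseline's metric on each dataset, then the geometric
# mean of those ratios. Values below 1 mean the model beats the baseline.

def aggregated_relative_score(model_scores, baseline_scores):
    ratios = [m / b for m, b in zip(model_scores, baseline_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical CRPS values on two datasets.
model = [0.8, 0.5]
baseline = [1.0, 1.0]
print(round(aggregated_relative_score(model, baseline), 3))  # → 0.632
```

The geometric mean is used instead of the arithmetic mean so that the aggregate is symmetric in improvements and degradations and is not dominated by datasets where the baseline's metric is large.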
Results
The CRPS, MASE, sMAPE, and inference time (in seconds) for each model across the 28 datasets are tabulated below. The best and second-best results are highlighted in bold and underlined, respectively. Note that the use of sMAPE is discouraged by forecasting experts; we report it here only for completeness and parity with the previous benchmark.
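For reference, two of the reported metrics can be written down directly from their standard definitions. The benchmark itself computes metrics with utilsforecast; the plain-Python versions below are sketches of the textbook formulas, not the benchmark's implementation:

```python
# sMAPE: mean of 2*|y - yhat| / (|y| + |yhat|) over the forecast horizon.
def smape(y, yhat):
    return sum(2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(y, yhat)) / len(y)

# MASE: forecast MAE scaled by the in-sample MAE of the seasonal naive
# forecast on the training series (`season` is the seasonal period).
def mase(y, yhat, y_train, season=1):
    mae = sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)
    scale = sum(abs(y_train[i] - y_train[i - season])
                for i in range(season, len(y_train))) / (len(y_train) - season)
    return mae / scale

y_train = [10, 12, 14, 16]
y, yhat = [18, 20], [17, 21]
print(mase(y, yhat, y_train))  # → 0.5
```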
Notes
- The original study by Nixtla used `batch_size=8` for all Chronos models. However, on the `g5.2xlarge` instance used in the benchmark, we can safely use a batch size of 16 for Chronos (large) and a batch size of 64 for Chronos (mini).
- The original Nixtla benchmark re-used compiled Numba code across experiments, which is not feasible in the current setup because of the distributed compute environment. Therefore, the reported runtime for `StatisticalEnsemble` is on average ~45 seconds higher than in the original benchmark. This does not affect the overall conclusions or the runtime ranking of `StatisticalEnsemble` relative to the Chronos models.
- Due to differences in task definitions and metric implementations, the numbers in the above table are not directly comparable with the results reported in the Chronos paper.
References
[1] Chronos: Learning the Language of Time Series
[2] A Simple Combination of Univariate Models
@shchur, any comments on our update?
Hi @mergenthaler, can you please clarify which update you are referring to?
@shchur Closing this PR since it depends on https://github.com/shchur/nixtla/pull/1. Feel free to reopen once we have comments on the dependent PR.