fix: Chronos inference in foundation ts arena
Thank you for evaluating Chronos again. It's great to see it performing accurately on this benchmark as well.
We found some problems with the way inference is being done for Chronos:
- Excess `NaN` padding was being applied to short time series, which is not required and slows the model down significantly.
- The original time series were being cast to `bfloat16`, which results in loss of information and may lead to poor accuracy.
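To illustrate the second point, here is a minimal sketch of how much information a `bfloat16` cast can discard. It emulates `bfloat16` rounding with the standard library only; the helper `to_bfloat16` and the sample values are hypothetical and are not taken from the evaluation code.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the sign, the 8 exponent bits, and only the top 7
    mantissa bits of a float32, so we round away the low 16 bits.
    Illustrative helper for finite inputs, not part of this PR.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias for round-to-nearest-even on the 16 bits being dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical series values: nearby observations collapse after the cast.
series = [1001.0, 1002.0, 1003.0, 1004.0]
print([to_bfloat16(v) for v in series])  # [1000.0, 1000.0, 1004.0, 1004.0]
print(to_bfloat16(1234.5))               # 1232.0
```

With only 8 bits of significand precision, values between 512 and 1024 are spaced 4 apart, so distinct observations in a series can become identical before the model ever sees them.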
This PR fixes these issues. The following table shows a comparison of Chronos (Large)'s performance before (taken from the original table in this repo) and after these fixes, and also reports the performance of other variants of Chronos. These experiments were performed on a g5.4xlarge instance, as in the original study.
| Model | MASE (Monthly) | MASE (Weekly) | MASE (Daily) | MASE (Hourly) | Inference time, min (Monthly) | Inference time, min (Weekly) | Inference time, min (Daily) | Inference time, min (Hourly) |
|---|---|---|---|---|---|---|---|---|
| Chronos-Large (Before) | 0.960 | 0.709 | 0.652 | 0.735 | 38.581 | 5.081 | 7.908 | 11.662 |
| Chronos-Large | 0.950 | 0.704 | 0.652 | 0.654 | 5.402 | 5.054 | 7.882 | 11.500 |
| Chronos-Base | 0.966 | 0.709 | 0.663 | 0.646 | 1.966 | 1.712 | 2.940 | 4.714 |
| Chronos-Small | 0.982 | 0.724 | 0.669 | 0.671 | 0.689 | 0.550 | 0.986 | 1.818 |
| Chronos-Mini | 0.968 | 0.736 | 0.682 | 0.729 | 0.476 | 0.356 | 0.688 | 1.371 |
| Chronos-Tiny | 0.976 | 0.765 | 0.686 | 0.799 | 0.316 | 0.212 | 0.427 | 0.965 |
We observe:
- improvements in the MASE for Monthly (~1%) and Hourly (~11%) datasets.
- a significant improvement (~38 min to ~5 min) in the inference time for the Monthly subset, which has many very short time series.
- smaller Chronos models provide a quality-speed trade-off, with the Base model performing almost as well as Large while being much faster, and even the Mini model performing better than most baselines in the original study.
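The percentages above can be checked directly from the table. The snippet below is a small sketch of that arithmetic; the function name `relative_improvement` is ours, and the numbers are the Chronos-Large rows before and after the fix.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric such as MASE."""
    return (before - after) / before


# Chronos-Large MASE, taken from the table (before vs. after the fix).
mase_before = {"Monthly": 0.960, "Hourly": 0.735}
mase_after = {"Monthly": 0.950, "Hourly": 0.654}

for subset, before in mase_before.items():
    gain = relative_improvement(before, mase_after[subset])
    print(f"{subset}: MASE reduced by {gain:.1%}")

# Monthly inference time in minutes, before and after the fix.
speedup = 38.581 / 5.402
print(f"Monthly inference speed-up: {speedup:.1f}x")
```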
Here's what the average MASE ranking plots look like before and after the fix:
After the fix, Chronos-Large achieves the best overall rank (center plot). Chronos-Base obtains the same overall ranking as TimesFM and TimeGPT (right plot).
For the fidelity of the study, we recommend that the authors update their results and discussions accordingly, ideally after an independent verification with the latest code change (see usage below). Thank you again for your effort!
Usage
- Download the data and set up the environment as described here.
- Run `python eval-chronos.py` to re-evaluate (only) Chronos.