
Nice summary table, why no benchmarking leaderboards?

Open oliverangelil opened this issue 11 months ago • 2 comments

Is your feature request related to a current problem? Please describe.
There are many different models in darts, and their characteristics are compared nicely in the summary table in the README.md; however, some of these models will perform better than others on certain prediction tasks. How is someone supposed to know which one to pick? The easy answer would be "try them all and see what works best for your use case", but that is a time-consuming approach for anyone coming to this package.

Describe proposed solution
How about your own benchmarking leaderboard, or contributing to an existing one? For example, see this.

oliverangelil · Feb 13 '25

Hi @oliverangelil,

This request is kind of tracked by #1366, and @Loudegaste actually created a lot of nice materials here.

As pointed out in many of these conversations, it is very difficult, if not impossible, to benchmark models fairly, even on a limited number of datasets, since you can always further improve the feature engineering depending on a model's architecture and subtleties. In the documentation, we recommend starting with regression models and then experimenting with deep learning models; basing your decision on some arbitrary benchmark/leaderboard might not be the best option either.
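
To make that concrete, here is a minimal sketch of that workflow; the dataset choice and hyperparameters are purely illustrative, not tuned recommendations:

```python
# Minimal sketch: fit a simple regression baseline first, then compare a
# deep learning model on the same split. Settings are illustrative only.
from darts.datasets import AirPassengersDataset
from darts.metrics import mape
from darts.models import LinearRegressionModel, NBEATSModel

series = AirPassengersDataset().load()
train, val = series.split_before(0.8)

# Baseline: regression model on the last 12 lags.
reg = LinearRegressionModel(lags=12)
reg.fit(train)
reg_forecast = reg.predict(len(val))

# Deep learning candidate; epochs kept low for a quick first comparison.
dl = NBEATSModel(input_chunk_length=24, output_chunk_length=12, n_epochs=50)
dl.fit(train)
dl_forecast = dl.predict(len(val))

print(f"LinearRegression MAPE: {mape(val, reg_forecast):.2f}")
print(f"N-BEATS MAPE:          {mape(val, dl_forecast):.2f}")
```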

We'll think about it a bit more. The leaderboard you shared looks nice and already includes a lot of the models implemented in Darts, but we need to assess the amount of effort required to maintain such a thing.

madtoinou · Feb 13 '25

I think it would be most helpful to maintain a leaderboard over a few standard datasets, broken out by model (perhaps with/without additional feature engineering?). That could highlight how performance is tuned for each model, and it would also likely speed up the discovery of bugs or implementation issues that are difficult to unit test for but can severely impact performance, like #2408 or #2492.
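
As a rough sketch, a leaderboard run could be as simple as backtesting each model over each dataset and tabulating the scores; the models, datasets, and settings below are placeholders rather than a proposed standard:

```python
# Rough sketch of a leaderboard run: backtest every model over every
# dataset and print the scores. Model set, datasets, and settings are
# placeholders, not a proposed standard.
from darts.datasets import AirPassengersDataset, MonthlyMilkDataset
from darts.metrics import smape
from darts.models import ExponentialSmoothing, LinearRegressionModel

datasets = {
    "AirPassengers": AirPassengersDataset().load(),
    "MonthlyMilk": MonthlyMilkDataset().load(),
}

def make_models():
    # Fresh instances per dataset, since backtest refits the model.
    return {
        "ExponentialSmoothing": ExponentialSmoothing(),
        "LinearRegression": LinearRegressionModel(lags=12),
    }

for ds_name, series in datasets.items():
    for model_name, model in make_models().items():
        # Expanding-window backtest over the last 20% of each series.
        score = model.backtest(series, start=0.8, forecast_horizon=12, metric=smape)
        print(f"{ds_name:15s} {model_name:22s} sMAPE = {score:.2f}")
```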

Obviously it's unreasonable to expect the maintainers to run all of the experiments themselves, but if it can be set up in a way that allows easy submission and evaluation, there could be good community engagement.

eschibli · Feb 25 '25