
More datasets and regression problems

PhilippPro opened this issue on Feb 12, 2018 · 4 comments

Did you consider using more datasets?

And how about regression problems?

There is, for example, this benchmarking suite, accessible via the OpenML packages: https://arxiv.org/abs/1708.03731
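
For reference, a minimal sketch of how one might pull that suite with the Python `openml` package (the `"OpenML100"` alias and the `NumberOfInstances` quality key used below are assumptions, not something stated in this thread):

```python
import openml

# Fetch the benchmarking suite from the paper linked above; "OpenML100" is an
# assumed alias for it on the OpenML server.
suite = openml.study.get_suite("OpenML100")

# Print the name and row count of the first few datasets in the suite.
for task_id in suite.tasks[:5]:
    dataset = openml.tasks.get_task(task_id).get_dataset()
    print(dataset.name, dataset.qualities.get("NumberOfInstances"))
```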

PhilippPro · Feb 12, 2018

Re more datasets: https://github.com/szilard/GBM-perf/issues/4#issuecomment-362651796

My focus now is on the top GBM implementations (including on GPUs). Doing more by doing less. I dockerized the most important things in a separate repo: https://github.com/szilard/GBM-perf

Also read this summary I wrote recently: https://github.com/szilard/benchm-ml#summary

szilard · Feb 12, 2018

I just watched your talk, very interesting.

In my opinion, one of the directions that should be developed further (and which you already mentioned) is AutoML: packages for automatic tuning, automatic ensembling, automatic feature engineering, etc., in a time-efficient way.

PhilippPro · Feb 13, 2018

Oh, I forgot to mention in my last comment: re OpenML, those datasets are ridiculously small: https://gist.github.com/szilard/b82635fa9060227514af3423b3225a29

There is also another collection of datasets, but those are also too small: https://gist.github.com/szilard/d8279374646fb5f372317db2a4074f2f

I would want a set of datasets with sizes from 1K to 10M rows, with a median size of 100K (so it should cover 1K-10K-100K-1M-10M).
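
A rough sketch (again assuming the Python `openml` package and its `NumberOfInstances` column in the dataset listing) of how one could check which OpenML datasets fall into that 1K-10M range:

```python
import openml

# List all OpenML datasets as a pandas DataFrame.
datasets = openml.datasets.list_datasets(output_format="dataframe")

# Keep only datasets in the 1K-10M row range discussed above.
in_range = datasets[
    (datasets["NumberOfInstances"] >= 1_000)
    & (datasets["NumberOfInstances"] <= 10_000_000)
]
print(len(in_range), "datasets in the 1K-10M range")
print("median size:", in_range["NumberOfInstances"].median())  # target was ~100K
```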

RE AutoML: Indeed, that's super interesting. However, benchmarking that is way more difficult because of the tricky tradeoff between computation time and accuracy. I've been looking at a few solutions, but nothing formal (just tried them out). Btw most of them have GBMs as building blocks, so benchmarking the components can already give you some idea of performance.

Btw when you say my talk, is it the KDD one? That's probably the most up to date, though my experiments with AutoML and a few other things/results happened after the talk.

szilard · Feb 13, 2018

Ok, there are only a few datasets with size above 10K in the OpenML or PMLB benchmarking suites.

The AutoML solutions should have a time-constraint parameter, so that, e.g., one can compare the results of these algorithms after 1 hour. Of course, in reality they often lack this feature.
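
As one example of such a parameter, H2O AutoML exposes `max_runtime_secs`; a hedged sketch of a fixed 1-hour budget run (the file name and target column below are placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")  # placeholder dataset

# Limit the whole AutoML run to a 1-hour budget so results are comparable
# across tools under the same time constraint.
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(y="target", training_frame=train)  # "target" is a placeholder label column

print(aml.leaderboard)
```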

Yes, the KDD one, quite inspiring.

PhilippPro · Feb 14, 2018