benchmark feedback
As I don't have Twitter, I would like to leave feedback about the benchmark here. I strongly advise scaling up the benchmark so that a single time measurement is above 1s. Timings in milliseconds do not stress the algorithms well, as they are more affected by overheads. I am not saying to ignore overheads, but to give a bigger picture of scalability. This is also mentioned in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-benchmarking.html, in case you haven't seen that doc. Warm regards
Thanks Jan, in principle I agree with you. I just ran this on the NYC Taxi dataset, which has been used a lot by people presenting arrow at various conferences this year and also on Twitter, and it is hard to get execution times longer than 1 second on it. I will soon add some modifications where I replicate the data a bit to reach longer execution times.
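The replication idea above can be sketched as follows. This is a minimal, hypothetical example (the data.table here is a stand-in for the actual taxi data, and the 10x replication factor is arbitrary) showing how stacking copies of a dataset lengthens a grouped-aggregation timing past the 1s mark:

```r
# Sketch: replicate a dataset to push benchmark timings above 1 second.
# `d` stands in for the NYC Taxi data; adjust the replication factor as needed.
library(data.table)

d  <- data.table(id = sample(1e4, 1e6, TRUE), x = rnorm(1e6))
dk <- rbindlist(rep(list(d), 10))  # 10x replication, same group structure

system.time(dk[, .(mx = max(x)), by = id])  # grouped max on the enlarged data
```

Because the replicated rows reuse the same group keys, the number of groups stays fixed while the per-group work grows, so overheads shrink relative to the measured computation.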
By the way, in my experience arrow is considerably faster than dplyr in basically all configurations, which curiously is not reflected in the data.table benchmarks (https://h2oai.github.io/db-benchmark/).
You might have been using a different version of arrow than the one in db-benchmark.
This is just one of many such cases; a very recent example: https://stackoverflow.com/questions/73403038/r-data-table-rolling-max#comment129672489_73408459
> By the way, in my experience arrow is considerably faster than dplyr in basically all constellations, which curiously is not reflected in the data.table benchmarks (https://h2oai.github.io/db-benchmark/).
The H2O benchmarks are out of date by now.
Please see the latest benchmarks now published by DuckDB.
Arrow is indeed faster than dplyr.
collapse has now been included in the DuckDB benchmarks, so I also regard this issue as resolved.