
Implementing TPC-H

Open amueller opened this issue 2 years ago • 4 comments

Has anyone thought about implementing TPC-H using the dataframe API? I think this would be very useful to test the scope, and also to draw attention to the dataframe API. It would mean that anyone implementing the dataframe API could immediately get an apples-to-apples benchmark of performance.

Whether TPC-H is a good benchmark for dataframes is maybe not entirely clear, but it's the best there is right now AFAIK.

If we can make it so that polars, modin and duckdb run their comparisons via the dataframe API, I think that would be pretty sweet.
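To make the idea concrete, here is a minimal sketch of writing a query once and timing it across backends. It does not use the dataframe API itself: `RowFrame`, `ColumnFrame`, `filter`, `sum_product` and `benchmark` are hypothetical names invented for illustration, and the handful of `lineitem` rows are made up rather than generated TPC-H data. The query follows the shape of TPC-H Q6 (sum of `l_extendedprice * l_discount` over a date, discount and quantity filter).

```python
import time
from datetime import date

# Hand-made stand-in for the TPC-H `lineitem` table (assumption: not real benchmark data).
ROWS = [
    {"l_quantity": 10, "l_extendedprice": 1000.0, "l_discount": 0.06, "l_shipdate": date(1994, 3, 1)},
    {"l_quantity": 30, "l_extendedprice": 2000.0, "l_discount": 0.06, "l_shipdate": date(1994, 5, 1)},
    {"l_quantity": 5, "l_extendedprice": 500.0, "l_discount": 0.02, "l_shipdate": date(1994, 7, 1)},
    {"l_quantity": 8, "l_extendedprice": 800.0, "l_discount": 0.05, "l_shipdate": date(1995, 2, 1)},
    {"l_quantity": 20, "l_extendedprice": 3000.0, "l_discount": 0.07, "l_shipdate": date(1994, 12, 31)},
]


class RowFrame:
    """Hypothetical row-oriented backend: a list of dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        return RowFrame(r for r in self.rows if predicate(r))

    def sum_product(self, a, b):
        return sum(r[a] * r[b] for r in self.rows)


class ColumnFrame:
    """Hypothetical column-oriented backend: a dict of column name -> list."""

    def __init__(self, columns):
        self.columns = columns

    @classmethod
    def from_rows(cls, rows):
        return cls({name: [r[name] for r in rows] for name in rows[0]})

    def filter(self, predicate):
        names = list(self.columns)
        kept = [
            values
            for values in zip(*self.columns.values())
            if predicate(dict(zip(names, values)))
        ]
        return ColumnFrame({n: [v[i] for v in kept] for i, n in enumerate(names)})

    def sum_product(self, a, b):
        return sum(x * y for x, y in zip(self.columns[a], self.columns[b]))


def q6(frame):
    """Q6-style revenue query, written once against the shared interface."""

    def keep(r):
        return (
            date(1994, 1, 1) <= r["l_shipdate"] < date(1995, 1, 1)
            and 0.05 <= r["l_discount"] <= 0.07
            and r["l_quantity"] < 24
        )

    return frame.filter(keep).sum_product("l_extendedprice", "l_discount")


def benchmark(backends, query, repeats=3):
    """Run the same query on every backend; keep the result and the best wall time."""
    results = {}
    for name, frame in backends.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            value = query(frame)
            best = min(best, time.perf_counter() - start)
        results[name] = (value, best)
    return results


backends = {"rows": RowFrame(ROWS), "columns": ColumnFrame.from_rows(ROWS)}
results = benchmark(backends, q6)
for name, (value, seconds) in results.items():
    print(f"{name}: revenue={value:.2f} best_time={seconds:.6f}s")
```

The point of the sketch is only that `q6` is written once and every backend runs the same code, so any timing difference comes from the backend rather than from how the query was expressed.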

You can see the polars implementation of TPC-H here: https://github.com/pola-rs/tpch and the results here: https://www.pola.rs/benchmarks.html

amueller, Sep 28 '23 15:09

Great suggestion!

I'll try this out and see how far I get. It'll likely highlight some missing areas.

This would require the dataframe-api to be as close to a zero-cost abstraction as possible. https://github.com/data-apis/dataframe-api/pull/249 would bring us a lot closer to that goal, so if you have any input there I'd really appreciate it

thanks 🙏

MarcoGorelli, Sep 28 '23 16:09

Maybe it should be TPC-DS, or maybe TPCx-BB

amueller, Sep 29 '23 00:09

We've added a couple of TPC-H examples here:

https://github.com/data-apis/dataframe-api/tree/main/spec/API_specification/examples/tpch

MarcoGorelli, Oct 27 '23 15:10