Use the same column data types for all engines in benchmarks
Here's a snippet from the Polars groupby benchmarks:
pl.read_csv(src_grp, schema_overrides={"id4": pl.Int32, "id5": pl.Int32, "id6": pl.Int32, "v1": pl.Int32, "v2": pl.Int32})
Looks like id4, id5, id6, v1, and v2 are using Int32 columns.
Other engines, like Spark, are just inferring the column types:
x = spark.read.csv(src_grp, header=True, inferSchema='true')
I think we should either have all the engines infer the column data types or have all the engines specify the column data types explicitly, for a fairer comparison. It's not apples-to-apples if some engines are using int32 and others are using int64.
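As a concrete sketch, specifying the schema explicitly in Spark could look something like the following. Spark's IntegerType is a 32-bit integer, matching Polars' Int32; note that the full column layout here (id1-id3 as strings, v3 as a double) is my assumption about the groupby dataset, not taken from the benchmark code:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Assumed layout: id1-id3 strings, id4-id6/v1/v2 32-bit ints, v3 double.
schema = StructType([
    StructField("id1", StringType()),
    StructField("id2", StringType()),
    StructField("id3", StringType()),
    StructField("id4", IntegerType()),
    StructField("id5", IntegerType()),
    StructField("id6", IntegerType()),
    StructField("v1", IntegerType()),
    StructField("v2", IntegerType()),
    StructField("v3", DoubleType()),
])
x = spark.read.csv(src_grp, header=True, schema=schema)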
I agree that all engines should attempt to use the same types.
It's important to note, however, that some of the aggregations produce answers that overflow int32 and need int64, even though all the inputs are int32. I think Polars ran into this issue at some point.
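A minimal illustration of the overflow concern (a made-up example, not from the benchmark; exact behavior depends on the Polars version, since summing an Int32 column may keep the Int32 dtype):

import polars as pl

# Two values near the Int32 max (~2.147e9); their sum exceeds the Int32 range.
df = pl.DataFrame({"v1": pl.Series([2_000_000_000, 2_000_000_000], dtype=pl.Int32)})

# Casting to Int64 before aggregating keeps the result correct.
result = df.select(pl.col("v1").cast(pl.Int64).sum())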
Also, I only have limited time to work on this benchmark, and it mostly goes toward maintenance and updating solutions. I don't have much time to go through every solution to ensure the setup for each system is exactly the same. I am happy to review PRs if they come up.