Add support for clickbench data and benchmark with page index

Open zhuqi-lucas opened this issue 6 months ago • 2 comments

Is your feature request related to a problem or challenge?

Currently, our ClickBench benchmark data doesn't have a page index. This ticket will add a data generator that writes the page index, and a separate benchmark that runs ClickBench with the page index.

Maybe we could also expose more custom options, such as page index, compression, and sort options, to generate new datasets based on the old ClickBench dataset?

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

zhuqi-lucas avatar Jun 17 '25 14:06 zhuqi-lucas

take

zhuqi-lucas avatar Jun 17 '25 14:06 zhuqi-lucas

  • https://github.com/apache/datafusion/issues/16200

Will depend on this ticket.

zhuqi-lucas avatar Jun 17 '25 14:06 zhuqi-lucas

Just a thought: do we need an artificial dataset to really highlight the problem / solution? I think it's unlikely to be measurable with a dataset that has 25 columns and 500 row groups, especially if we're talking about avoiding parsing but not even avoiding IO. My guess is if you make a dataset with 10k columns and 1000s of row groups we'll see a difference.
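To put rough numbers on that guess, here is a minimal back-of-the-envelope sketch. It assumes only that a Parquet file stores one column chunk (and so one metadata entry, plus one page index entry when the index is written) per column per row group. The function name is hypothetical, and "1000s of row groups" is taken as 1,000 as a lower bound.

```python
def column_chunk_count(num_columns: int, num_row_groups: int) -> int:
    """Each row group holds one column chunk per column, so the number of
    per-chunk metadata entries scales with columns * row groups."""
    return num_columns * num_row_groups


# Shape quoted in the comment above for the current dataset.
clickbench_like = column_chunk_count(num_columns=25, num_row_groups=500)

# Proposed artificial dataset; 1,000 row groups is an assumed lower bound.
synthetic = column_chunk_count(num_columns=10_000, num_row_groups=1_000)

# How many times more per-chunk entries there would be to parse.
ratio = synthetic // clickbench_like

print(f"current shape:   {clickbench_like:,} column chunks")
print(f"synthetic shape: {synthetic:,} column chunks")
print(f"ratio:           {ratio}x")
```

This only counts entries; it says nothing about actual parse time, which would still need to be measured.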

adriangb avatar Jun 23 '25 18:06 adriangb

Thank you @adriangb, this is a good point and I agree with you. I created this ticket because we can also use it to generate more customized data based on the current ClickBench data, and to support more options.

> Just a thought: do we need an artificial dataset to really highlight the problem / solution? I think it's unlikely to be measurable with a dataset that has 25 columns and 500 row groups, especially if we're talking about avoiding parsing but not even avoiding IO. My guess is if you make a dataset with 10k columns and 1000s of row groups we'll see a difference.

zhuqi-lucas avatar Jun 24 '25 04:06 zhuqi-lucas