datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Enable `split_file_groups_by_statistics` by default

Open alamb opened this issue 1 year ago • 3 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10313

In https://github.com/apache/datafusion/pull/9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap

This feature is behind a feature flag, split_file_groups_by_statistics which defaults to false as I think there needs to be some more tests in place before we turn it on

Describe the solution you'd like

Add additional tests and then enable split_file_groups_by_statistics by default

Describe alternatives you've considered

No response

Additional context

No response

alamb avatar May 01 '24 20:05 alamb

Example test coverage we should add I think: https://github.com/apache/datafusion/pull/9593#discussion_r1585517605

alamb avatar May 01 '24 20:05 alamb

I'd like to help it. 🙌

yyy1000 avatar May 04 '24 02:05 yyy1000

THank you @yyy1000 🙏

I think a good place to start would be to write some sqllogic level tests to cover the important cases

Perhaos for the first test:

  1. Create files: file1.parquet, file2.parquet both sorted on a but file 1 has the columns in the order a, b, c and file has the columns in the order c, b, a. The keyranges of values of a should be non overlapping
  2. Create an external table a, b, c with explicit order by a, and then query SELECT ... ORDER BY a and make sure the output plan doesn't use sort preserving merge

I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

cc @suremarc

alamb avatar May 04 '24 11:05 alamb

One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting execution.collect_statistics = true, which makes sense.

However for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded (ParquetScan with a single file group), which is a big performance regression. So it didn't really help me, maybe I actually want #10316.

The consequence to this issue being that turning this on by default would regress performance for users that have execution.collect_statistics = true. Maybe the flag should be merged with prefer_existing_sort, which has the semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just wait for #10316, so we can both avoid the sort and still have a parallel ParquetExec.

leoyvens avatar Jul 23 '24 18:07 leoyvens

Sorry for the delay @leoyvens and thank you for this analysis

https://github.com/apache/datafusion/issues/11170

I would personally love to take this approach

alamb avatar Jul 29 '24 16:07 alamb