Enable `split_file_groups_by_statistics` by default
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/10313
In https://github.com/apache/datafusion/pull/9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap
This feature is behind a feature flag, split_file_groups_by_statistics which defaults to false as I think there needs to be some more tests in place before we turn it on
Describe the solution you'd like
Add additional tests and then enable split_file_groups_by_statistics by default
Describe alternatives you've considered
No response
Additional context
No response
Example test coverage we should add I think: https://github.com/apache/datafusion/pull/9593#discussion_r1585517605
I'd like to help it. 🙌
THank you @yyy1000 🙏
I think a good place to start would be to write some sqllogic level tests to cover the important cases
Perhaos for the first test:
- Create files: file1.parquet, file2.parquet both sorted on
abut file 1 has the columns in the ordera, b, cand file has the columns in the orderc, b, a. The keyranges of values of a should be non overlapping - Create an external table
a, b, cwith explicit order bya,and then querySELECT ... ORDER BY aand make sure the output plan doesn't use sort preserving merge
I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt
cc @suremarc
One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting execution.collect_statistics = true, which makes sense.
However for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded (ParquetScan with a single file group), which is a big performance regression. So it didn't really help me, maybe I actually want #10316.
The consequence to this issue being that turning this on by default would regress performance for users that have execution.collect_statistics = true. Maybe the flag should be merged with prefer_existing_sort, which has the semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just wait for #10316, so we can both avoid the sort and still have a parallel ParquetExec.
Sorry for the delay @leoyvens and thank you for this analysis
https://github.com/apache/datafusion/issues/11170
I would personally love to take this approach