datafusion
datafusion copied to clipboard
feat: Determine ordering of file groups
Which issue does this PR close?
Closes #7490 .
Rationale for this change
See details in #7490 - this feature helps DataFusion eliminate sorts when files can be shown to be non-overlapping in terms of min/max statistics.
What changes are included in this PR?
- Add a new
FileScanConfig::sort_file_groups
method that distribute files via a bin packing algorithm, ensuring that no two files have overlapping statistics - Make
FileScanConfig::project
check if file groups are sorted when determining projected output orderings - Add a new internal
MinMaxStatistics
struct that uses the Arrow Row API to efficiently sort & compare file statistics.
Are these changes tested?
Yes - there is a unit test and a sqllogictest.
Are there any user-facing changes?
Yes - there is a new optional statistics
field in PartitionedFile
, which is part of the proposal in #7490.
There is also the new FileScanConfig::sort_file_groups
API
/benchmark
/benchmark
Benchmark results
Benchmarks comparing 3d130915b418931696dbf682106fd3c751677c91 (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 3d13091 and cca5f0f
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/3d13091/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist
/benchmark
@suremarc sorry for the noise, just trying to run the benchmark command!
Benchmark results
Benchmarks comparing 6e90f01c413a7b9178fbb0757c69e3d6754474be (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 6e90f01 and cca5f0f
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/6e90f01/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist
/benchmark
Benchmark results
Benchmarks comparing 015cfe864e3b813d05f8bf513a127696ba28ec7f (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 015cfe8 and cca5f0f
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query ┃ 015cfe8 ┃ cca5f0f ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1 │ 433.61ms │ 437.25ms │ no change │
│ QQuery 2 │ 58.77ms │ 58.63ms │ no change │
│ QQuery 3 │ 143.19ms │ 141.75ms │ no change │
│ QQuery 4 │ 87.73ms │ 90.93ms │ no change │
│ QQuery 5 │ 196.32ms │ 198.59ms │ no change │
│ QQuery 6 │ 102.49ms │ 138.43ms │ 1.35x slower │
│ QQuery 7 │ 280.24ms │ 295.44ms │ 1.05x slower │
│ QQuery 8 │ 199.24ms │ 191.96ms │ no change │
│ QQuery 9 │ 292.84ms │ 299.26ms │ no change │
│ QQuery 10 │ 231.49ms │ 236.37ms │ no change │
│ QQuery 11 │ 62.28ms │ 62.52ms │ no change │
│ QQuery 12 │ 127.02ms │ 127.75ms │ no change │
│ QQuery 13 │ 188.92ms │ 184.02ms │ no change │
│ QQuery 14 │ 128.87ms │ 127.64ms │ no change │
│ QQuery 15 │ 188.04ms │ 198.36ms │ 1.05x slower │
│ QQuery 16 │ 52.05ms │ 51.22ms │ no change │
│ QQuery 17 │ 301.30ms │ 297.40ms │ no change │
│ QQuery 18 │ 437.03ms │ 431.55ms │ no change │
│ QQuery 19 │ 227.47ms │ 228.29ms │ no change │
│ QQuery 20 │ 189.35ms │ 191.53ms │ no change │
│ QQuery 21 │ 323.42ms │ 327.57ms │ no change │
│ QQuery 22 │ 54.84ms │ 55.18ms │ no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (015cfe8) │ 4306.50ms │
│ Total Time (cca5f0f) │ 4371.63ms │
│ Average Time (015cfe8) │ 195.75ms │
│ Average Time (cca5f0f) │ 198.71ms │
│ Queries Faster │ 0 │
│ Queries Slower │ 3 │
│ Queries with No Change │ 19 │
└────────────────────────┴───────────┘
Running /benchmark once more to see if previous run was noise
Fwiw, I've noticed that it seem to have a minor systematic bias against the PR results in general (even if e.g. there are literally no code changes in it). This manifests like above, by reporting a couple of queries being marginally slower.
I think this will be alleviated once a dedicated runner is used, and/or when more benchmark types are included.
Benchmark results
Benchmarks comparing 015cfe864e3b813d05f8bf513a127696ba28ec7f (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 015cfe8 and cca5f0f
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ 015cfe8 ┃ cca5f0f ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 433.40ms │ 444.84ms │ no change │
│ QQuery 2 │ 58.23ms │ 59.08ms │ no change │
│ QQuery 3 │ 142.62ms │ 143.21ms │ no change │
│ QQuery 4 │ 88.02ms │ 87.36ms │ no change │
│ QQuery 5 │ 195.23ms │ 196.83ms │ no change │
│ QQuery 6 │ 103.54ms │ 103.05ms │ no change │
│ QQuery 7 │ 279.11ms │ 277.24ms │ no change │
│ QQuery 8 │ 200.55ms │ 198.16ms │ no change │
│ QQuery 9 │ 296.86ms │ 291.54ms │ no change │
│ QQuery 10 │ 234.67ms │ 323.08ms │ 1.38x slower │
│ QQuery 11 │ 63.09ms │ 89.66ms │ 1.42x slower │
│ QQuery 12 │ 125.98ms │ 127.29ms │ no change │
│ QQuery 13 │ 178.71ms │ 180.05ms │ no change │
│ QQuery 14 │ 127.44ms │ 129.93ms │ no change │
│ QQuery 15 │ 186.88ms │ 191.55ms │ no change │
│ QQuery 16 │ 52.93ms │ 49.94ms │ +1.06x faster │
│ QQuery 17 │ 310.79ms │ 291.79ms │ +1.07x faster │
│ QQuery 18 │ 442.30ms │ 447.07ms │ no change │
│ QQuery 19 │ 224.82ms │ 228.82ms │ no change │
│ QQuery 20 │ 189.15ms │ 187.59ms │ no change │
│ QQuery 21 │ 325.20ms │ 329.43ms │ no change │
│ QQuery 22 │ 53.58ms │ 55.04ms │ no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (015cfe8) │ 4313.09ms │
│ Total Time (cca5f0f) │ 4432.56ms │
│ Average Time (015cfe8) │ 196.05ms │
│ Average Time (cca5f0f) │ 201.48ms │
│ Queries Faster │ 2 │
│ Queries Slower │ 2 │
│ Queries with No Change │ 18 │
└────────────────────────┴───────────┘
Yeah seems noise, different queries running slower in second run.
I will review this either today or tomorrow
@alamb @NGA-TRAN
Thanks for the swift action - I guess I'll go ahead and mark this as ready for review since it's being reviewed, haha.
I'm still planning to add an end-to-end test with ListingTable
but just haven't figured out how to do it yet
Sorry I could not get to this today. I will try to review it tomorrow/later this week
I am reviewing this now
@NGA-TRAN I added a sqllogictest and fixed some bugs in the implementation. CI seems to be passing now
Thanks @suremarc . I am still in the middle of the review. Sorry it takes more time. So far the code looks really good
@alamb I won't be able to get to this until tomorrow, just letting you know
@alamb I won't be able to get to this until tomorrow, just letting you know
No worries - thanks for doing it @suremarc -- I am traveling this week so I won't have time to review until later this week.
Thank you so much
I think this PR is quite nice and it would be great if we could get the tests written and the code polished up.
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
@NGA-TRAN do you have time to review this PR as well?
No worries @suremarc -- I am very excited about this PR. I plan to review it sometime this week (hopefully later today)
I will review this either today or tomorrow
I'm not sure why changing a comment caused the tests to start failing.... oof.
Added API change label as I it adds a new field to PartitionedFile
@alamb I added a config value, and I moved MinMaxStatistics
to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.
@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.
Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets.
Filed https://github.com/apache/datafusion/issues/10336 to track enable this flag by default