datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: Determine ordering of file groups

Open suremarc opened this issue 11 months ago • 21 comments

Which issue does this PR close?

Closes #7490 .

Rationale for this change

See details in #7490 - this feature helps DataFusion eliminate sorts when files can be shown to be non-overlapping in terms of min/max statistics.

What changes are included in this PR?

  • Add a new FileScanConfig::sort_file_groups method that distribute files via a bin packing algorithm, ensuring that no two files have overlapping statistics
  • Make FileScanConfig::project check if file groups are sorted when determining projected output orderings
  • Add a new internal MinMaxStatistics struct that uses the Arrow Row API to efficiently sort & compare file statistics.

Are these changes tested?

Yes - there is a unit test and a sqllogictest.

Are there any user-facing changes?

Yes - there is a new optional statistics field in PartitionedFile, which is part of the proposal in #7490.

There is also the new FileScanConfig::sort_file_groups API

suremarc avatar Mar 13 '24 08:03 suremarc

/benchmark

Dandandan avatar Mar 15 '24 09:03 Dandandan

/benchmark

Dandandan avatar Mar 16 '24 01:03 Dandandan

Benchmark results

Benchmarks comparing 3d130915b418931696dbf682106fd3c751677c91 (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 3d13091 and cca5f0f
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/3d13091/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist

github-actions[bot] avatar Mar 16 '24 01:03 github-actions[bot]

/benchmark

Dandandan avatar Mar 16 '24 08:03 Dandandan

@suremarc sorry for the noise, just trying to run the benchmark command!

Dandandan avatar Mar 16 '24 08:03 Dandandan

Benchmark results

Benchmarks comparing 6e90f01c413a7b9178fbb0757c69e3d6754474be (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 6e90f01 and cca5f0f
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/6e90f01/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist

github-actions[bot] avatar Mar 16 '24 09:03 github-actions[bot]

/benchmark

Dandandan avatar Mar 16 '24 15:03 Dandandan

Benchmark results

Benchmarks comparing 015cfe864e3b813d05f8bf513a127696ba28ec7f (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 015cfe8 and cca5f0f
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  015cfe8 ┃  cca5f0f ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 433.61ms │ 437.25ms │    no change │
│ QQuery 2     │  58.77ms │  58.63ms │    no change │
│ QQuery 3     │ 143.19ms │ 141.75ms │    no change │
│ QQuery 4     │  87.73ms │  90.93ms │    no change │
│ QQuery 5     │ 196.32ms │ 198.59ms │    no change │
│ QQuery 6     │ 102.49ms │ 138.43ms │ 1.35x slower │
│ QQuery 7     │ 280.24ms │ 295.44ms │ 1.05x slower │
│ QQuery 8     │ 199.24ms │ 191.96ms │    no change │
│ QQuery 9     │ 292.84ms │ 299.26ms │    no change │
│ QQuery 10    │ 231.49ms │ 236.37ms │    no change │
│ QQuery 11    │  62.28ms │  62.52ms │    no change │
│ QQuery 12    │ 127.02ms │ 127.75ms │    no change │
│ QQuery 13    │ 188.92ms │ 184.02ms │    no change │
│ QQuery 14    │ 128.87ms │ 127.64ms │    no change │
│ QQuery 15    │ 188.04ms │ 198.36ms │ 1.05x slower │
│ QQuery 16    │  52.05ms │  51.22ms │    no change │
│ QQuery 17    │ 301.30ms │ 297.40ms │    no change │
│ QQuery 18    │ 437.03ms │ 431.55ms │    no change │
│ QQuery 19    │ 227.47ms │ 228.29ms │    no change │
│ QQuery 20    │ 189.35ms │ 191.53ms │    no change │
│ QQuery 21    │ 323.42ms │ 327.57ms │    no change │
│ QQuery 22    │  54.84ms │  55.18ms │    no change │
└──────────────┴──────────┴──────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (015cfe8)   │ 4306.50ms │
│ Total Time (cca5f0f)   │ 4371.63ms │
│ Average Time (015cfe8) │  195.75ms │
│ Average Time (cca5f0f) │  198.71ms │
│ Queries Faster         │         0 │
│ Queries Slower         │         3 │
│ Queries with No Change │        19 │
└────────────────────────┴───────────┘

github-actions[bot] avatar Mar 16 '24 15:03 github-actions[bot]

Running /benchmark once more to see if previous run was noise

Dandandan avatar Mar 16 '24 15:03 Dandandan

Fwiw, I've noticed that it seem to have a minor systematic bias against the PR results in general (even if e.g. there are literally no code changes in it). This manifests like above, by reporting a couple of queries being marginally slower.

I think this will be alleviated once a dedicated runner is used, and/or when more benchmark types are included.

gruuya avatar Mar 16 '24 15:03 gruuya

Benchmark results

Benchmarks comparing 015cfe864e3b813d05f8bf513a127696ba28ec7f (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR)
Comparing 015cfe8 and cca5f0f
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  015cfe8 ┃  cca5f0f ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 433.40ms │ 444.84ms │     no change │
│ QQuery 2     │  58.23ms │  59.08ms │     no change │
│ QQuery 3     │ 142.62ms │ 143.21ms │     no change │
│ QQuery 4     │  88.02ms │  87.36ms │     no change │
│ QQuery 5     │ 195.23ms │ 196.83ms │     no change │
│ QQuery 6     │ 103.54ms │ 103.05ms │     no change │
│ QQuery 7     │ 279.11ms │ 277.24ms │     no change │
│ QQuery 8     │ 200.55ms │ 198.16ms │     no change │
│ QQuery 9     │ 296.86ms │ 291.54ms │     no change │
│ QQuery 10    │ 234.67ms │ 323.08ms │  1.38x slower │
│ QQuery 11    │  63.09ms │  89.66ms │  1.42x slower │
│ QQuery 12    │ 125.98ms │ 127.29ms │     no change │
│ QQuery 13    │ 178.71ms │ 180.05ms │     no change │
│ QQuery 14    │ 127.44ms │ 129.93ms │     no change │
│ QQuery 15    │ 186.88ms │ 191.55ms │     no change │
│ QQuery 16    │  52.93ms │  49.94ms │ +1.06x faster │
│ QQuery 17    │ 310.79ms │ 291.79ms │ +1.07x faster │
│ QQuery 18    │ 442.30ms │ 447.07ms │     no change │
│ QQuery 19    │ 224.82ms │ 228.82ms │     no change │
│ QQuery 20    │ 189.15ms │ 187.59ms │     no change │
│ QQuery 21    │ 325.20ms │ 329.43ms │     no change │
│ QQuery 22    │  53.58ms │  55.04ms │     no change │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (015cfe8)   │ 4313.09ms │
│ Total Time (cca5f0f)   │ 4432.56ms │
│ Average Time (015cfe8) │  196.05ms │
│ Average Time (cca5f0f) │  201.48ms │
│ Queries Faster         │         2 │
│ Queries Slower         │         2 │
│ Queries with No Change │        18 │
└────────────────────────┴───────────┘

github-actions[bot] avatar Mar 16 '24 15:03 github-actions[bot]

Yeah seems noise, different queries running slower in second run.

Dandandan avatar Mar 16 '24 16:03 Dandandan

I will review this either today or tomorrow

NGA-TRAN avatar Mar 18 '24 19:03 NGA-TRAN

@alamb @NGA-TRAN

Thanks for the swift action - I guess I'll go ahead and mark this as ready for review since it's being reviewed, haha.

I'm still planning to add an end-to-end test with ListingTable but just haven't figured out how to do it yet

suremarc avatar Mar 18 '24 20:03 suremarc

Sorry I could not get to this today. I will try to review it tomorrow/later this week

NGA-TRAN avatar Mar 19 '24 22:03 NGA-TRAN

I am reviewing this now

NGA-TRAN avatar Mar 20 '24 21:03 NGA-TRAN

@NGA-TRAN I added a sqllogictest and fixed some bugs in the implementation. CI seems to be passing now

suremarc avatar Mar 21 '24 14:03 suremarc

Thanks @suremarc . I am still in the middle of the review. Sorry it takes more time. So far the code looks really good

NGA-TRAN avatar Mar 21 '24 15:03 NGA-TRAN

@alamb I won't be able to get to this until tomorrow, just letting you know

suremarc avatar Mar 26 '24 15:03 suremarc

@alamb I won't be able to get to this until tomorrow, just letting you know

No worries - thanks for doing it @suremarc -- I am traveling this week so I won't have time to review until later this week.

Thank you so much

alamb avatar Mar 27 '24 03:03 alamb

I think this PR is quite nice and it would be great if we could get the tests written and the code polished up.

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

alamb avatar Mar 30 '24 20:03 alamb

@NGA-TRAN do you have time to review this PR as well?

alamb avatar Apr 29 '24 15:04 alamb

No worries @suremarc -- I am very excited about this PR. I plan to review it sometime this week (hopefully later today)

alamb avatar Apr 29 '24 15:04 alamb

I will review this either today or tomorrow

NGA-TRAN avatar Apr 29 '24 16:04 NGA-TRAN

I'm not sure why changing a comment caused the tests to start failing.... oof.

suremarc avatar Apr 30 '24 20:04 suremarc

Added API change label as I it adds a new field to PartitionedFile

alamb avatar Apr 30 '24 20:04 alamb

@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.

suremarc avatar May 01 '24 15:05 suremarc

@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.

Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets.

alamb avatar May 01 '24 20:05 alamb

Filed https://github.com/apache/datafusion/issues/10336 to track enable this flag by default

alamb avatar May 01 '24 20:05 alamb