datafusion Prune files during streams and avoid additional pruning if there are no dynamic filters

https://github.com/apache/datafusion/pull/16014#issuecomment-2977125894

Jun 16 '25 16:06 adriangb

cc @alamb I think this resolves the concern about the perf overhead of this late pruning when there are no dynamic filters. When there are dynamic filters it's a toss-up: for a TopK with large files it's clearly a win, but there could be cases where the additional checks are pure overhead because they don't result in early termination of the streams.

Jun 16 '25 16:06 adriangb

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing prune-rg (936e039c84190e7345a5b4cff25d5e043c7b18d6) to dd936cb1b25cb685e0e146f297950eb00048c64c diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

Jun 16 '25 21:06 alamb

Do we expect the benchmarks to show anything? I don't think they're using dynamic filters right? Maybe we need to merge https://github.com/apache/datafusion/pull/15770 and then we can benchmark this?

Jun 16 '25 21:06 adriangb

🤖: Benchmark completed

Comparing HEAD and prune-rg
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃    prune-rg ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1879.99 ms │  1939.08 ms │ no change │
│ QQuery 1     │   693.65 ms │   708.16 ms │ no change │
│ QQuery 2     │  1355.42 ms │  1393.13 ms │ no change │
│ QQuery 3     │   669.90 ms │   672.00 ms │ no change │
│ QQuery 4     │  1337.47 ms │  1363.59 ms │ no change │
│ QQuery 5     │ 15038.90 ms │ 15112.89 ms │ no change │
│ QQuery 6     │  1986.81 ms │  1965.90 ms │ no change │
│ QQuery 7     │  1929.58 ms │  1936.37 ms │ no change │
│ QQuery 8     │   799.31 ms │   798.86 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 25691.04ms │
│ Total Time (prune-rg)   │ 25889.98ms │
│ Average Time (HEAD)     │  2854.56ms │
│ Average Time (prune-rg) │  2876.66ms │
│ Queries Faster          │          0 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          9 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃    prune-rg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    16.59 ms │    15.61 ms │ +1.06x faster │
│ QQuery 1     │    33.33 ms │    32.71 ms │     no change │
│ QQuery 2     │    80.97 ms │    79.03 ms │     no change │
│ QQuery 3     │    98.84 ms │   101.19 ms │     no change │
│ QQuery 4     │   589.88 ms │   617.40 ms │     no change │
│ QQuery 5     │   822.41 ms │   848.28 ms │     no change │
│ QQuery 6     │    23.72 ms │    23.39 ms │     no change │
│ QQuery 7     │    36.21 ms │    35.69 ms │     no change │
│ QQuery 8     │   857.78 ms │   867.88 ms │     no change │
│ QQuery 9     │  1165.86 ms │  1172.75 ms │     no change │
│ QQuery 10    │   252.83 ms │   253.69 ms │     no change │
│ QQuery 11    │   282.67 ms │   281.46 ms │     no change │
│ QQuery 12    │   854.98 ms │   846.96 ms │     no change │
│ QQuery 13    │  1257.05 ms │  1266.75 ms │     no change │
│ QQuery 14    │   801.95 ms │   785.82 ms │     no change │
│ QQuery 15    │   777.20 ms │   764.66 ms │     no change │
│ QQuery 16    │  1633.65 ms │  1595.44 ms │     no change │
│ QQuery 17    │  1595.53 ms │  1582.06 ms │     no change │
│ QQuery 18    │  2896.26 ms │  2864.05 ms │     no change │
│ QQuery 19    │    86.57 ms │    84.57 ms │     no change │
│ QQuery 20    │  1119.10 ms │  1094.90 ms │     no change │
│ QQuery 21    │  1243.84 ms │  1271.98 ms │     no change │
│ QQuery 22    │  2064.74 ms │  2070.37 ms │     no change │
│ QQuery 23    │  7537.75 ms │  7537.46 ms │     no change │
│ QQuery 24    │   446.88 ms │   445.02 ms │     no change │
│ QQuery 25    │   367.87 ms │   374.34 ms │     no change │
│ QQuery 26    │   503.34 ms │   506.30 ms │     no change │
│ QQuery 27    │  1481.94 ms │  1504.55 ms │     no change │
│ QQuery 28    │ 11763.16 ms │ 11902.61 ms │     no change │
│ QQuery 29    │   525.71 ms │   532.18 ms │     no change │
│ QQuery 30    │   752.63 ms │   753.79 ms │     no change │
│ QQuery 31    │   801.83 ms │   796.85 ms │     no change │
│ QQuery 32    │  2494.56 ms │  2480.09 ms │     no change │
│ QQuery 33    │  3143.19 ms │  3172.92 ms │     no change │
│ QQuery 34    │  3150.66 ms │  3179.61 ms │     no change │
│ QQuery 35    │  1225.93 ms │  1238.68 ms │     no change │
│ QQuery 36    │   123.81 ms │   124.66 ms │     no change │
│ QQuery 37    │    55.59 ms │    55.17 ms │     no change │
│ QQuery 38    │   121.16 ms │   124.64 ms │     no change │
│ QQuery 39    │   195.41 ms │   195.87 ms │     no change │
│ QQuery 40    │    46.69 ms │    48.51 ms │     no change │
│ QQuery 41    │    44.18 ms │    43.08 ms │     no change │
│ QQuery 42    │    39.09 ms │    38.68 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 53413.35ms │
│ Total Time (prune-rg)   │ 53611.67ms │
│ Average Time (HEAD)     │  1242.17ms │
│ Average Time (prune-rg) │  1246.78ms │
│ Queries Faster          │          1 │
│ Queries Slower          │          0 │
│ Queries with No Change  │         42 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃  prune-rg ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 100.52 ms │ 100.26 ms │    no change │
│ QQuery 2     │  21.21 ms │  21.35 ms │    no change │
│ QQuery 3     │  32.52 ms │  32.87 ms │    no change │
│ QQuery 4     │  19.08 ms │  18.72 ms │    no change │
│ QQuery 5     │  51.07 ms │  50.15 ms │    no change │
│ QQuery 6     │  11.90 ms │  12.19 ms │    no change │
│ QQuery 7     │  85.38 ms │  89.71 ms │ 1.05x slower │
│ QQuery 8     │  24.32 ms │  25.05 ms │    no change │
│ QQuery 9     │  53.59 ms │  54.12 ms │    no change │
│ QQuery 10    │  43.80 ms │  43.31 ms │    no change │
│ QQuery 11    │  11.57 ms │  11.31 ms │    no change │
│ QQuery 12    │  35.33 ms │  34.53 ms │    no change │
│ QQuery 13    │  25.59 ms │  26.29 ms │    no change │
│ QQuery 14    │   9.80 ms │   9.68 ms │    no change │
│ QQuery 15    │  18.63 ms │  19.74 ms │ 1.06x slower │
│ QQuery 16    │  19.16 ms │  18.61 ms │    no change │
│ QQuery 17    │  97.35 ms │  96.99 ms │    no change │
│ QQuery 18    │ 205.89 ms │ 200.40 ms │    no change │
│ QQuery 19    │  27.10 ms │  26.81 ms │    no change │
│ QQuery 20    │  32.14 ms │  32.06 ms │    no change │
│ QQuery 21    │ 152.12 ms │ 148.42 ms │    no change │
│ QQuery 22    │  15.22 ms │  15.37 ms │    no change │
└──────────────┴───────────┴───────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 1093.28ms │
│ Total Time (prune-rg)   │ 1087.91ms │
│ Average Time (HEAD)     │   49.69ms │
│ Average Time (prune-rg) │   49.45ms │
│ Queries Faster          │         0 │
│ Queries Slower          │         2 │
│ Queries with No Change  │        20 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

Jun 16 '25 22:06 alamb

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing prune-rg (936e039c84190e7345a5b4cff25d5e043c7b18d6) to dd936cb1b25cb685e0e146f297950eb00048c64c diff
Benchmarks: clickbench_1
Results will be posted here when complete

Jun 16 '25 22:06 alamb

🤖: Benchmark completed

Comparing HEAD and prune-rg
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃    prune-rg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    48.55 ms │    48.75 ms │     no change │
│ QQuery 1     │    74.22 ms │    74.11 ms │     no change │
│ QQuery 2     │   109.42 ms │   109.88 ms │     no change │
│ QQuery 3     │   129.53 ms │   122.61 ms │ +1.06x faster │
│ QQuery 4     │   627.55 ms │   625.42 ms │     no change │
│ QQuery 5     │   849.86 ms │   849.16 ms │     no change │
│ QQuery 6     │    57.05 ms │    56.90 ms │     no change │
│ QQuery 7     │    80.49 ms │    82.69 ms │     no change │
│ QQuery 8     │   879.81 ms │   876.39 ms │     no change │
│ QQuery 9     │  1165.56 ms │  1167.48 ms │     no change │
│ QQuery 10    │   291.55 ms │   293.01 ms │     no change │
│ QQuery 11    │   318.77 ms │   322.61 ms │     no change │
│ QQuery 12    │   854.77 ms │   844.91 ms │     no change │
│ QQuery 13    │  1228.41 ms │  1205.05 ms │     no change │
│ QQuery 14    │   795.93 ms │   780.73 ms │     no change │
│ QQuery 15    │   809.83 ms │   797.08 ms │     no change │
│ QQuery 16    │  1624.25 ms │  1631.25 ms │     no change │
│ QQuery 17    │  1610.43 ms │  1592.79 ms │     no change │
│ QQuery 18    │  2880.16 ms │  2972.94 ms │     no change │
│ QQuery 19    │   126.17 ms │   122.33 ms │     no change │
│ QQuery 20    │  1168.29 ms │  1145.79 ms │     no change │
│ QQuery 21    │  1332.98 ms │  1326.28 ms │     no change │
│ QQuery 22    │  2301.13 ms │  2296.25 ms │     no change │
│ QQuery 23    │  7739.93 ms │  7786.37 ms │     no change │
│ QQuery 24    │   480.87 ms │   468.63 ms │     no change │
│ QQuery 25    │   407.47 ms │   407.81 ms │     no change │
│ QQuery 26    │   538.00 ms │   537.92 ms │     no change │
│ QQuery 27    │  1622.71 ms │  1634.71 ms │     no change │
│ QQuery 28    │ 12496.80 ms │ 12414.28 ms │     no change │
│ QQuery 29    │   555.02 ms │   572.82 ms │     no change │
│ QQuery 30    │   778.06 ms │   776.41 ms │     no change │
│ QQuery 31    │   851.43 ms │   834.85 ms │     no change │
│ QQuery 32    │  2531.83 ms │  2507.16 ms │     no change │
│ QQuery 33    │  3255.20 ms │  3232.13 ms │     no change │
│ QQuery 34    │  3300.27 ms │  3271.05 ms │     no change │
│ QQuery 35    │  1250.13 ms │  1217.49 ms │     no change │
│ QQuery 36    │   173.22 ms │   169.50 ms │     no change │
│ QQuery 37    │   101.32 ms │   101.13 ms │     no change │
│ QQuery 38    │   170.56 ms │   167.80 ms │     no change │
│ QQuery 39    │   251.74 ms │   251.91 ms │     no change │
│ QQuery 40    │    87.40 ms │    89.54 ms │     no change │
│ QQuery 41    │    86.72 ms │    84.18 ms │     no change │
│ QQuery 42    │    75.34 ms │    77.39 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 56118.72ms │
│ Total Time (prune-rg)   │ 55947.49ms │
│ Average Time (HEAD)     │  1305.09ms │
│ Average Time (prune-rg) │  1301.10ms │
│ Queries Faster          │          1 │
│ Queries Slower          │          0 │
│ Queries with No Change  │         42 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘

Jun 16 '25 22:06 alamb

> Do we expect the benchmarks to show anything? I don't think they're using dynamic filters right? Maybe we need to merge #15770 and then we can benchmark this?

I want to make sure the overhead of checking the predicates on each incoming batch didn't slow things down

Jun 17 '25 11:06 alamb

> > Do we expect the benchmarks to show anything? I don't think they're using dynamic filters right? Maybe we need to merge #15770 and then we can benchmark this?
>
> I want to make sure the overhead of checking the predicates on each incoming batch didn't slow things down

If you check the code, that only happens if there are dynamic filters. And since there are none right now, it becomes just an if let Some(file_pruner) = file_pruner.as_ref() check, which is going to be too cheap to show up in benchmarks.

The only way to actually verify will be to merge https://github.com/apache/datafusion/pull/15770 and then compare this PR to main.
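
To illustrate why that check is expected to be cheap, here is a minimal sketch. It is not the PR's code: `FilePruner`, its fields, and `can_stop_early` are hypothetical stand-ins, and the filter is assumed to have the simple form `col > threshold`. The point is only that with no dynamic filter the per-batch cost is a single branch on an `Option`.

```rust
/// Hypothetical stand-in for the per-file pruner discussed in this PR.
pub struct FilePruner {
    /// Largest value of the filtered column, from file-level statistics.
    pub file_max: i64,
    /// Current threshold of an assumed dynamic filter `col > threshold`.
    pub threshold: i64,
}

impl FilePruner {
    /// Returns true if no row in the file can satisfy `col > threshold`.
    pub fn should_prune(&self) -> bool {
        self.file_max <= self.threshold
    }
}

/// Called once per incoming batch; returns true if the stream can stop early.
pub fn can_stop_early(file_pruner: &Option<FilePruner>) -> bool {
    // Without dynamic filters `file_pruner` is `None`, so this whole call
    // reduces to one branch on the Option discriminant.
    if let Some(file_pruner) = file_pruner.as_ref() {
        return file_pruner.should_prune();
    }
    false
}
```

With `None` the function always returns `false`, so a scan without dynamic filters behaves exactly as before apart from that branch.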

Jun 17 '25 11:06 adriangb

@adriangb I'll review tomorrow; today I have some other things to do.

Jun 17 '25 12:06 xudong963

@alamb sorry for the ping but would you mind running topk_tpch on here?

Jun 17 '25 18:06 adriangb

> @alamb sorry for the ping but would you mind running topk_tpch on here?

LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)

Jun 17 '25 19:06 alamb

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing prune-rg (54b3bbf3f3a3c96162b7fb95a70f9a2657dbc7d3) to 1429c92474238a91a09f1cd4a68c19d03329b6a7 diff
Benchmarks: topk_tpch
Results will be posted here when complete

Jun 17 '25 19:06 alamb

> > @alamb sorry for the ping but would you mind running topk_tpch on here?
>
> LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)

I was reading that Arrow has requested AWS credits https://lists.apache.org/thread/q33oofy2v3zpg9s9l8o0w68rmjr3ocsv . Perhaps we can utilize one of those for that use case.

Jun 17 '25 19:06 Dandandan

🤖: Benchmark completed

Comparing HEAD and prune-rg
--------------------
Benchmark run_topk_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃  prune-rg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  26.17 ms │  33.18 ms │  1.27x slower │
│ Q2           │  38.44 ms │  34.00 ms │ +1.13x faster │
│ Q3           │  97.05 ms │ 101.20 ms │     no change │
│ Q4           │  36.71 ms │  40.95 ms │  1.12x slower │
│ Q5           │  25.59 ms │  32.41 ms │  1.27x slower │
│ Q6           │  54.01 ms │  54.31 ms │     no change │
│ Q7           │ 146.60 ms │ 137.02 ms │ +1.07x faster │
│ Q8           │  79.27 ms │  88.55 ms │  1.12x slower │
│ Q9           │ 102.21 ms │ 112.97 ms │  1.11x slower │
│ Q10          │ 174.11 ms │ 188.49 ms │  1.08x slower │
│ Q11          │ 103.82 ms │  91.26 ms │ +1.14x faster │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary       ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (HEAD)       │ 883.98ms │
│ Total Time (prune-rg)   │ 914.34ms │
│ Average Time (HEAD)     │  80.36ms │
│ Average Time (prune-rg) │  83.12ms │
│ Queries Faster          │        3 │
│ Queries Slower          │        6 │
│ Queries with No Change  │        2 │
│ Queries with Failure    │        0 │
└─────────────────────────┴──────────┘

Jun 17 '25 19:06 alamb

> > > @alamb sorry for the ping but would you mind running topk_tpch on here?
> >
> > LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)
>
> I was reading that Arrow has requested / received AWS credits https://lists.apache.org/thread/q33oofy2v3zpg9s9l8o0w68rmjr3ocsv . Perhaps we can utilize one of those for that use case.

I tried to ask GCS for credits... they didn't seem excited and ultimately came up with nothing.

Jun 17 '25 20:06 adriangb

> 🤖: Benchmark completed

Interesting results. I'm inclined to believe that the speedups and slowdowns are both real. We'll have to think about this a bit more.

Jun 17 '25 20:06 adriangb

@Dandandan @alamb I pushed ebe4196 which adds a very cheap way to track changes to a PhysicalExpr if it's dynamic. I think this will be useful in several places but immediately it gives us the ability to check if the dynamic predicate has been updated before doing the work of re-calculating the pruning predicate, etc.

I'm still not sure it will be cheap enough, but I think it's worth a shot if we can re-run the benches.

It'll be a shame if we can't figure this out. If we are able to get this working, I think it mostly negates the unfortunate situation right now where a TopK query might be faster with less parallelism / partitioning upfront. With this change you still open the files but are able to quickly bail out as opposed to having to stream the whole thing.
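
As a rough illustration of the generation idea (not the code in ebe4196; `DynamicFilter`, `CachedPruner`, and the `col > threshold` filter shape are all assumptions made for this sketch), the reader can remember the last generation it saw and skip the pruning work when nothing changed:

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical dynamic filter `col > threshold`, shared between a TopK
/// operator (the writer) and file scans (the readers).
pub struct DynamicFilter {
    threshold: AtomicI64,
    generation: AtomicU64,
}

impl DynamicFilter {
    pub fn new(threshold: i64) -> Self {
        Self {
            threshold: AtomicI64::new(threshold),
            generation: AtomicU64::new(0),
        }
    }

    /// Replace the threshold and bump the generation so readers notice.
    pub fn update(&self, threshold: i64) {
        self.threshold.store(threshold, Ordering::Release);
        self.generation.fetch_add(1, Ordering::Release);
    }

    pub fn generation(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    pub fn threshold(&self) -> i64 {
        self.threshold.load(Ordering::Acquire)
    }
}

/// Per-file state that redoes the pruning check only when the filter changed.
pub struct CachedPruner {
    filter: Arc<DynamicFilter>,
    file_max: i64,
    last_generation: Option<u64>,
    last_result: bool,
    /// Counts how often the (potentially expensive) check actually ran.
    pub recomputations: usize,
}

impl CachedPruner {
    pub fn new(filter: Arc<DynamicFilter>, file_max: i64) -> Self {
        Self {
            filter,
            file_max,
            last_generation: None,
            last_result: false,
            recomputations: 0,
        }
    }

    pub fn should_prune(&mut self) -> bool {
        let current = self.filter.generation();
        // Comparing two integers is the cheap path; rebuilding the pruning
        // predicate is only needed when the generation has moved.
        if self.last_generation != Some(current) {
            self.last_result = self.file_max <= self.filter.threshold();
            self.last_generation = Some(current);
            self.recomputations += 1;
        }
        self.last_result
    }
}
```

In this sketch repeated calls between filter updates cost one atomic load and one comparison, which is the property the generation tracking is meant to provide.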

Jun 17 '25 21:06 adriangb

I think this will require @Dandandan 's suggestion of only updating the filters if the new ones are more selective: #16433.

Right now since we always update the filters -> it always bumps the generation -> we always re-check.
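
A small sketch of what "only update if more selective" could look like, assuming a filter of the form `col > threshold` so that a larger threshold is stricter. `TopKThreshold` and its methods are hypothetical names for illustration and are not taken from #16433:

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical threshold for an assumed dynamic filter `col > threshold`.
pub struct TopKThreshold {
    threshold: AtomicI64,
    generation: AtomicU64,
}

impl TopKThreshold {
    pub fn new(initial: i64) -> Self {
        Self {
            threshold: AtomicI64::new(initial),
            generation: AtomicU64::new(0),
        }
    }

    /// Tighten the filter only if `candidate` is stricter than the current
    /// threshold. Returns true if the filter changed.
    pub fn update_if_more_selective(&self, candidate: i64) -> bool {
        // `fetch_max` stores max(current, candidate) and returns the old value.
        let previous = self.threshold.fetch_max(candidate, Ordering::AcqRel);
        if candidate > previous {
            // Readers only need to re-check when the filter really tightened.
            self.generation.fetch_add(1, Ordering::Release);
            true
        } else {
            false
        }
    }

    pub fn generation(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    pub fn threshold(&self) -> i64 {
        self.threshold.load(Ordering::Acquire)
    }
}
```

Updates that do not tighten the filter leave the generation untouched, so readers that compare generations would not re-check for them.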

Jun 17 '25 23:06 adriangb

@alamb I reverted the filtering during the stream so this should now do strictly less work 😄

Jun 19 '25 20:06 adriangb

Also thank you @xudong963

Jun 24 '25 19:06 alamb

@alamb I added test assertions to confirm the stats are working correctly which addresses https://github.com/apache/datafusion/issues/16402

Jun 25 '25 15:06 adriangb

@xudong963 @alamb I've re-organized this to incorporate https://github.com/apache/datafusion/pull/16549.

Sadly I did not catch in that PR that we were putting everything in lib.rs, which I feel would become too bloated if I also put FilePruner in there. So I moved PruningPredicate & co to pruning_predicate.rs, hence the huge diff line count.

I'll also point out that this now only does the extra work if it has either a dynamic filter OR the file has statistics already collected.

Jun 26 '25 00:06 adriangb

Let's try and get this merged soon to avoid conflicts as much as possible

Jun 26 '25 15:06 alamb

Agreed. I'm struggling with the 3 failing tests because they fail in CI but I can't get them to fail locally...

Jun 26 '25 15:06 adriangb

Something else we could potentially do is refactor the pruning predicate into its own module as a separate PR, so it would be easier to tell the mechanical changes from the algorithmic ones.

Jun 26 '25 18:06 alamb

I feel like it must be a dumb mistake. Give me a bit of time to try to sort it out please. I'll deal with any conflicts.

Jun 26 '25 19:06 adriangb

@alamb I've found the issue! The files_pruned_statistics metric is not actually the number of files: it is the number of times FileOpener::open was called, which may be >1 per file if the file is split up into multiple ranges. That happens if the number of partitions > number of files, so it varies based on the number of CPUs.

Options are:

1. Rename the metric to reflect that it's actually pruning file opens, not files, something like file_opens_pruned_statistics or file_ranges_pruned_statistics.
2. Try to figure out a way to track the actual files pruned. I think this may be a dead end because e.g. a file may be half pruned (one range is scanned, one range is pruned).
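
To make the opens-versus-files distinction concrete, here is a simplified model. It is not DataFusion's real file-splitting algorithm: `FileRange`, `split_into_ranges`, and `count_opens` are hypothetical, and files are simply cut by bytes into equal shares per partition. Each range stands for one `FileOpener::open` call.

```rust
/// Hypothetical model: a byte range of one file assigned to one partition.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRange {
    pub file: usize,
    pub start: u64,
    pub end: u64,
}

/// Spread the files' bytes evenly over `target_partitions`, cutting files
/// into ranges where needed. A simplified stand-in for real repartitioning.
pub fn split_into_ranges(file_sizes: &[u64], target_partitions: usize) -> Vec<Vec<FileRange>> {
    let n = target_partitions.max(1) as u64;
    let total: u64 = file_sizes.iter().sum();
    // Bytes per partition, rounded up, and at least 1 to guarantee progress.
    let per_partition = ((total + n - 1) / n).max(1);

    let mut partitions = vec![Vec::new()];
    let mut room = per_partition;
    for (file, &size) in file_sizes.iter().enumerate() {
        let mut start = 0;
        while start < size {
            if room == 0 {
                // Current partition is full; start the next one.
                partitions.push(Vec::new());
                room = per_partition;
            }
            let len = room.min(size - start);
            partitions
                .last_mut()
                .unwrap()
                .push(FileRange { file, start, end: start + len });
            start += len;
            room -= len;
        }
    }
    partitions
}

/// Each range corresponds to one open call in this model.
pub fn count_opens(partitions: &[Vec<FileRange>]) -> usize {
    partitions.iter().map(|p| p.len()).sum()
}
```

In this model a single 100-byte file yields 1 open with one partition but 4 opens with four partitions, which is the kind of CPU-count dependence described above.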

Jun 27 '25 14:06 adriangb

What I've done for now is set the target partitions to 1. I think that's reasonable for these tests in general. I opened https://github.com/apache/datafusion/issues/16586 to track renaming the metric.

Jun 27 '25 15:06 adriangb

@alamb to make this PR easier to review I opened https://github.com/apache/datafusion/pull/16587 which absorbs most of the diff. Once we merge that I'll rebase this and we can even discuss the metric rename here (the diff will be much more readable).

Jun 27 '25 15:06 adriangb

Rebased and I renamed the metric + added documentation such that this now closes #16586

Jun 27 '25 17:06 adriangb