feat: RewriteCycle API for short-circuiting optimizer loops
Which issue does this PR close?
Closes https://github.com/apache/datafusion/issues/1160.
Rationale for this change
This is a follow-up to https://github.com/apache/datafusion/pull/10358 with a new approach that should short-circuit earlier; see the previous discussion there.
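To make the idea concrete, here is a minimal, self-contained sketch of the short-circuiting concept. The function names, the two toy rewriters, and the iteration cap are purely illustrative and are not the API added in this PR; the point is only that instead of always running the maximum number of simplifier iterations, we stop as soon as a full cycle of rewrites makes no change.

```rust
// Illustrative only: toy stand-ins for the simplifier's rewriters
// (e.g. constant evaluation and expression simplification).
fn const_eval(e: &str) -> String {
    e.replace("1 + 1", "2")
}

fn simplify(e: &str) -> String {
    e.replace("x * 1", "x")
}

const MAX_CYCLES: usize = 3; // stands in for a fixed iteration cap

fn run(mut expr: String) -> String {
    // Count consecutive rewrites that changed nothing; once the streak covers a
    // whole cycle (2 rewriters here), the expression is a fixed point.
    let mut unchanged_streak = 0;

    for _ in 0..MAX_CYCLES {
        let next = const_eval(&expr);
        unchanged_streak = if next == expr { unchanged_streak + 1 } else { 0 };
        expr = next;
        if unchanged_streak >= 2 {
            return expr; // short-circuit: a full cycle made no progress
        }

        let next = simplify(&expr);
        unchanged_streak = if next == expr { unchanged_streak + 1 } else { 0 };
        expr = next;
        if unchanged_streak >= 2 {
            return expr;
        }
    }
    expr
}

fn main() {
    // Stops during the second cycle, once both rewriters report no change.
    assert_eq!(run("x * 1 + (1 + 1)".to_string()), "x + (2)");
}
```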
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?
no
Benchmark results between the two PRs and main
Here are my measurements on my GCP machine (it does seem to help by 1-2%).
Details
++ critcmp main loop-expr-simplifier-static-dispatch
group loop-expr-simplifier-static-dispatch main
----- ------------------------------------ ----
logical_aggregate_with_join 1.00 1211.4±14.74µs ? ?/sec 1.00 1214.7±64.80µs ? ?/sec
logical_plan_tpcds_all 1.00 156.4±1.31ms ? ?/sec 1.01 158.4±2.20ms ? ?/sec
logical_plan_tpch_all 1.00 16.9±0.19ms ? ?/sec 1.00 16.9±0.17ms ? ?/sec
logical_select_all_from_1000 1.00 18.6±0.17ms ? ?/sec 1.01 18.8±0.13ms ? ?/sec
logical_select_one_from_700 1.00 814.0±10.94µs ? ?/sec 1.00 815.5±32.65µs ? ?/sec
logical_trivial_join_high_numbered_columns 1.00 758.0±10.12µs ? ?/sec 1.00 759.1±19.82µs ? ?/sec
logical_trivial_join_low_numbered_columns 1.00 743.0±8.02µs ? ?/sec 1.01 749.9±9.83µs ? ?/sec
physical_plan_tpcds_all 1.00 1336.8±8.42ms ? ?/sec 1.01 1354.8±8.50ms ? ?/sec
physical_plan_tpch_all 1.00 90.2±1.61ms ? ?/sec 1.03 93.1±1.37ms ? ?/sec
physical_plan_tpch_q1 1.00 4.9±0.07ms ? ?/sec 1.06 5.1±0.09ms ? ?/sec
physical_plan_tpch_q10 1.00 4.3±0.06ms ? ?/sec 1.02 4.4±0.08ms ? ?/sec
physical_plan_tpch_q11 1.00 3.9±0.05ms ? ?/sec 1.02 4.0±0.06ms ? ?/sec
physical_plan_tpch_q12 1.00 3.1±0.07ms ? ?/sec 1.00 3.1±0.08ms ? ?/sec
physical_plan_tpch_q13 1.00 2.1±0.03ms ? ?/sec 1.04 2.2±0.06ms ? ?/sec
physical_plan_tpch_q14 1.00 2.7±0.05ms ? ?/sec 1.06 2.8±0.05ms ? ?/sec
physical_plan_tpch_q16 1.00 3.7±0.08ms ? ?/sec 1.04 3.8±0.09ms ? ?/sec
physical_plan_tpch_q17 1.00 3.5±0.05ms ? ?/sec 1.02 3.6±0.06ms ? ?/sec
physical_plan_tpch_q18 1.00 3.9±0.05ms ? ?/sec 1.00 3.9±0.07ms ? ?/sec
physical_plan_tpch_q19 1.00 6.0±0.07ms ? ?/sec 1.03 6.2±0.08ms ? ?/sec
physical_plan_tpch_q2 1.00 7.7±0.10ms ? ?/sec 1.02 7.9±0.06ms ? ?/sec
physical_plan_tpch_q20 1.00 4.5±0.07ms ? ?/sec 1.04 4.7±0.09ms ? ?/sec
physical_plan_tpch_q21 1.00 6.1±0.07ms ? ?/sec 1.04 6.3±0.06ms ? ?/sec
physical_plan_tpch_q22 1.00 3.3±0.06ms ? ?/sec 1.04 3.4±0.07ms ? ?/sec
physical_plan_tpch_q3 1.00 3.1±0.05ms ? ?/sec 1.01 3.1±0.06ms ? ?/sec
physical_plan_tpch_q4 1.00 2.3±0.05ms ? ?/sec 1.02 2.3±0.05ms ? ?/sec
physical_plan_tpch_q5 1.00 4.5±0.07ms ? ?/sec 1.02 4.5±0.07ms ? ?/sec
physical_plan_tpch_q6 1.00 1538.7±18.47µs ? ?/sec 1.03 1586.4±21.54µs ? ?/sec
physical_plan_tpch_q7 1.00 5.7±0.08ms ? ?/sec 1.01 5.7±0.13ms ? ?/sec
physical_plan_tpch_q8 1.00 7.3±0.10ms ? ?/sec 1.01 7.4±0.08ms ? ?/sec
physical_plan_tpch_q9 1.00 5.6±0.09ms ? ?/sec 1.00 5.6±0.08ms ? ?/sec
physical_select_all_from_1000 1.00 60.6±0.25ms ? ?/sec 1.01 61.4±0.29ms ? ?/sec
physical_select_one_from_700 1.00 3.6±0.05ms ? ?/sec 1.02 3.7±0.04ms ? ?/sec
My second run was consistent:
Details
++ critcmp main loop-expr-simplifier-static-dispatch
group loop-expr-simplifier-static-dispatch main
----- ------------------------------------ ----
logical_aggregate_with_join 1.00 1213.6±9.49µs ? ?/sec 1.01 1220.2±37.22µs ? ?/sec
logical_plan_tpcds_all 1.00 158.8±1.65ms ? ?/sec 1.01 160.0±1.55ms ? ?/sec
logical_plan_tpch_all 1.00 16.9±0.21ms ? ?/sec 1.00 16.9±0.20ms ? ?/sec
logical_select_all_from_1000 1.00 18.8±0.12ms ? ?/sec 1.00 18.7±0.10ms ? ?/sec
logical_select_one_from_700 1.00 809.7±10.72µs ? ?/sec 1.00 812.4±11.32µs ? ?/sec
logical_trivial_join_high_numbered_columns 1.00 756.6±7.03µs ? ?/sec 1.01 761.9±7.95µs ? ?/sec
logical_trivial_join_low_numbered_columns 1.00 742.9±13.09µs ? ?/sec 1.01 748.3±14.13µs ? ?/sec
physical_plan_tpcds_all 1.00 1337.9±9.37ms ? ?/sec 1.01 1346.9±8.49ms ? ?/sec
physical_plan_tpch_all 1.00 92.2±1.47ms ? ?/sec 1.00 92.0±1.27ms ? ?/sec
physical_plan_tpch_q1 1.00 5.0±0.09ms ? ?/sec 1.03 5.1±0.07ms ? ?/sec
physical_plan_tpch_q10 1.00 4.4±0.11ms ? ?/sec 1.00 4.4±0.08ms ? ?/sec
physical_plan_tpch_q11 1.00 3.9±0.06ms ? ?/sec 1.02 4.0±0.08ms ? ?/sec
physical_plan_tpch_q12 1.00 3.1±0.06ms ? ?/sec 1.01 3.1±0.05ms ? ?/sec
physical_plan_tpch_q13 1.00 2.1±0.04ms ? ?/sec 1.00 2.1±0.04ms ? ?/sec
physical_plan_tpch_q14 1.00 2.7±0.05ms ? ?/sec 1.04 2.8±0.06ms ? ?/sec
physical_plan_tpch_q16 1.00 3.7±0.08ms ? ?/sec 1.03 3.8±0.07ms ? ?/sec
physical_plan_tpch_q17 1.01 3.6±0.07ms ? ?/sec 1.00 3.6±0.07ms ? ?/sec
physical_plan_tpch_q18 1.00 4.0±0.07ms ? ?/sec 1.01 4.0±0.05ms ? ?/sec
physical_plan_tpch_q19 1.00 6.2±0.07ms ? ?/sec 1.02 6.3±0.07ms ? ?/sec
physical_plan_tpch_q2 1.00 7.8±0.09ms ? ?/sec 1.01 7.9±0.07ms ? ?/sec
physical_plan_tpch_q20 1.00 4.6±0.08ms ? ?/sec 1.00 4.6±0.07ms ? ?/sec
physical_plan_tpch_q21 1.00 6.2±0.08ms ? ?/sec 1.01 6.3±0.09ms ? ?/sec
physical_plan_tpch_q22 1.00 3.4±0.08ms ? ?/sec 1.02 3.5±0.09ms ? ?/sec
physical_plan_tpch_q3 1.00 3.1±0.07ms ? ?/sec 1.00 3.2±0.05ms ? ?/sec
physical_plan_tpch_q4 1.00 2.3±0.06ms ? ?/sec 1.01 2.3±0.05ms ? ?/sec
physical_plan_tpch_q5 1.00 4.4±0.06ms ? ?/sec 1.03 4.6±0.05ms ? ?/sec
physical_plan_tpch_q6 1.00 1551.7±25.84µs ? ?/sec 1.03 1592.4±22.64µs ? ?/sec
physical_plan_tpch_q7 1.00 5.7±0.06ms ? ?/sec 1.00 5.7±0.09ms ? ?/sec
physical_plan_tpch_q8 1.00 7.4±0.08ms ? ?/sec 1.02 7.5±0.07ms ? ?/sec
physical_plan_tpch_q9 1.00 5.7±0.09ms ? ?/sec 1.00 5.7±0.08ms ? ?/sec
physical_select_all_from_1000 1.00 61.2±0.81ms ? ?/sec 1.00 61.3±0.33ms ? ?/sec
physical_select_one_from_700 1.01 3.7±0.05ms ? ?/sec 1.00 3.6±0.04ms ? ?/sec
The other PR has been merged: https://github.com/apache/datafusion/pull/10358
I think we can merge / rebase this PR now and mark it ready for review
I will take another look at this and see if I can clean it up a bit more.
Alright, I think this is in a good state now. I added a dynamic dispatch API, which I think could be useful if we ever add a TreeNodeRewriter impl for OptimizerRule; then we could run the whole optimizer through this cycle logic.
I made some things public for doctests, but if we don't want to do that I can just mark the doctests as ignored.
I removed the dynamic dispatch API for now because I think there is a better way to write it, but I can't really do that until we have a TreeNodeRewriter impl for Arc<dyn OptimizerRule>.
I think the logic is correct, although it took me some time to understand the difference between the first pass and the later passes; I don't have a better design to suggest, though.
We need to figure out how many rewrites are in a cycle. Since there is no Vec or other data structure, we cannot use a length, so we manually count and then record the cycle length at the end of the first pass.
Note that the dynamic dispatch API that I deleted from this PR does not have this problem, since we can simply set cycle_length to vec.len().
You could make this code not have a special first pass if you wanted it to be simpler to understand: if you check cycle_length.is_none() in record_cycle_length before setting cycle_length, you could call it at the end of every cycle and then run everything in a single try_fold loop. However, I wanted to avoid unnecessary conditional checks, so I run the first cycle, record the cycle length, and then continue with the remaining cycles.
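For reference, a minimal sketch of that simpler single-loop variant might look like the following. The struct and helper names (CycleState, record_rewrite, is_done) are illustrative rather than the exact names in this PR, although record_cycle_length and cycle_length mirror the discussion above:

```rust
/// Hypothetical state for a rewrite cycle where the cycle length is discovered
/// by counting rewrites during the first cycle.
struct CycleState {
    /// Number of rewrites in one full cycle; unknown until the first cycle finishes.
    cycle_length: Option<usize>,
    /// Total rewrites executed so far.
    total_count: usize,
    /// Consecutive rewrites that made no change.
    unchanged_streak: usize,
}

impl CycleState {
    fn new() -> Self {
        Self { cycle_length: None, total_count: 0, unchanged_streak: 0 }
    }

    /// Called at the end of every cycle; only the first call actually records
    /// the length, so later cycles pay for one extra branch per cycle.
    fn record_cycle_length(&mut self) {
        if self.cycle_length.is_none() {
            self.cycle_length = Some(self.total_count);
        }
    }

    /// Called after each individual rewrite.
    fn record_rewrite(&mut self, changed: bool) {
        self.total_count += 1;
        self.unchanged_streak = if changed { 0 } else { self.unchanged_streak + 1 };
    }

    /// True once a whole cycle has run without any rewrite making progress.
    fn is_done(&self) -> bool {
        matches!(self.cycle_length, Some(len) if self.unchanged_streak >= len)
    }
}
```

The trade-off is the extra is_none() branch on every cycle boundary and a uniform driver loop, versus the current design's special-cased first pass that avoids the check entirely.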
This makes the code and the API somewhat complicated for something that should be relatively simple. But I found that dynamic dispatch and additional conditional checks caused significant performance regressions when operating over a small number of rewriters (only 3 rewriters here), so I wanted to make an API usable with static dispatch.
A dynamic dispatch API is more appropriate for the top-level optimizer because:
- it's already using dynamic dispatch
- the overhead of dynamic dispatch should be a smaller percentage of execution time when we are operating over a larger number of rewriters (see the sketch below)
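Purely as an illustration of that point (not code from this PR), a dynamic-dispatch driver over a Vec of rules avoids the cycle-length bookkeeping entirely, since the cycle length is just rules.len(). The Rule trait below is a hypothetical stand-in for whatever adapter the top-level optimizer would use, not DataFusion's actual OptimizerRule trait:

```rust
/// Hypothetical stand-in for an optimizer rule adapter.
trait Rule<T> {
    /// Applies the rule and reports whether anything changed.
    fn apply(&self, value: T) -> (T, bool);
}

/// Runs all rules in cycles, stopping early once a full cycle makes no progress.
fn run_until_fixed<T>(mut value: T, rules: &[Box<dyn Rule<T>>], max_cycles: usize) -> T {
    // With a Vec/slice of rules there is no first-pass counting:
    // the cycle length is simply the number of rules.
    let cycle_length = rules.len();
    let mut unchanged_streak = 0;

    for _ in 0..max_cycles {
        for rule in rules {
            let (next, changed) = rule.apply(value);
            value = next;
            unchanged_streak = if changed { 0 } else { unchanged_streak + 1 };
            if unchanged_streak >= cycle_length {
                return value; // a full cycle made no progress
            }
        }
    }
    value
}
```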
Is this PR ready for the next round of review, @erratic-pattern? Or do you plan to make further changes to it?
@alamb I will try to update this today or tomorrow. I've been putting this off a bit.
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
Checking this out...
I fixed a clippy lint and merged up from main to get the latest changes into this PR
I ran the planning benchmarks with this branch and, in summary, there is basically no performance change.
Yes, that is what I observed as well. I assume the overhead offsets any potential improvement from skipping steps here. I think the value of this API comes from the potential reusability of the logic for the entire optimizer. It is not ready to do that in its current state, but with some improvements it could be used across the optimizer.
I will update this PR with review suggestions (maybe) this weekend
Marking as draft while feedback is incorporated.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Going to update this soon to get it ready for merging.