Fix and enable branch minimization in backend
There is an incomplete jump minimization pass in the backend, that is never used (since the first SVN revision of D2). This PR modernizes and enables the pass. The algorithm is still quadratic, but pathological case is somewhat rare because of restrictions from the glue layer. I took some care not to reorder if-chains to avoid disrupting branch prediction.
Generally, jump minimization reduces frontend stall and provides 5% improvement for branchy code. There is 2% improvement for Jabba Laci's AoC program and very tiny improvement for code in this SO thread (compiled with -inline -O -release, tested with hyperfine -w 3 -r 15, on an Intel i5-12500H with atom cores disabled):
Benchmark 1: ./aoc_brmin
Time (mean ± σ): 3.548 s ± 0.007 s [User: 3.541 s, System: 0.008 s]
Range (min … max): 3.538 s … 3.559 s 15 runs
Benchmark 2: ./aoc_master
Time (mean ± σ): 3.619 s ± 0.007 s [User: 3.610 s, System: 0.010 s]
Range (min … max): 3.606 s … 3.632 s 15 runs
Summary
./aoc_brmin ran
1.02 ± 0.00 times faster than ./aoc_master
Benchmark 1: ./stackoverflow_brmin
Time (mean ± σ): 1.037 s ± 0.010 s [User: 1.036 s, System: 0.001 s]
Range (min … max): 1.022 s … 1.058 s 15 runs
Benchmark 2: ./stackoverflow_master
Time (mean ± σ): 1.044 s ± 0.005 s [User: 1.043 s, System: 0.001 s]
Range (min … max): 1.037 s … 1.055 s 15 runs
Summary
./stackoverflow_brmin ran
1.01 ± 0.01 times faster than ./stackoverflow_master
Let's see if it breaks anything.
I don't understand quite what this does, but I like the results! (Why is it still in progress?)
Hey, you wrote the original code, I expected there to be some story around this.
It rearranges basic blocks to reduce jumps. For example, a catch block in the middle of a function can be reordered to after the epilogue, so the locality is better. It still has some issues with loops in the current form.
while (true)
{
if (f())
{
g();
break;
}
else
h();
}
static foreach (i; 0 .. 64) a[i] = i;
return;
The else block will be moved to the end of the function despite the long distance, which gives atrocious locality to the loop.
Lstart:
while (true)
{
if (f())
{
g();
break;
}
else
goto Lend;
}
static foreach (i; 0 .. 64) a[i] = i;
return;
Lend:
h();
goto Lstart;
Statically it's impossible to determine the hot branch at runtime, so such motion is unacceptable. I'm considering limiting the distance of motion, or just inhibiting motion inside a loop to fix this. I wonder if there are better ways to do this.
We did this for the assert failure branches sometime ago of moving them past the function epilogue.
Thanks for your pull request and interest in making D better, @limepoutine! We are looking forward to reviewing it, and you should be hearing from a maintainer soon. Please verify that your PR follows this checklist:
- My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
- My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
- I have provided a detailed rationale explaining my changes
- New or modified functions have Ddoc comments (with
Params:andReturns:)
Please see CONTRIBUTING.md for more information.
If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.
Bugzilla references
Your PR doesn't reference any Bugzilla issue.
If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.
Testing this PR locally
If you don't have a local development environment setup, you can use Digger to test this PR:
dub run digger -- build "master + dmd#22158"
Added a completely nonsensical cost model to prevent pathological cases. Anyone has better ideas?