runtime (gc_blocks.go): make sweep branchless
Instead of looping over each block, we can use bit hacks to operate on an entire state byte. This deinterleaves the state bits in order to enable these tricks.
Performance in the problematic go/format benchmark:
│ conservative.txt │ conservative-branchless.txt │ boehm.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
Format/array1-10000 30.46m ± 2% 28.93m ± 2% -5.01% (p=0.004 n=20) 22.13m ± 5% -27.35% (p=0.000 n=20)
│ conservative.txt │ conservative-branchless.txt │ boehm.txt │
│ B/s │ B/s vs base │ B/s vs base │
Format/array1-10000 2.027Mi ± 2% 2.136Mi ± 3% +5.41% (p=0.004 n=20) 2.789Mi ± 5% +37.65% (p=0.000 n=20)
│ conservative.txt │ conservative-branchless.txt │ boehm.txt │
│ B/op │ B/op vs base │ B/op vs base │
Format/array1-10000 4.663Mi ± 0% 4.663Mi ± 0% ~ (p=1.000 n=20) 6.979Mi ± 0% +49.68% (p=0.000 n=20)
│ conservative.txt │ conservative-branchless.txt │ boehm.txt │
│ allocs/op │ allocs/op vs base │ allocs/op vs base │
Format/array1-10000 204.3k ± 0% 204.3k ± 0% ~ (p=1.000 n=20) 0.0k ± 0% -100.00% (p=0.000 n=20)
We could probably squeeze more performance out of this by making the state masks bigger, but we would need popcnt on the target machine for that to really work.
Please resolve merge conflicts now @niaow since #5102 was merged.
The AVR tests are broken due to an unrelated issue I will need to fix in a separate PR first #5111
@niaow can you please rebase this branch now?
I'll try to review this today.
This PR is once again ready for rebase @niaow
I am going to move the counters in here now that #5105 is merged. I was going to do that in a separate PR but I think it is simpler to just do it here.
Finished moving the counters. Updated perf numbers:
│ conservative.txt │ conservative-new.txt │ boehm.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
Format/array1-10000 23.79m ± 1% 21.36m ± 3% -10.21% (p=0.000 n=20) 20.80m ± 2% -12.57% (p=0.000 n=20)
│ conservative.txt │ conservative-new.txt │ boehm.txt │
│ B/s │ B/s vs base │ B/s vs base │
Format/array1-10000 2.594Mi ± 1% 2.890Mi ± 4% +11.40% (p=0.000 n=20) 2.971Mi ± 2% +14.52% (p=0.000 n=20)
Finished moving the counters. Updated perf numbers:
│ conservative.txt │ conservative-new.txt │ boehm.txt │ │ sec/op │ sec/op vs base │ sec/op vs base │ Format/array1-10000 23.79m ± 1% 21.36m ± 3% -10.21% (p=0.000 n=20) 20.80m ± 2% -12.57% (p=0.000 n=20) │ conservative.txt │ conservative-new.txt │ boehm.txt │ │ B/s │ B/s vs base │ B/s vs base │ Format/array1-10000 2.594Mi ± 1% 2.890Mi ± 4% +11.40% (p=0.000 n=20) 2.971Mi ± 2% +14.52% (p=0.000 n=20)
This is looking good!
@dgryski waiting on your feedback...
I tried to make this a bit clearer, see if this looks okay.