Improve default optimizer steps sequence in `via-ir`
The default optimizer sequence we have at the moment is quite large, and thus causes bloating in compilation times.
libsolidity/interface/OptimiserSettings.h
static char constexpr DefaultYulOptimiserSteps[] =
"dhfoDgvulfnTUtnIf" // None of these can make stack problems worse
"["
"xa[r]EscLM" // Turn into SSA and simplify
"cCTUtTOntnfDIul" // Perform structural simplification
"Lcul" // Simplify again
"Vcul [j]" // Reverse SSA
// should have good "compilability" property here.
"Tpeul" // Run functional expression inliner
"xa[rul]" // Prune a bit more in SSA
"xa[r]cL" // Turn into SSA again and simplify
"gvif" // Run full inliner
"CTUca[r]LSsTFOtfDnca[r]Iulc" // SSA plus simplify
"]"
"jmul[jul] VcTOcul jmul"; // Make source short and pretty
The outcome should hopefully be a shorter sequence, that achieves the same level of optimization, while reducing the compile times (in the context of the optimizer pipeline). Special attention should be paid to steps that destroy the SSA form of its input (better suited for optimizing), and thus cause subsequent steps to perform poorly.
Just to elaborate a bit: the current sequence transforms to SSA and back in a loop. This is partially based on the legacy-code-transform fact that the number of variables translates rather directly to required stack slots. The new code transform does not suffer from this and should be able to generate efficient code directly from SSA.
Based on that, we should, generally, be able to transform to SSA and stay in SSA, up until code generation. However, that would require to ensure that none of the steps that perform meaningful optimizations destroy the SSA property.
Actually changing any potential such steps to preserve SSA form I'd consider a separate tasks. This first task should merely investigate what can be achieved with modifying the sequence without actually touching the steps, keeping in mind that transforming back from SSA should not add value and contribute to "compilability" relative to the current code transform anymore.
This issue has been marked as stale due to inactivity for the last 90 days. It will be automatically closed in 7 days.
While we're at it, we should double-check if the effects of the StackCompressor these days are actually still positive or whether we should just remove it (simplifying our entire pipeline)
For future reference, here are the initial results I showed last week: solc-seqbench-intro-and-initial-results.md.
And here's a bit more analysis:
- Now with plots and tables that compare all sequences on the same contract or all contracts with the same sequence.
- I added an extra test sequence:
small-loops. It's just the default one with the big loop replaced with looping each component individually.
I haven't analyzed it all yet so here are the plots and some casual observations. I still need to go over those plots in more detail.
Results
Contract with different sequences
Sequence with different contracts
Observations
Contract deposit_contract
-
single-passtakes 2x less time and gets only slightly worse results. - others get almost the same results and timing is very close (but
small-loopsbeatsdefault, which beatsalways-ssa-min) -
small-loopsexecutes significantly fewer steps thandefaultandalways-ssa-min, but timing differs only minimally -
small-loopsconverges visibly slower, especially in terms of bytecode size.defaultandalways-ssa-minwould beat it if we stopped as soon as they converge - Time wasted after convergence: half for all but
single-passin execution, in bytecode size half to 1/4. - Improvements seem to happen sharply at specific steps, and these seem to be the same steps in each sequence. The first ones are:
c,M,e,l,T. - Almost all the improvement seems to happen in the first cycle of the main loop.
Contract ramanujan_pi
-
single-passconverges very quickly and is only a little worse in terms of execution time. For bytecode size it's even same assmall-loops.-
small-loopstakes 4x the time to converge and the other two 10x the time.
-
Since the single-pass sequence looks very promising, I decided to see how it would fare in our CI before I analyze the results in more detail. Here's a test PR with changes #14887 and I posted benchmarks in comments.
Results look promising. 17-46% decrease in total compilation time. Gas benchmarks also aren't too bad. While I can see up to 25% increase in gas in a few tests, the majority is impacted very little and benchmark diffs for external tests look much more reasonable - most of them below 1.5% for both deployment and runtime. Also, in many cases costs actually decrease. ENS is an outlier with bytecode size increasing 5%, though on the other hand ElementFI had bytecode decrease by 4%.
Overall I think that we might be able to tweak this sequence enough to get results very close to the current default.
Also, worth noting that the running time decrease we're seeing with this sequence may very likely be the bottom line of what we can achieve by tweaking the sequence. The plots show that most of the gains happen in that first pass so we'll need to keep most of it in one form or another.
no-cse sequences
Here's the data for sequences I analyzed yesterday. This time I tried to remove some c (CommonSubexpressionEliminator) and j (ExpressionJoiner) steps that seemed to degrade the results for deposit_contract. The results are mixed. It helps in some cases, makes things worse in others.
Contract with different sequences
Sequence with different contracts
And here's some analysis of the data I gathered so far. This is more or less what I presented on the call today.
Analysis
- All the sequences I checked out so far differ very little in terms of final bytecode size and runtime cost.
- While small, the differences are not necessarily negligible. The thing is that none of them is consistently always better or always worse than others. It depends on the input.
- The thing that does differ is primarily the optimization time:
- The single-pass sequences clearly win over the multi-pass sequences here. Most of the gain is achieved on the first pass. Further passes do improve the result in many cases but only minimally, which is not worth the 2-3x increase in optimization time.
-
always-ssa-minis a definitely worse than thedefaultsequence here. It usually takes just as much time and there were cases where it was almost 2x slower for almost no additional gain. - The
no-csevariants are generally slightly faster than their equivalents, due to having fewer steps. This indicates that dropping some steps may be worthwhile.
-
ramanujan_piis the contract that converges the slowest. Bytecode size converges only after 6 repetitions of the main loop and runtime cost after 4. Stopping after just a single pass gives worse results, though it's worth noting that subsequent passes make things worse before they make them better so ultimately the single-pass sequences do not end up being as much worse as one might expect. There may be a way to achieve results comparable with multi-pass in a more efficient way. - The
small-loopssequence in some cases finishes significantly faster thandefault(~60% faster forramanujan_pi) and never much longer. Still, it spends a lot of time not improving, just likedefault, so the single-pass variants generally seem more promising.
Conclusions
It seems that a single-pass sequence is the way to go. Such sequences are significantly faster than default and very close in terms of optimization. Probably can even be better with enough tweaking.
Even if we don't manage to consistently beat default with such a sequence, a safe option would be to expose the MaxRounds constant for the main loop as an optimization parameter and set it to 1 by default. In most cases this would give enough optimization and users could decide on their own whether they want to spend more time to achieve minimally better results.
Some more loose observations from looking at seqbench results recently:
- Cleanup sequence has almost no effect on bytecode size or runtime cost.
- On the other hand the part between the main loop and the cleanup sequence (described as "make source short and pretty") is pretty crucial. This is usually what brings the final cost/size to the lowest point, especially in the single-pass sequences. The
jmulpart seems to have a significant effect on bytecode size. - There seems to be no benefit in looping parts of sequence individually (i.e. like in the
small-loopssequence)- Only "SSA plus simplify" seems to give some limited bytecode size gains with repetition, e.g. in
ramanujan_picontract.
- Only "SSA plus simplify" seems to give some limited bytecode size gains with repetition, e.g. in
- The lowest bytecode size tends to be reached after
TUsteps but the current sequence always reverses this gain withc. Especially in the cleanup part that seeem undesirable.
the-good-parts sequence
This is a sequence I put together by starting with single-pass and then trying to remove the parts that did not contribute much to the final result. Then refined it on the ramanujan_pi contract by distilling the bits of the default sequence that improved its bytecode size with every repetition.
Contract with different sequences
Sequence with different contracts
And here's the overall comparison of all the sequences I tested so far.
Remarks
- Compilation time was measured in a different way than optimization time so it may not add up. Take them as very approximate.
- Percentages are based on the initial, unoptimized value (keep that in mind when comparing with percentages from e.g.
gas_diff_stats.py, which does not know the original value). - The
mincolumn is the minimum reached at any point in sequence. It's reported when it's lower then the final value.
Contract deposit_contract
| runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) | |
|---|---|---|---|---|---|---|---|---|---|
default |
-15.5% | -37.0% | -39.4% | -20.1% | -21.8% | 343 ms | 207 ms | 430 ms | |
the-good-parts |
-15.3% | -39.3% | -21.7% | 160 ms | 203 ms | 259 ms | |||
single-pass |
-14.1% | -14.9% | -34.4% | -38.9% | -18.7% | -21.5% | 154 ms | 184 ms | 304 ms |
always-ssa-min |
-15.5% | -15.6% | -36.3% | -39.0% | -19.7% | -21.5% | 363 ms | 174 ms | 466 ms |
always-ssa |
-15.2% | -36.7% | -38.7% | -20.0% | -21.3% | 236 ms | 184 ms | 377 ms | |
default-no-cse |
-15.5% | -39.4% | -21.8% | 298 ms | 210 ms | 394 ms | |||
single-pass-no-cse |
-14.8% | -38.4% | -21.0% | 134 ms | 182 ms | 255 ms | |||
small-loops |
-15.4% | -15.6% | -36.2% | -39.2% | -19.7% | -21.6% | 282 ms | 178 ms | 441 ms |
Contract FixedFeeRegistrar
| runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) | |
|---|---|---|---|---|---|---|---|---|---|
default |
-5.6% | -33.8% | -34.3% | -31.0% | -31.5% | 187 ms | 170 ms | 193 ms | |
the-good-parts |
-5.6% | -32.4% | -34.3% | -29.7% | -31.5% | 92 ms | 92 ms | 135 ms | |
single-pass |
-2.9% | -3.0% | -32.9% | -34.5% | -30.1% | -31.6% | 102 ms | 87 ms | 153 ms |
always-ssa-min |
-3.0% | -34.1% | -35.1% | -31.2% | -32.1% | 382 ms | 158 ms | 248 ms | |
always-ssa |
-2.9% | -3.0% | -32.9% | -34.5% | -30.1% | -31.6% | 130 ms | 88 ms | 188 ms |
default-no-cse |
-5.6% | -32.6% | -34.1% | -29.8% | -31.2% | 147 ms | 78 ms | 195 ms | |
single-pass-no-cse |
-3.1% | -35.1% | -35.3% | -32.2% | -32.3% | 133 ms | 94 ms | 160 ms | |
small-loops |
-2.9% | -3.0% | -33.4% | -35.0% | -30.6% | -32.1% | 149 ms | 84 ms | 196 ms |
Contract prbmath_unsigned
| runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) | |
|---|---|---|---|---|---|---|---|---|---|
default |
-44.4% | -30.8% | -30.3% | 550 ms | 596 ms | 600 ms | |||
the-good-parts |
-44.5% | -29.6% | -29.1% | 249 ms | 444 ms | 409 ms | |||
single-pass |
-44.4% | -30.7% | -30.8% | -30.2% | -30.3% | 259 ms | 420 ms | 410 ms | |
always-ssa-min |
-44.4% | -30.8% | -30.3% | 784 ms | 564 ms | 827 ms | |||
always-ssa |
-44.4% | -30.8% | -30.3% | 455 ms | 498 ms | 620 ms | |||
default-no-cse |
-44.4% | -29.5% | -29.0% | 510 ms | 613 ms | 638 ms | |||
single-pass-no-cse |
-44.3% | -29.7% | -29.2% | -29.3% | 241 ms | 433 ms | 614 ms | ||
small-loops |
-44.4% | -30.6% | -30.7% | -30.1% | 472 ms | 403 ms | 626 ms |
Contract ramanujan_pi
| runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) | |
|---|---|---|---|---|---|---|---|---|---|
default |
-67.2% | -67.7% | -39.2% | -39.9% | -36.2% | -36.8% | 520 ms | 129 ms | 622 ms |
the-good-parts |
-67.2% | -67.5% | -42.9% | -39.6% | 135 ms | 134 ms | 233 ms | ||
single-pass |
-64.1% | -64.8% | -30.1% | -33.5% | -27.8% | -30.9% | 109 ms | 109 ms | 170 ms |
always-ssa-min |
-66.8% | -67.3% | -40.2% | -37.1% | 531 ms | 114 ms | 602 ms | ||
always-ssa |
-65.7% | -66.4% | -30.8% | -42.2% | -28.4% | -39.0% | 160 ms | 124 ms | 232 ms |
default-no-cse |
-67.7% | -37.3% | -39.9% | -34.5% | -36.8% | 542 ms | 129 ms | 540 ms | |
single-pass-no-cse |
-65.0% | -25.9% | -32.1% | -23.9% | -29.6% | 89 ms | 118 ms | 151 ms | |
small-loops |
-66.3% | -67.0% | -28.8% | -31.9% | -26.5% | -29.5% | 303 ms | 109 ms | 379 ms |
Contract strings
| runtime gas | min | bytecode size | min | creation gas | min | optimization time | compilation time (unoptimized) | compilation time (optimized) | |
|---|---|---|---|---|---|---|---|---|---|
default |
-11.9% | -12.1% | -26.1% | -27.7% | -24.6% | -26.1% | 287 ms | 169 ms | 430 ms |
the-good-parts |
-11.9% | -25.9% | -27.7% | -24.3% | -26.1% | 134 ms | 185 ms | 239 ms | |
single-pass |
-11.6% | -11.8% | -23.7% | -27.7% | -22.3% | -26.1% | 121 ms | 161 ms | 219 ms |
always-ssa-min |
-11.9% | -12.1% | -25.4% | -25.9% | -23.9% | -24.4% | 330 ms | 173 ms | 406 ms |
always-ssa |
-11.8% | -12.0% | -23.1% | -29.8% | -21.7% | -28.1% | 185 ms | 178 ms | 295 ms |
default-no-cse |
-12.1% | -26.2% | -24.7% | 311 ms | 182 ms | 389 ms | |||
single-pass-no-cse |
-11.9% | -25.3% | -23.9% | 108 ms | 169 ms | 212 ms | |||
small-loops |
-11.6% | -11.8% | -25.4% | -27.7% | -23.9% | -26.1% | 255 ms | 164 ms | 349 ms |
Final summary for the-good-parts sequence
Comparison with default
- most between -1% and +3%
- one outlier at -6% (
ramanujan_pi.sol) - several outliers between +4% and +9% (including
base64.solanderc20.sol)
- bytecode size changes between -1.3% and +2.1%
- runtime cost changes between 0% and +0.3% with one outlier at +2.1% (uniswap)
- effect on legacy pipeline:
- bytecode size changes between 0% and +0.2%
- negligible runtime cost changes
- Improvement between -15% and -45%.
- Timing improved in all cases.
- Runtime of bytecode comparison job for optimized IR decreased by 25%.
seqbench results (values where the new sequence is worse in bold):
default runtime gas |
the-good-parts runtime gas |
default bytecode size |
the-good-parts bytecode size |
default compilation time (optimized) |
the-good-parts compilation time (optimized) |
|
|---|---|---|---|---|---|---|
deposit_contract |
-15.5% | -15.3% | -37.0% | -39.3% | 430 ms | 259 ms |
FixedFeeRegistrar |
-5.6% | -5.6% | -33.8% | -32.4% | 193 ms | 135 ms |
prbmath_unsigned |
-44.4% | -44.5% | -30.8% | -29.6% | 600 ms | 409 ms |
ramanujan_pi |
-67.2% | -67.2% | -39.2% | -42.9% | 622 ms | 233 ms |
strings |
-11.9% | -11.9% | -26.1% | -25.9% | 430 ms | 239 ms |
Conclusions
The new sequence clearly beats the single-pass sequence. Comparison with default is more mixed.
The outliers seem a bit concerning. Still, they're not extreme and go both ways so the new sequence is not consistently worse in any metric. In many cases it's a little better.
Overall results seem close enough that we're probably fine using it. I'm pretty sure the sequence can still be refined a little if we spend more time on it. All the sequences I tested have cases where they're better than default by a few percent, most of the time without doing as many repetitions. The trick is to get one that can do it consistently on most input.
Effects of removing StackCompressor
The results are oddly mixed: from completely neutral to very negative.
On one hand, I see almost no change in gas usage in our semantic tests.
I also ran seqbench on the default and the-good-parts sequences. Literally zero difference other than slightly compilation times (but still probably within the margin of error). Exactly the same bytecode sizes and execution gas. I did spot checks on the generated Yul and files are identical. To the point that I'm questioning whether my patched solc really worked correctly, but so far I see no evidence that it did not.
On the other hand, results of external tests are terrible (ir-optimize-evm+yul):
| project | bytecode_size | deployment_gas | method_gas |
|---|---|---|---|
| brink | 0% |
||
| colony | 0% |
||
| ens | +0.49% ❌ |
0% |
0% |
| perpetual-pools | +9.09% ❌ |
+9.19% ❌ |
+1.6% ❌ |
| pool-together | +0.86% ❌ |
||
| uniswap | +4.05% ❌ |
+3.59% ❌ |
+5.87% ❌ |
| yield_liquidator | +2.03% ❌ |
+2.29% ❌ |
+0.08% ❌ |
| zeppelin | +0.01% ❌ |
+0.01% ❌ |
+0.02% ❌ |
In some cases zero difference, but in others up to 10% increase in bytecode size and 6% increase in gas usage. And no cases where results improved.
Finally, the change caused some new StackTooDeep issues. Only in a few semantic/syntax tests in our test suite but also in 7 of the 15 external tests we have.
the-good-parts-mk2 sequence
Comparison with default
- same or better as
the-good-parts. Still mostly between -1% and +3% with outliers at -6% and 4-9%, but also has a number of cases where a 2-4% increase became 0-1%.
- bytecode size changes between -2.4% and +1.65%: better than
the-good-parts - runtime cost changes between 0% and +0.15% with one outlier at +2% (uniswap): better than
the-good-parts - effect on legacy pipeline in the same range as
the-good-parts(bytecode size changes between 0% and +0.3% and negligible runtime cost changes)
seqbench results
(values that differ from the-good-parts in bold):
default runtime gas |
the-good-parts-mk2 runtime gas |
the-good-parts runtime gas |
default bytecode size |
the-good-parts-mk2 bytecode size |
the-good-parts bytecode size |
default compilation time |
the-good-parts-mk2 compilation time |
the-good-parts compilation time |
|
|---|---|---|---|---|---|---|---|---|---|
deposit_contract |
-15.5% | -15.3% | -15.3% | -37.0% | -39.4% | -39.3% | 430 ms | 248 ms | 259 ms |
erc20 |
-4.2% | -4.2% | -4.1% | -37.0% | -35.9% | -34.2% | 149 ms | 113 ms | 94 ms |
FixedFeeRegistrar |
-5.6% | -5.6% | -5.6% | -33.8% | -32.8% | -32.4% | 193 ms | 120 ms | 135 ms |
prbmath_unsigned |
-44.4% | -44.5% | -44.5% | -30.8% | -29.6% | -29.6% | 600 ms | 403 ms | 364 ms |
ramanujan_pi |
-67.2% | -67.2% | -67.2% | -39.2% | -42.9% | -42.9% | 622 ms | 238 ms | 233 ms |
strings |
-11.9% | -11.9% | -11.9% | -26.1% | -27.2% | -25.9% | 430 ms | 236 ms | 239 ms |
Conclusions
The new sequence is same or better than the-good-parts in nearly all cases. Speed is also pretty much the same.
It's still not as good as default but also still much faster.
Comparison of the-good-parts sequence variants
seqbench results on cancun
Contracts
New contracts:
-
erc20(ERC20 stub from our repo) -
oz-erc20(actual ERC20 from OpenZeppelin) -
chainsfrom our repo (pathologically slow contract from our timing benchmarks)
Comparison between contracts for each sequence
Note that I excluded chains.sol here because the range of values makes it hard to compare with other contracts.
Which variant of the-good-parts is the best?
Quality of optimization
- External tests: in almost all cases mk3 is a minimal improvement over mk2. mk2 was also better than mk1.
- seqbench: bytecode size and runtime gas differences are usually fractions of a percent. In the few cases where they are bigger mk2 and mk3 beat mk1. mk2 is sometimes minimally better than mk3.
- Tests in the repo: mk2 is a small improvement over mk1 in some cases. mk3 is very similar to mk2.
- Optimized output in the repo: mk3 addresses some cases where code snippets from the repo were not fully optimized. Still not all of them. Haven't compared mk1 and mk2 in that regard.
Overall: mk3 >= mk2 >= mk1. But differences are small.
Compilation time
- External tests: Neither mk3 nor mk2 seems consistently faster in CI. There are differences and sometimes they seem significant, but the timing variance in CI is so big that I think they are still the result of that variance.
- seqbench: mk2 is only slightly slower than mk1. mk3 is more visibly slower than mk2, but still very close. Compared to
defaultthey can be considered pretty much the same. The difference comes from each version having a few more steps. - Timing benchmark in the repo: differences between mk2 and mk3 are in favor of the latter, but are very small, probably smaller than normal variance.
Overall: mk1 >= mk2 >= mk3. But again, differences are small.
Conclusion
All of the sequences are pretty close. I'd recommend mk3 because it has a small edge in terms of optimization quality and the difference in timing is not significant. It's also the one I spent the most time analyzing so I don't see much point in going back to the other two. Still, it seems that improvements were marginal at best.
the-good-parts-mk3 vs default
Quality of optimization
- Still mostly between -1% and +3% with outliers at -5% and 4-9%.
- bytecode size changes between -1.85% and +1.65%.
- runtime cost between 0% and +0.15%.
- Note that the +2% outlier for
uniswapwe've seen with mk1 is gone, but this is probably due to switch to Cancun, not improvements in the sequence.
- Note that the +2% outlier for
- effect on legacy pipeline: bytecode size between 0% and +0.3%, runtime cost change negligible.
seqbench:
- runtime cost improvement differences are negligible in all cases.
- bytcode size improvement differs by 1-2% either way so it's wash. The only real outlier being the pathological
chans.solwheredefaultincreases the size by 252.3%, whilethe-good-parts-mk3"only" by 227.2%.
Compilation time
- New sequence is 10-65% faster, depending on the test (with
colonybeing a single outlier at +6%).
seqbench:
- Compilation time decreases by 1/3 to 2/3 depending on the test.
- The compilation time is halved in all three cases.