Performance improvements of individual Yul Optimizer steps
A catch-all issue for specific ideas on performance improvements in individual optimizer steps, and a place to post profiling results and discuss them.
- [x] FullInliner's `BodyCopier` takes ~60% of inliner's running time.
- [x] Are two passes of FullInliner necessary?
- [x] Oversized time impact of UnusedPruner
- [ ] ...
Are two passes of the FullInliner necessary?
FullInliner does two passes through the code, potentially duplicating some work. The first pass (inlining only tiny functions) seems to be a subset of the second, so I decided to check whether it makes that much difference and how it impacts running time.
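To make the subset relationship concrete, here's a minimal, self-contained sketch of the two-pass shape. The size thresholds are hypothetical; the real FullInliner is driven by `Pass::InlineTiny` and `Pass::InlineRest`, as in the table below.

```cpp
// A toy model of the two-pass structure: "inlining" a candidate just removes
// it from the worklist. If "tiny" selects a subset of what the second pass
// would inline anyway, the first pass only does work the second repeats.
#include <functional>
#include <vector>

struct Function { int bodySize = 0; };

void inlinePass(std::vector<Function>& _candidates, std::function<bool(Function const&)> const& _shallInline)
{
    std::erase_if(_candidates, _shallInline);
}

int main()
{
    std::vector<Function> candidates{{1}, {5}, {50}};
    inlinePass(candidates, [](Function const& _f) { return _f.bodySize <= 2; });  // first pass: tiny only
    inlinePass(candidates, [](Function const& _f) { return _f.bodySize <= 10; }); // second pass: a superset
    return candidates.size() == 1 ? 0 : 1;
}
```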
Results
Removing the first pass did shave off ~1 s from inliner's running time in my benchmarks (on a 120 s compilation of Uniswap).
Here's what it looks like in detail:
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100% | 4.183 s | 810 | run() |
| 78% | 3.085 s | 810 | ├─run(Pass::InlineRest) |
| 13% | 0.561 s | 810 | ├─run(Pass::InlineTiny) |
| 14% | 0.458 s | 810 | └─FullInliner() |
| 4% | 0.160 s | 810 | └─ReferencesCounter::countReferences() |
So it may be closer to 0.5 s.
Gas impact
I don't have much data on runtime cost, but what I'm seeing is not bad. In most tests in our repo the impact is slightly negative but minimal. At the same time there are a few specific tests where it goes down by a lot.
I created #15509 to see external test results, but runtime gas results are very sparse. It does have a very positive impact on bytecode size though.
ir-optimize-evm+yul:
| project | bytecode_size | deployment_gas | method_gas |
|---|---|---|---|
| brink | 0% | | |
| colony | +0% | | |
| elementfi | -0.27% ✅ | | |
| ens | -1.77% ✅ | -0.66% ✅ | -0% |
| euler | -0.14% ✅ | | |
| gnosis | | | |
| gp2 | +0.15% ❌ | | |
| pool-together | +0.22% ❌ | | |
| uniswap | -4.62% ✅ | | |
| yield_liquidator | -1.54% ✅ | -1.62% ✅ | -0.11% ✅ |
| zeppelin | -2.62% ✅ | | |
Conclusions
- Mild improvement in terms of running time.
- Surprisingly nice improvement in terms of bytecode size.
- Impact on runtime gas is uncertain. We need more data.
FullInliner's BodyCopier takes ~60% of inliner's running time.
Profiling FullInliner::tryInlineStatement()
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100% | 4.183 s | 810 | run() |
| 68% | 2.860 s | 1566132 | └─tryInlineStatement() |
| 54% | 2.278 s | 36552 | ├─performInline() |
| 48% | 1.990 s | 36552 | │ ├─BodyCopier |
| 25% | 1.060 s | 772025 | │ │ └─NameDispenser::newName() |
| 5% | 0.216 s | 62898 | │ └─newVariable() |
| 4% | 0.187 s | 62898 | │ └─NameDispenser::newName() |
| 13% | 0.536 s | 909031 | └─shallInline() |
| 11% | 0.468 s | 156399 | └─ReferencesCounter::countReferences() |
The numbers come from the same Uniswap run as in https://github.com/ethereum/solidity/issues/15508#issuecomment-2407974008 above. Just note that they are from both passes combined.
@ekpyron had a hypothesis that it's NameDispenser::newName() that takes most of the time in BodyCopier. This is a bit tricky to verify because a single call of this function is relatively lightweight. From the numbers above it looks like it accounts for only half of BodyCopier's running time. However, if you look at the table with steps at the bottom of this comment, you'll see that the normal running time of FullInliner is ~3 s, while here it blew up to ~4 s. I think the missing second is the overhead of the profiler probe I placed inside BodyCopier's operator(): ~1 s spread over ~800k invocations is on the order of 1 µs per call, which is plausible for a probe. This would confirm the hypothesis.
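For illustration, here's a minimal sketch of how such a scope probe might look (an assumption for demonstration, not the actual instrumentation) and why its cost scales with call count:

```cpp
// An RAII probe pays for two clock reads per invocation. At roughly a
// microsecond each, that reaches ~1 s over ~800k calls.
#include <chrono>
#include <iostream>

struct ScopeProbe
{
    using Clock = std::chrono::steady_clock;
    explicit ScopeProbe(long long& _totalNs): m_totalNs(_totalNs), m_start(Clock::now()) {}
    ~ScopeProbe()
    {
        m_totalNs += std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - m_start).count();
    }
    long long& m_totalNs;
    Clock::time_point m_start;
};

int main()
{
    long long totalNs = 0;
    for (int i = 0; i < 800'000; ++i)
    {
        ScopeProbe probe(totalNs); // overhead paid on every call of the probed function
        // ... the instrumented body would run here ...
    }
    std::cout << "time attributed by probes: " << double(totalNs) / 1e9 << " s\n";
}
```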
Optimizing NameDispenser::newName() the quick and dirty way
I suspected that there may be a way to address the bottleneck locally and get a speedup right away, even before we're done with the YulName stuff.
https://github.com/ethereum/solidity/blob/7893614a31fbeacd1966994e310ed4f760772658/libyul/optimiser/NameDispenser.cpp#L47-L57
Identifying which part of newName() is the heaviest was more difficult than I expected though. At first I suspected illegalName(), but profiling led me to believe that it's the string conversion instead. I wrote a faster, optimized conversion but it turned out to be a red herring. It was not the bottleneck. In the end I had to resort to artificially repeating each part multiple times in a loop to find out how heavy the parts are relative to each other (see the sketch after the table):
| Time % | Element |
|---|---|
| 61.9% | YulString conversion |
| 22.0% | illegalName() |
| 9.4% | std::string concatenation |
| 3.7% | YulString::str() |
| 3.0% | std::to_string() |
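To illustrate the amplification trick, here's a minimal sketch with a hypothetical `makeName()` stand-in (not the real `newName()`): a single call is too cheap to profile directly, but repeating just one suspect part N times makes its relative weight measurable.

```cpp
// Repeat only the part under test; everything else runs once per call.
#include <string>

std::string makeName(size_t _counter, int _amplify)
{
    std::string name;
    for (int i = 0; i < _amplify; ++i) // amplify one candidate part
        name = "var_" + std::to_string(_counter); // e.g. the string conversion
    return name;
}

int main()
{
    // Time the whole loop with _amplify = 1 and _amplify = 10; the increase
    // approximates 9x the cost of the repeated part alone.
    std::string last;
    for (size_t i = 0; i < 1'000'000; ++i)
        last = makeName(i, 10);
    return last.empty() ? 1 : 0;
}
```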
Looks like it's mostly in YulString and we'll have to wait for #15215 after all.
Impact of NameDispenser on optimizer as a whole
I measured this by replacing the implementation with this:
```cpp
// Quick and dirty: skip NameDispenser's uniqueness checks and emit a name
// that is unique by construction.
static size_t counter = 0;
++counter;
return YulString("___UNIQUE_NAME___" + std::to_string(counter));
```
The result is that the total running time of the optimizer when compiling Uniswap went down from 60 s to 53 s. Here's a detailed breakdown by step:
| Time % (before) | Time % (after) | Time (before) | Time (after) | Step |
|---|---|---|---|---|
| 1.1% | 0.9% | 0.685 s | 0.469 s | ExpressionSplitter |
| 2.8% | 2.7% | 1.693 s | 1.429 s | LoopInvariantCodeMotion |
| 3.6% | 2.1% | 2.200 s | 1.088 s | **FunctionSpecializer** |
| 3.7% | 3.7% | 2.240 s | 1.981 s | UnusedStoreEliminator |
| 3.9% | 4.1% | 2.327 s | 2.192 s | ConditionalSimplifier |
| 4.1% | 4.4% | 2.485 s | 2.307 s | LoadResolver |
| 4.6% | 4.7% | 2.795 s | 2.496 s | DeadCodeEliminator |
| 4.7% | 3.8% | 2.843 s | 2.006 s | **SSATransform** |
| 5.1% | 5.3% | 3.090 s | 2.797 s | ConditionalUnsimplifier |
| 5.3% | 3.7% | 3.193 s | 1.947 s | **FullInliner** |
| 8.2% | 8.8% | 4.963 s | 4.649 s | LiteralRematerialiser |
| 8.7% | 8.8% | 5.246 s | 4.626 s | **UnusedAssignEliminator** |
| 8.7% | 8.6% | 5.252 s | 4.525 s | **UnusedPruner** |
| 9.8% | 10.7% | 5.902 s | 5.672 s | ExpressionSimplifier |
| 20.7% | 22.7% | 12.497 s | 11.991 s | **CommonSubexpressionEliminator** |
| 100.0% | 100.0% | 60.385 s | 52.862 s | TOTAL |
The steps which became faster by 0.5 s or more are marked in bold. I omitted steps that take less than 1% of running time, since there's nothing interesting happening there.
Conclusions
- The `NameDispenser` seems to be the biggest bottleneck of the inliner.
- Eliminating `NameDispenser` will substantially speed up not only the inliner but also many other steps.
- The overhead of `NameDispenser` seems to be tied mostly to `YulString` and not some simpler things that could be quickly optimized locally. We do need #15215 for this.
Oversized time impact of UnusedPruner
Context
UnusedPruner is relatively simple, but takes a surprisingly large share of the total optimization time.
Here's what it looks like for Uniswap:
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100.0% | 62.251 s | 38316 | TOTAL |
| 20.5% | 12.731 s | 3281 | CommonSubexpressionEliminator |
| 9.5% | 5.898 s | 1391 | ExpressionSimplifier |
| 8.8% | 5.473 s | 1350 | UnusedAssignEliminator |
| 8.4% | 5.251 s | 2700 | UnusedPruner |
With total compilation time being ~120 s.
Top-level breakdown
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100% | 5.286 s | 2700 | run() |
| 100% | 5.285 s | 2700 | runUntilStabilisedOnFullAST() |
| 72% | 3.793 s | 2719 | runUntilStabilised() |
| 34% | 1.821 s | 741371 | operator()(Block) |
| 27% | 1.409 s | 4551 | constructor |
| 27% | 1.408 s | 4551 | ReferencesCounter::countReferences() |
| 15% | 0.782 s | 2700 | SideEffectsPropagator::sideEffects() |
| 8% | 0.444 s | 2700 | MSizeFinder::containsMSize() |
| 5% | 0.284 s | 2700 | CallGraphGenerator::callGraph() |
Pretty much all of the time is spent in general utilities and in operator(), which contains the main logic.
Note that runUntilStabilised() here has more calls than run(). It's because it's also invoked by StackCompressor. The time spent in those calls is negligible (0.04 s), so I'm omitting it.
operator()
This is the main custom bit in the pruner.
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100% | 5.767 s | 2700 | run() |
| 74% | 4.273 s | 2719 | runUntilStabilised() |
| 39% | 2.247 s | 741371 | operator()(Block) |
| 13% | 0.766 s | 295388 | FunctionDefinition case |
| 1% | 0.064 s | 3010287 | VariableDeclaration case |
| 0% | 0.024 s | 914453 | ExpressionStatement case |
| 0% | 0.015 s | 741371 | removeEmptyBlocks() |
The operator is relatively lightweight, but invoked a lot so profiling adds visible overhead (~0.5 s). However, even after subtracting the overhead, there's still almost 1 s unaccounted for. I did not manage to explain where it's going. It looks like it must be coming from ASTModifier or the for loop itself, but when profiled in isolation those two do not take nearly that much time.
The way the visitation is done does seem a bit redundant - we loop over statements in every block, but then run the original operator, which visits all that again. This is because the pruner needs to delete statements or replace them with statements of a different type, which is difficult to do in normal visitation. I spent a moment trying to rewrite it, but ultimately gave up.
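Here's a simplified sketch of the pattern (hypothetical types, not the real UnusedPruner code) showing why the pruner must own the loop over a block's statements: erasing an element mutates the statement list, which a plain read-only visit cannot express.

```cpp
// Erase unused statements in place; the real code also replaces some
// statements with statements of a different type, which is even harder to do
// through ordinary visitation.
#include <vector>

struct Statement { bool unused = false; };

void pruneBlock(std::vector<Statement>& _statements)
{
    std::erase_if(_statements, [](Statement const& _s) { return _s.unused; });
}

int main()
{
    std::vector<Statement> block{{false}, {true}, {false}};
    pruneBlock(block);
    return block.size() == 2 ? 0 : 1;
}
```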
Redundant reruns
One potential source of redundancy is the code that reruns the pruner until the result stabilises. I checked how often that happens and how much time it takes:
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100% | 3.659 s | 2719 | runUntilStabilised() |
| 65% | 2.378 s | 2719 | first pass |
| 35% | 1.279 s | 1384 | remaining passes |
So it seems that the loop runs ~2 times on average and that the second and later passes are faster than the first one. This would indicate that the second pass rarely does anything. I did a quick experiment by hard-coding it to one pass to see if it affects the results. Unfortunately, I do see a significant number of left-over unused statements in Yul tests, so this does not seem to be a viable simplification.
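To show why a second pass can still find work, here's a minimal sketch with hypothetical names (the real pruner counts references via ReferencesCounter): removing one statement can make another unused, which only the recount at the start of the next pass detects.

```cpp
// Removing `b := a` leaves `a` unreferenced; a single pass cannot see that.
#include <map>
#include <string>
#include <vector>

struct Assignment { std::string lhs; std::string rhs; };

bool prunePass(std::vector<Assignment>& _code)
{
    std::map<std::string, int> references; // recounted from scratch each pass
    for (Assignment const& a: _code)
        references[a.rhs]++;
    size_t sizeBefore = _code.size();
    std::erase_if(_code, [&](Assignment const& a) { return references[a.lhs] == 0; });
    return _code.size() != sizeBefore; // true if another pass might find work
}

int main()
{
    std::vector<Assignment> code{{"a", "x"}, {"b", "a"}}; // nothing else uses a or b
    int passes = 0;
    while (prunePass(code))
        ++passes;
    // Pass 1 removes b := a; pass 2 removes a := x; pass 3 confirms stability.
    return passes == 2 && code.empty() ? 0 : 1;
}
```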
Impact of general utilities on whole compilation
I checked how much time the utility classes consume cumulatively over the whole compilation (i.e. in all of the steps and the Yul->EVM transform), to see if focusing on optimizing any single one of them would make sense.
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100.0% | 60.691 s | | Optimizer steps (not including Yul->EVM) |
| 12.2% | 7.378 s | | TOTAL |
| 4.3% | 2.622 s | 8411 | SideEffectsPropagator::sideEffects() |
| 3.6% | 2.169 s | | ReferencesCounter::countReferences() |
| 2.9% | 1.781 s | 88083 | ReferencesCounter::countReferences(Block) |
| 0.6% | 0.387 s | 156399 | ReferencesCounter::countReferences(FunctionDefinition) |
| 0.0% | 0.001 s | 785498 | ReferencesCounter::countReferences(Expression) |
| 2.6% | 1.595 s | 13767 | CallGraphGenerator::callGraph() |
| 1.6% | 0.992 s | 5963 | MSizeFinder::containsMSize(Block) |
The one that might be conceptually simplest to address is MSizeFinder - since the optimizer never introduces `msize` itself, we could save ~1 s by checking for it once before optimization and just passing the flag on via context.
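A minimal sketch of what that could look like, with hypothetical names (the actual optimiser context is structured differently): scan for `msize` once up front and let steps consult the cached flag instead of rerunning MSizeFinder.

```cpp
// The flag is computed by a single scan before the first step runs.
struct OptimiserContext
{
    bool codeContainsMSize = false;
};

// A hypothetical step decision that would otherwise call
// MSizeFinder::containsMSize() on every run.
bool mayRemoveMemoryWrites(OptimiserContext const& _context)
{
    return !_context.codeContainsMSize;
}

int main()
{
    OptimiserContext context; // filled in once, after parsing
    return mayRemoveMemoryWrites(context) ? 0 : 1;
}
```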
Overall, their combined impact seems non-negligible, but it's spread over several of them. Focusing on optimizer steps first is probably more worthwhile.
Conclusions
- Most time spent in general utilities, especially finding references to identifiers and calculating side-effects.
- 1/5 of time is unaccounted for (in `operator()`). Could be the overhead of visitation or something else.
- Repeating until stable takes 1/3 of time and is seemingly redundant, but removing it actually hurts optimization.
- We may be able to save ~1 s on the whole compilation by finding `msize` once and storing a flag.
CSE and DataFlowAnalyzer slowness (initial investigation)
Timing of steps that use DataFlowAnalyzer
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100.0% | 64.251 s | | TOTAL OPTIMIZATION TIME |
| 19.8% | 12.712 s | 3281 | CommonSubexpressionEliminator |
| 7.3% | 4.713 s | 2741 | LiteralRematerialiser |
| 3.6% | 2.288 s | 810 | LoadResolver |
| 0.9% | 0.586 s | 810 | ExpressionSplitter |
| 0.7% | 0.455 s | 270 | EqualStoreEliminator |
| 0.4% | 0.260 s | 810 | Rematerialiser |
That's 21 s in total, ~33% of the optimizer running time. The whole compilation took 109 s (Uniswap).
Structure of CSE
CommonSubexpressionEliminator inherits from DataFlowAnalyzer and does not have all that much code of its own.
In neither of them does the execution time seem to be very concentrated.
These are the functions I could identify as taking the most time, but they account for less than half of the total:
| Time % | Time | Calls | Scope |
|---|---|---|---|
| 100% | 14.191 s | 3281 | CommonSubexpressionEliminator::run() |
| 17% | 2.405 s | 16564075 | DataFlowAnalyzer::handleAssignment() |
| 14% | 1.958 s | 2026470 | DataFlowAnalyzer::popScope() |
| 13% | 1.778 s | 2371866 | DataFlowAnalyzer::clearValues() |
| 9% | 1.239 s | 3281 | SideEffectsPropagator::sideEffects() |
There are also a number of caveats to these numbers:
- `DataFlowAnalyzer`'s calls here are from the whole compilation, not just from the CSE step (the profiler does not distinguish them).
- `handleAssignment()` sometimes calls `clearValues()`.
- Note that there are a lot of calls, so this timing also includes some overhead (see the difference with CSE in the previous table). The whole compilation here took 116 s.
- It's hard to accurately profile functions that recursively call each other and the AST visitor does this a lot, so I could only really profile the standalone helpers.
All the other non-operator methods of DataFlowAnalyzer taken together take less than 0.2 s. I also tried to profile the parts of the operators in isolation, but the time does not add up. The slowest were the ones processing ForLoop and If, taking around 0.2-0.4 s each. It's as if the visitation itself were taking a substantial amount of time.
Cost of structural equality
CSE uses SyntacticallyEqual to hash expressions it keeps track of. I checked the impact of that part alone. This includes both direct use and lookups/insertions in the map.
| Time | Calls | Scope |
|---|---|---|
| 1.094 s | 14858176 | SyntacticallyEqual |
For some context, CSE's visit(Expression), where some of this hashing happens, takes around 2.5 s (excluding the visit() call to the base class). That's in a compilation that took 91 s total.
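For illustration, here's a heavily simplified sketch of the general pattern (hypothetical types; the real CSE works on Yul AST nodes): a map keyed by expression structure, where every lookup pays for a structural comparison, which is what the SyntacticallyEqual cost above measures.

```cpp
#include <map>
#include <string>
#include <tuple>

struct Expr { std::string op; std::string lhs; std::string rhs; };

// Structural ordering, standing in for the SyntacticallyEqual-based lookup.
struct SyntacticLess
{
    bool operator()(Expr const& _a, Expr const& _b) const
    {
        return std::tie(_a.op, _a.lhs, _a.rhs) < std::tie(_b.op, _b.lhs, _b.rhs);
    }
};

int main()
{
    std::map<Expr, std::string, SyntacticLess> knownExpressions;
    knownExpressions[{"add", "x", "y"}] = "t1"; // first occurrence, held in t1

    // A later, syntactically equal add(x, y) is found by structural
    // comparison and can be replaced by a reference to t1.
    return knownExpressions.count({"add", "x", "y"}) == 1 ? 0 : 1;
}
```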
Number of calls
What stands out is the sheer number of calls. Here are the numbers for AST visitation:
| Calls | Scope |
|---|---|
| 15070420 | VariableDeclaration (DFA) |
| 12153239 | Expression (CSE) |
| 2220705 | ExpressionStatement (DFA) |
| 1571439 | Block (DFA) |
| 1165662 | Assignment (DFA) |
| 819967 | If (DFA) |
| 455031 | FunctionDefinition (DFA) |
| 160753 | FunctionDefinition (CSE) |
| 55949 | Switch (DFA) |
| 54470 | ForLoop (DFA) |
And for the helpers of DataFlowAnalyzer:
| Calls | Scope |
|---|---|
| 32246804 | isSimpleLoad() |
| 19362962 | clearKnowledgeIfInvalidated(Expression) |
| 16564075 | handleAssignment() |
| 16085616 | isKeccak() |
| 14731783 | assignValue() |
| 3675605 | inScope() |
| 2371866 | clearValues() |
| 2026470 | pushScope() |
| 2026470 | popScope() |
| 986845 | joinKnowledge() |
| 913264 | isSimpleStore() |
| 183470 | joinKnowledgeHelper() |
| 156057 | valueOfIdentifier() |
| 46571 | memoryValue() |
| 19775 | storageValue() |
| 9341 | DataFlowAnalyzer() |
| 3006 | keccakValue() |
It looks like these components are just processing a lot of input. DFA is visiting e.g. 15 million variable declarations, but with 8722 invocations of DFA-based steps, that's just 1728 declarations per Yul object on average, which seems realistic for large contracts. Also, with 3281 runs of CSE, each run takes ~4 ms, which does not seem particularly slow. It all just adds up to large numbers over many runs.
I took a look at the overall structure of loops and recursion in the DFA implementation and did not notice anything with pathological complexity. There's linear progression through AST nodes, sometimes with calls to helpers that may loop over current variables and scopes. popScope() and clearValues() look the heaviest in that regard (with nested loops in the latter), but the loops perform simple operations. These helpers do take a non-negligible amount of time, but are still not the bottleneck. Far more time is spent elsewhere.
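For illustration, here's a sketch of the kind of nested loop I mean, with hypothetical names (not the real clearValues()): each cleared variable forces a scan over all tracked values, since any of them may reference it.

```cpp
#include <map>
#include <set>
#include <string>

using VarSet = std::set<std::string>;

void clearValues(std::map<std::string, VarSet>& _values, VarSet const& _cleared)
{
    std::erase_if(_values, [&](auto const& _entry)
    {
        auto const& [variable, referencedVariables] = _entry;
        if (_cleared.count(variable))
            return true; // the variable itself was reassigned
        for (std::string const& reference: referencedVariables) // inner loop
            if (_cleared.count(reference))
                return true; // its stored value refers to a cleared variable
        return false;
    });
}

int main()
{
    std::map<std::string, VarSet> values{{"a", {"x"}}, {"b", {"a"}}};
    clearValues(values, {"a"}); // drops a (reassigned) and b (references a)
    return values.empty() ? 0 : 1;
}
```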
My overall impression is that with so many calls small things just add up. The running time is spread over relatively simple operations and AST visitation. Speeding it up may actually require lots of tiny optimizations rather than looking for a big bottleneck. Tweaking the data structures might also be worthwhile, though we already did a lot of that in the past, so I'm assuming there's no low-hanging fruit there. Other than that, we could try to reduce the number of calls to these components or the size of input they're getting from other steps.
Conclusions
- I could not identify any obvious bottlenecks in CSE or DataFlowAnalyzer in general. The time seems to be spread over a lot of tiny operations instead.
- With things being repeated so many times, trying some micro-optimizations may actually be worthwhile here, especially in DFA's operators, where there seems to be a lot of "dark code" that's hard to profile in isolation but must be taking a large part of the total running time.
Note that these are just initial observations. I did not find anything spectacular, but given how slow these components are, it's definitely worth poking at them more, so I'm not done yet.