Performance of switch statement
I'm experimenting with a little VM to evaluate expressions that are only known at runtime.
Unfortunately, just the dispatching of instructions through a long switch statement takes about 50% of the total run time.
It would be great if the Dart VM could optimize switch statements on enums and tightly packed integers, instead of doing a linear scan. That's probably not a trivial thing to implement, so in the meantime, are there any alternative approaches with better performance?
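For concreteness, a dispatch loop of the kind described might look roughly like the following Dart sketch (purely illustrative; the opcodes and stack layout are invented, not the original poster's VM):

```dart
// Minimal illustrative bytecode interpreter: a dense switch over small
// integer opcodes, which is exactly the shape a jump table could optimize.
// Opcodes: 0 = push immediate, 1 = add, 2 = mul, 3 = halt.
int run(List<int> code) {
  final stack = <int>[];
  var pc = 0;
  while (pc < code.length) {
    final op = code[pc++];
    switch (op) {
      case 0: // push: the next word is an immediate operand
        stack.add(code[pc++]);
        break;
      case 1: // add
        stack.add(stack.removeLast() + stack.removeLast());
        break;
      case 2: // mul
        stack.add(stack.removeLast() * stack.removeLast());
        break;
      case 3: // halt
        return stack.removeLast();
      default:
        throw StateError('unknown opcode $op');
    }
  }
  return stack.isEmpty ? 0 : stack.removeLast();
}

void main() {
  // push 2, push 3, add, halt  =>  5
  print(run([0, 2, 0, 3, 1, 3]));
}
```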
You could try to use an array of functions as a replacement. It should be considerably faster than the linear search the switch uses.
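As a rough sketch of that suggestion (the handler signatures and names here are invented for illustration), the dispatch becomes a single indexed call instead of a case-by-case comparison:

```dart
// Dispatch through a pre-built list of handlers indexed by opcode, instead
// of a switch: one bounds check plus an indirect call, regardless of how
// many opcodes there are.
typedef Handler = void Function(List<int> stack);

final List<Handler> handlers = [
  (stack) => stack.add(stack.removeLast() + stack.removeLast()), // 0: add
  (stack) => stack.add(stack.removeLast() * stack.removeLast()), // 1: mul
  (stack) => stack.add(-stack.removeLast()),                     // 2: neg
];

void dispatch(int opcode, List<int> stack) => handlers[opcode](stack);
```

Note that the micro-benchmark posted further down in this thread suggests the indirect call per instruction can cost more than the comparisons it saves.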
It would be great if the Dart VM could optimize switch statements on enums and tightly packed integers, instead of doing a linear scan.
It's actually relatively straightforward to implement this; we have the necessary pieces. We could check at IL construction time whether the switch looks like case 0: ...; case 1: ...; ...; case N: ...; default: ..., with N large enough, and then build a slightly different graph using the IndirectGoto instruction we already have in the IL.
IIRC long ago we even had a prototype of this optimisation that never landed.
//cc @aam
dart2js converts a switch on enums to a switch on the enum index.
This enables the JavaScript VM to use a jump table if it wants to (example).
The VM could do something similar.
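Roughly, the lowering described above turns a switch on an enum value into a switch on its integer index (a hedged sketch; the Color enum and the exact shape of the dart2js output are illustrative only):

```dart
enum Color { red, green, blue }

// Source form: switch (c) { case Color.red: ...; ... }
// Index-based form, which a JavaScript engine can compile to a jump table:
String describe(Color c) {
  switch (c.index) {
    case 0:
      return 'red';
    case 1:
      return 'green';
    case 2:
      return 'blue';
  }
  return 'unreachable';
}
```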
Small switches are often better compiled to branch trees. The VM could do this too, and in the case of an enum, the index has a known range which should allow the elimination of some tests.
I've taken a stab at implementing binary search and jump table optimizations (https://dart-review.googlesource.com/c/sdk/+/253787). There are some heuristics around when to use which optimization, but generally binary search is used for optimizable switch statements with fewer than 16 case expressions and a jump table for the rest.
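For intuition only, the two strategies can be sketched in Dart roughly as follows; the real change builds them as flow-graph instructions, and all names below are invented:

```dart
// Binary search over sorted case constants (used for smaller switches).
T dispatchBinarySearch<T>(int value, List<int> sortedCaseConstants,
    List<T Function()> bodies, T Function() defaultBody) {
  var lo = 0, hi = sortedCaseConstants.length - 1;
  while (lo <= hi) {
    final mid = (lo + hi) >> 1;
    final c = sortedCaseConstants[mid];
    if (value == c) return bodies[mid]();
    if (value < c) {
      hi = mid - 1;
    } else {
      lo = mid + 1;
    }
  }
  return defaultBody();
}

// Jump table: a range check followed by a single indexed jump (used for
// larger, densely packed switches).
T dispatchJumpTable<T>(int value, int lowestCaseConstant,
    List<T Function()> table, T Function() defaultBody) {
  final index = value - lowestCaseConstant;
  if (index < 0 || index >= table.length) return defaultBody();
  return table[index]();
}
```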
I'm not all that familiar with the VM, so I could use some feedback. Specifically, there are two issues I wasn't able to fully resolve.
I need to check the type of the expression that is switched on before I can be sure that it has an integer representation.
This check could be omitted in the flow graph if the static type of that expression is known, but I wasn't able to get a type other than dynamic out of InferredTypeMetadata. Is this information available in the kernel?
For optimizing enum switches, I need to know the length of Enum.values. In JIT I can find the values field on the enum Class and use StaticConstFieldValue. In AOT this causes a crash. I can work around that by taking the count of the static fields of the enum Class and subtracting 2 (there is one static field per enum value, plus values and _deleted_enum_sentinel), but that seems brittle.
It could be hard to check the type of the expression when building the flow graph, as static types are not written into the kernel binary, and InferredTypeMetadata provides inferred types only for certain kinds of AST nodes (such as calls). The VM calculates types later, during the TypePropagation pass. When building the flow graph, we can probably figure out whether the switch should be optimized by looking at the case constants and checking whether they are either integers or instances of an enum class.
Accessing the values field of enums may crash in AOT mode because that field could be removed by the tree-shaker. It is also incorrect to look at the number of fields, as the recently added enhanced enum classes can declare extra fields in addition to the enum elements. However, we can probably just check the isExplicitlyExhaustive flag in SwitchStatement nodes (currently skipped here) to avoid generating a range check for enum switches that cover all cases. If the switch is not exhaustive, then we can generate the range check based on the maximum index of the enum constants used in the cases.
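To illustrate the exhaustiveness point (the Op enum below is invented, and this assumes a Dart version whose flow analysis treats the first switch as exhaustive):

```dart
enum Op { add, sub, mul, div }

// Exhaustive: every value of Op appears, so a table indexed by op.index
// would not need a range check.
int applyAll(Op op, int a, int b) {
  switch (op) {
    case Op.add:
      return a + b;
    case Op.sub:
      return a - b;
    case Op.mul:
      return a * b;
    case Op.div:
      return a ~/ b;
  }
}

// Non-exhaustive: a range check based on the highest case index used
// (Op.sub.index here) would still be needed before indexing a table.
int applySome(Op op, int a, int b) {
  switch (op) {
    case Op.add:
      return a + b;
    case Op.sub:
      return a - b;
    default:
      return 0;
  }
}
```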
Thanks @alexmarkov!
When building the flow graph, we can probably figure out if switch should be optimized by looking at the case constants and checking if they are either integers or instances of an enum class.
That's exactly what I'm doing. I now always emit a type check of the value that is passed into an optimized switch. As far as I understand it, the CallSpecializer eliminates the type check if possible, anyway.
The change is basically ready for review. What's missing is proper testing. For testing, I've been using a flag (--force-switch-dispatch-type) to override the heuristics for selecting between linear scan, binary search, and jump table. The SDK builds, and locally the tests in language/switch and co19/Language/Statements/Switch pass for all 3 dispatch types. But I'm not sure where to go from here to automate running these tests for each dispatch type. Any ideas how to approach this?
@blaugold Thank you for implementing this optimization!
You can add multiple // VMOptions=... lines to the header of a test (after the copyright and description) in order to run the test multiple times with different VM options. Something along the lines of:
// VMOptions=
// VMOptions=--force-switch-dispatch-type=0
// VMOptions=--force-switch-dispatch-type=1
// VMOptions=--force-switch-dispatch-type=2
You can do that for a few existing tests in tests/language/switch. Consider also adding tests which would trigger various cases of the switch heuristics without extra options. Tests which are specific to the VM can be added to runtime/tests/vm/dart.
Example of using // VMOptions:
https://github.com/dart-lang/sdk/blob/3344e064168f0e51a1f127aad823fbc6a66d6f72/runtime/tests/vm/dart/splay_ephemeron_test.dart#L13-L31
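Putting that together, a test that runs under all three dispatch types might look roughly like this (a sketch only; the file contents, constants, and expectations are invented, and the required copyright header is elided):

```dart
// (copyright header and test description go here)
//
// VMOptions=
// VMOptions=--force-switch-dispatch-type=0
// VMOptions=--force-switch-dispatch-type=1
// VMOptions=--force-switch-dispatch-type=2

import 'package:expect/expect.dart';

// A switch dense enough that the different dispatch strategies apply.
String classify(int x) {
  switch (x) {
    case 0:
      return 'zero';
    case 1:
      return 'one';
    case 2:
      return 'two';
    case 3:
      return 'three';
    default:
      return 'other';
  }
}

void main() {
  Expect.equals('zero', classify(0));
  Expect.equals('three', classify(3));
  Expect.equals('other', classify(42));
}
```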
Thanks again :) I've added tests to the CL. Can I choose tryjobs or does a maintainer have to do that? I'm also not sure which ones to select.
I have selected a few tryjob configurations and they should be running now.
@alexmarkov I have identified the issue that caused the CL to be reverted, and I have a fix for it. Are there specific steps for relanding the CL?
@blaugold You can click on the CREATE RELAND button (upper right corner) on the original CL - it will create a new CL with those changes. Then you can download it by running git cl patch <CL_number> on a fresh git branch, apply the fix (along with a regression test) and upload the changes for code review and testing.
Thanks for the contribution @blaugold! If you have any numbers - could you share how much it improved the performance of your code? :)
In my code, the improvements are ~7% for JIT and ~25% for AOT.
I have a micro-benchmark with the following results on a MacBook with an M1:
$ dart tool/run.dart
Dispatch benchmark: branches: 8, optimized: false
JIT
Dispatch(int switch)(RunTime): 6.830204132040703 us.
Dispatch(enum switch)(RunTime): 7.101797245506886 us.
Dispatch(if tree)(RunTime): 7.620905447736381 us.
Dispatch(function table)(RunTime): 27.840157848421516 us.
AOT
Dispatch(int switch)(RunTime): 5.645404620541914 us.
Dispatch(enum switch)(RunTime): 5.6308736459711195 us.
Dispatch(if tree)(RunTime): 4.402706 us.
Dispatch(function table)(RunTime): 11.859672561309754 us.
Dispatch benchmark: branches: 8, optimized: true
JIT
Dispatch(int switch)(RunTime): 6.776449252989181 us.
Dispatch(enum switch)(RunTime): 6.934528897309981 us.
Dispatch(if tree)(RunTime): 7.134259664350839 us.
Dispatch(function table)(RunTime): 27.83029919700803 us.
AOT
Dispatch(int switch)(RunTime): 4.39903 us.
Dispatch(enum switch)(RunTime): 4.378646 us.
Dispatch(if tree)(RunTime): 4.392776 us.
Dispatch(function table)(RunTime): 11.855344578621686 us.
Dispatch benchmark: branches: 16, optimized: false
JIT
Dispatch(int switch)(RunTime): 6.829736133093701 us.
Dispatch(enum switch)(RunTime): 7.674957906302617 us.
Dispatch(if tree)(RunTime): 7.037806905482737 us.
Dispatch(function table)(RunTime): 27.847624023759764 us.
AOT
Dispatch(int switch)(RunTime): 8.176496 us.
Dispatch(enum switch)(RunTime): 8.178208 us.
Dispatch(if tree)(RunTime): 5.0132725 us.
Dispatch(function table)(RunTime): 12.260699478601042 us.
Dispatch benchmark: branches: 16, optimized: true
JIT
Dispatch(int switch)(RunTime): 6.9298066579350195 us.
Dispatch(enum switch)(RunTime): 6.978652411607824 us.
Dispatch(if tree)(RunTime): 7.037869905325237 us.
Dispatch(function table)(RunTime): 27.866667583324165 us.
AOT
Dispatch(int switch)(RunTime): 5.0116475 us.
Dispatch(enum switch)(RunTime): 5.027215 us.
Dispatch(if tree)(RunTime): 5.0307754615306735 us.
Dispatch(function table)(RunTime): 12.328716666666667 us.
Dispatch benchmark: branches: 32, optimized: false
JIT
Dispatch(int switch)(RunTime): 8.60424998406256 us.
Dispatch(enum switch)(RunTime): 9.548627966959108 us.
Dispatch(if tree)(RunTime): 7.3036767408081475 us.
Dispatch(function table)(RunTime): 28.03792212077879 us.
AOT
Dispatch(int switch)(RunTime): 13.052454711499086 us.
Dispatch(enum switch)(RunTime): 13.003340981624602 us.
Dispatch(if tree)(RunTime): 5.468552917192221 us.
Dispatch(function table)(RunTime): 11.87404119277344 us.
Dispatch benchmark: branches: 32, optimized: true
JIT
Dispatch(int switch)(RunTime): 6.951462177972014 us.
Dispatch(enum switch)(RunTime): 7.189731025672436 us.
Dispatch(if tree)(RunTime): 7.259695850760373 us.
Dispatch(function table)(RunTime): 27.93467440325597 us.
AOT
Dispatch(int switch)(RunTime): 5.0250775 us.
Dispatch(enum switch)(RunTime): 5.00867 us.
Dispatch(if tree)(RunTime): 5.47583 us.
Dispatch(function table)(RunTime): 11.9983785089182 us.
Dispatch benchmark: branches: 64, optimized: false
JIT
Dispatch(int switch)(RunTime): 13.141367722477526 us.
Dispatch(enum switch)(RunTime): 13.614759465133076 us.
Dispatch(if tree)(RunTime): 7.660115424855719 us.
Dispatch(function table)(RunTime): 28.392949642302145 us.
AOT
Dispatch(int switch)(RunTime): 22.79647650440437 us.
Dispatch(enum switch)(RunTime): 22.65580717681986 us.
Dispatch(if tree)(RunTime): 5.805448840464529 us.
Dispatch(function table)(RunTime): 11.879428913140977 us.
Dispatch benchmark: branches: 64, optimized: true
JIT
Dispatch(int switch)(RunTime): 7.136219659450852 us.
Dispatch(enum switch)(RunTime): 7.289474666666667 us.
Dispatch(if tree)(RunTime): 7.6335804580244275 us.
Dispatch(function table)(RunTime): 28.152921082473505 us.
AOT
Dispatch(int switch)(RunTime): 4.994777756527804 us.
Dispatch(enum switch)(RunTime): 5.0108925 us.
Dispatch(if tree)(RunTime): 5.828400300299474 us.
Dispatch(function table)(RunTime): 12.004893223087272 us.
Unfortunately, the change had to be reverted again, but it's not immediately obvious to me what the issue is.
@blaugold The change was reverted because there are new failures on the bots which test hot reload:
https://dart-ci.appspot.com/log/vm-kernel-reload-linux-debug-x64/dartk-reload-linux-debug-x64/9528/language_2/switch/backward_jump_test/1
See https://github.com/dart-lang/sdk/issues/49789 for the full list of failures.
There are also crashes on vm-kernel-optcounter-threshold-linux-release-ia32 bot which look related:
https://dart-ci.appspot.com/log/vm-kernel-optcounter-threshold-linux-release-ia32/dartk-optcounter-linux-release-ia32/8018/standalone_2/http_launch_test/1
I'm not sure why this wasn't caught during pre-submit testing, as I added those vm-kernel-reload-* and vm-kernel-optcounter-threshold-* bots to your change.
@alexmarkov Yeah, I only saw the logs with the segfaults from the release builds, but in the debug builds and on vm-kernel-optcounter-threshold-linux-release-ia32, asserts are hit, which should be helpful.
I'm not sure why this wasn't caught during pre-submit testing, as I added those vm-kernel-reload-* and vm-kernel-optcounter-threshold-* bots to your change.
I also have not been able to replicate those failures locally yet.
I was able to reproduce the hot reload failure using
tools/test.py -n dartk-reload-linux-debug-x64 language_2/switch/fallthru_runtime_test
Created https://github.com/dart-lang/sdk/issues/49790 about the infra problem which prevented those failures from being caught earlier.
Perhaps worth clarifying that the landed PR implemented the optimization in AOT only, and creating a separate issue for extending it to JIT as well.