sdk icon indicating copy to clipboard operation
sdk copied to clipboard

Dart VM `@pragma("vm:prefer-inline")` vs manually inlining

Open jensjoha opened this issue 1 year ago • 4 comments

I've performed the following experiment on top of 7512050101f25530824db6200f4770ed8f083251.

In https://dart-review.googlesource.com/c/sdk/+/349861/ I have the two versions I'm testing:

  • Patchset # 1 where I've manually inlined declareMember and undeclareMember into the calls to .forEach.
  • Patchset # 2 where I've tried to force the VM to inline the calls instead, by calling declareMember and undeclareMember into the calls to .forEach, and marking these two as @pragma("vm:prefer-inline"). (e.g. the difference)

I forgot to include the file pkg/front_end/verify.dart in the CL, the content is this for both versions:

import "dart:io";

import "package:kernel/kernel.dart";
import "package:kernel/target/targets.dart";
import "package:kernel/verifier.dart";
import "package:vm/target/vm.dart";

void main(List<String> args) {
  if (args.length != 1) throw "Usage: dart <script> <dill>";
  File f = new File(args.single);
  Component component = loadComponentFromBytes(f.readAsBytesSync());
  Target target = new VmTarget(new TargetFlags());
  Stopwatch stopwatch = new Stopwatch()..start();
  verifyComponent(
    target,
    VerificationStage.afterModularTransformations,
    component,
  );
  print("Verified in ${stopwatch.elapsedMilliseconds} ms");
}

I then aot-snapshot compiled like this:

patchset # 1, manually inlining:

out/ReleaseX64/dart-sdk/bin/dart compile aot-snapshot --verbose --extra-gen-snapshot-options="--print_flow_graph_optimized,--dwarf-stack-traces,--code-comments,--write-code-comments-as-synthetic-source-to=/tmp/comments_manual.txt" pkg/front_end/verify.dart 2> /dev/null
mv pkg/front_end/verify.aot pkg/front_end/verify.aot.patchset1

patchset # 2, vm should inline

out/ReleaseX64/dart-sdk/bin/dart compile aot-snapshot --verbose --extra-gen-snapshot-options="--print_flow_graph_optimized,--dwarf-stack-traces,--code-comments,--write-code-comments-as-synthetic-source-to=/tmp/comments_vm.txt" pkg/front_end/verify.dart 2> /dev/null
mv pkg/front_end/verify.aot pkg/front_end/verify.aot.patchset2

And compared the number of instructions when running:

out/ReleaseX64/dart pkg/front_end/tool/benchmarker.dart --iterations=3 --snapshot=pkg/front_end/verify.aot.patchset1 --snapshot=pkg/front_end/verify.aot.patchset2 --arguments=out/ReleaseX64/vm_platform_strong.dill

giving this result:

[...]
instructions:u: 0.0844% +/- 0.0005% (747256.67 +/- 4682.73)

So wanting the VM to inline is 0.0844% slower (executes 0.0844% more instructions).

There might be better ways to do this, but what I did was that I used GDB to look at the disassembly:

gdb --silent -ex 'set pagination off' -ex "p 'VerifyingVisitor.visitComponent'" -ex "disassemble /s \$1" -ex "quit" pkg/front_end/verify.aot.patchset1 > disassemble_patchset1
gdb --silent -ex 'set pagination off' -ex "p 'VerifyingVisitor.visitComponent'" -ex "disassemble /s \$1" -ex "quit" pkg/front_end/verify.aot.patchset2 > disassemble_patchset2

Where the main differences (with my untrained eyes) are:

  • Stuff with "problem":
    • VM inlined has 5 calls each of _iso_stub_AllocateArrayStub, _StringBase._interpolate and VerifyingVisitor.problem.
    • Manually inlined has 2 of each.
  • Stuff with _iso_stub_AllocateContextStub and _iso_stub_AllocateClosureStub
    • VM inlined has 18 of each
    • Manually inlined has 1 _iso_stub_AllocateContextStub and 18 _iso_stub_AllocateClosureStub

So the way I'm reading it the VM somehow optimizes less when it does the inlining which is not what I would have expected. Is there a good reason for this or is something off?

(command lines for everything assuming we're in the sdk root, at 7512050101f25530824db6200f4770ed8f083251 and have a compiled sdk:

cat <<< 'import "dart:io";

import "package:kernel/kernel.dart";
import "package:kernel/target/targets.dart";
import "package:kernel/verifier.dart";
import "package:vm/target/vm.dart";

void main(List<String> args) {
  if (args.length != 1) throw "Usage: dart <script> <dill>";
  File f = new File(args.single);
  Component component = loadComponentFromBytes(f.readAsBytesSync());
  Target target = new VmTarget(new TargetFlags());
  Stopwatch stopwatch = new Stopwatch()..start();
  verifyComponent(
    target,
    VerificationStage.afterModularTransformations,
    component,
  );
  print("Verified in ${stopwatch.elapsedMilliseconds} ms");
}' > pkg/front_end/verify.dart

git fetch https://dart.googlesource.com/sdk refs/changes/61/349861/1 && git checkout FETCH_HEAD

out/ReleaseX64/dart-sdk/bin/dart compile aot-snapshot --verbose --extra-gen-snapshot-options="--print_flow_graph_optimized,--dwarf-stack-traces,--code-comments,--write-code-comments-as-synthetic-source-to=/tmp/comments_manual.txt" pkg/front_end/verify.dart 2> /dev/null
mv pkg/front_end/verify.aot pkg/front_end/verify.aot.patchset1

git fetch https://dart.googlesource.com/sdk refs/changes/61/349861/2 && git checkout FETCH_HEAD

out/ReleaseX64/dart-sdk/bin/dart compile aot-snapshot --verbose --extra-gen-snapshot-options="--print_flow_graph_optimized,--dwarf-stack-traces,--code-comments,--write-code-comments-as-synthetic-source-to=/tmp/comments_vm.txt" pkg/front_end/verify.dart 2> /dev/null
mv pkg/front_end/verify.aot pkg/front_end/verify.aot.patchset2

out/ReleaseX64/dart pkg/front_end/tool/benchmarker.dart --iterations=3 --snapshot=pkg/front_end/verify.aot.patchset1 --snapshot=pkg/front_end/verify.aot.patchset2 --arguments=out/ReleaseX64/vm_platform_strong.dill

gdb --silent -ex 'set pagination off' -ex "p 'VerifyingVisitor.visitComponent'" -ex "disassemble /s \$1" -ex "quit" pkg/front_end/verify.aot.patchset1 > disassemble_patchset1
gdb --silent -ex 'set pagination off' -ex "p 'VerifyingVisitor.visitComponent'" -ex "disassemble /s \$1" -ex "quit" pkg/front_end/verify.aot.patchset2 > disassemble_patchset2

grep "_iso_stub_AllocateArrayStub" disassemble_patchset1 | wc -l
grep "_iso_stub_AllocateArrayStub" disassemble_patchset2 | wc -l
grep "_StringBase._interpolate" disassemble_patchset1 | wc -l
grep "_StringBase._interpolate" disassemble_patchset2 | wc -l
grep "VerifyingVisitor.problem" disassemble_patchset1 | wc -l
grep "VerifyingVisitor.problem" disassemble_patchset2 | wc -l
grep "_iso_stub_AllocateContextStub" disassemble_patchset1 | wc -l
grep "_iso_stub_AllocateContextStub" disassemble_patchset2 | wc -l
grep "_iso_stub_AllocateClosureStub" disassemble_patchset1 | wc -l
grep "_iso_stub_AllocateClosureStub" disassemble_patchset2 | wc -l

)

jensjoha avatar Feb 02 '24 08:02 jensjoha

The code is awful in both cases. A lot of iterator code is not inlined.

This is not a comparison of VM inlining vs manual inlining. This is actually a comparison of explicit closures passed to forEach vs tear-offs created in a loop. The big difference between them is that explicit closures can share the same context which is allocated once, but each tear-off has a separate context which is allocated multiple times in a loop (unless it is optimized out). Also note that @pragma("vm:prefer-inline") doesn't affect taking a tear-off (which calls method extractor), it only affects calls of the annotated method.

I can see that in the patchset 2, even when tear-offs of methods declareMember and undeclareMember are inlined, closure and context allocations are not eliminated for some reason even though they are not used. I'm going to create a CL to fix this, but this is only a half of the problem.

When forEach is called for members of classes (in the for (Class class_ in library.classes) loops), forEach is not inlined as lists are polymorphic (can be either _GrowableList or DirtifyingList). Without inlining forEach, closure arguments of forEach have to be allocated, which makes tear-offs (PS2) slower compared to explicit closures (PS1) once again due to extra context allocations.

Maybe we could represent taking a tear-off as a high-level instruction in IL and make sure we can CSE and LICM these instructions, so compiler could automatically deduplicate and hoist taking tear-offs out of the loops. This would probably make them more efficient compared to explicit closures in this example. And of course we should be able to eliminate closure and context allocations if the are not used.

alexmarkov avatar Feb 05 '24 18:02 alexmarkov

Unused context and closure allocations are not removed in this code because of the try / finally: allocation sinking optimization which removes unused allocations is disabled if there is a try block in a function. Related issue: https://github.com/dart-lang/sdk/issues/43234.

alexmarkov avatar Feb 06 '24 16:02 alexmarkov

Would it make a difference if we made it

  @override
  void visitComponent(Component component) {
    try {
      tryPart();
    } finally {
      finallyPart();
    }
  }

  void tryPart() { ... }
  void finallyPart() { ... }

?

jensjoha avatar Feb 07 '24 07:02 jensjoha

@jensjoha Avoiding try/finally in this case would eliminate a few context and closure allocations (you also need to mark extracted methods with @pragma('vm:never-inline') to make sure they are not inlined back), but this is only a part of the difference between these versions of the code.

I started working on optimizing tear-offs in the VM: 1) added a benchmark which simulates repeated usage of tear-offs somewhat similar to your code (e7eb0aaa20c7117085f0f875ca6e9467d7e10b0c); 2) landed the change to remove context allocations from all tear-offs (a261196ea78ec1b4b5ea4a55ea5bc13a5e792e27). Now taking a tear-off should be at least as efficient as creating an explicit closure. I still consider implementing de-duplication of tear-offs (CSE/LICM), but that might require some clarifications in the language spec (see https://github.com/dart-lang/language/issues/3612 for details).

After landing a261196ea78ec1b4b5ea4a55ea5bc13a5e792e27, I still see the performance difference between the code using tear-offs and explicit closures on your example. While number of closure and context allocations is now the same, the remaining problem is that explicit closures work on Field and Procedure objects, while declareMember() works on Member. They calls transformerFlags getters and setters which have 2 implementations - declared in Member and overridden in Procedure. So declareMember needs to account for this polymorphism by performing class id test and branching between these implementations at each use of transformerFlags, but each explicit closure works on exactly one type of a member, so they inline one implementation and don't have class id tests. I don't think VM can optimize out this subtle difference - it's just the tear-offs version is written in more polymorphic way.

@jensjoha If you're actively working on improving performance of this code, I would suggest you to try to remove polymorphism around transformerFlags (can we get rid of Procedure.transformerFlags overrides?) and also change Class.procedures, Class.fields and Class.constructors to return monomorphic lists (can we get rid of DirtifyingList?). That could have positive impact on other places which use transformerFlags and iterate through members of classes.

alexmarkov avatar Feb 13 '24 21:02 alexmarkov

8e7137ea818ed85fc8fd19ad7eb495223dd75278 concludes the work on optimizing tear-offs identified in this issue.

In order to work around polymorphic transformerFlags in declareMember/undeclareMember, I added specialized declareField/declareConstructor/declareProcedure and undeclareField/undeclareConstructor/undeclareProcedure to the patchset 2:

diff --git a/pkg/kernel/lib/verifier.dart b/pkg/kernel/lib/verifier.dart
index 802c54405e2..61ac146bfb6 100644
--- a/pkg/kernel/lib/verifier.dart
+++ b/pkg/kernel/lib/verifier.dart
@@ -292,11 +292,27 @@ class VerifyingVisitor extends RecursiveResultVisitor<void> {
     member.transformerFlags |= TransformerFlag.seenByVerifier;
   }
 
+  @pragma("vm:prefer-inline")
+  void declareField(Field field) => declareMember(field);
+  @pragma("vm:prefer-inline")
+  void declareConstructor(Constructor constructor) =>
+      declareMember(constructor);
+  @pragma("vm:prefer-inline")
+  void declareProcedure(Procedure procedure) => declareMember(procedure);
+
   @pragma("vm:prefer-inline")
   void undeclareMember(Member member) {
     member.transformerFlags &= ~TransformerFlag.seenByVerifier;
   }
 
+  @pragma("vm:prefer-inline")
+  void undeclareField(Field field) => undeclareMember(field);
+  @pragma("vm:prefer-inline")
+  void undeclareConstructor(Constructor constructor) =>
+      undeclareMember(constructor);
+  @pragma("vm:prefer-inline")
+  void undeclareProcedure(Procedure procedure) => undeclareMember(procedure);
+
   void declareVariable(VariableDeclaration variable) {
     if (variableDeclarationsInScope.contains(variable)) {
       problem(variable, "Variable '$variable' declared more than once.");
@@ -374,32 +390,32 @@ class VerifyingVisitor extends RecursiveResultVisitor<void> {
           }
         }
 
-        library.fields.forEach(declareMember);
-        library.procedures.forEach(declareMember);
+        library.fields.forEach(declareField);
+        library.procedures.forEach(declareProcedure);
         for (Class class_ in library.classes) {
-          class_.fields.forEach(declareMember);
-          class_.procedures.forEach(declareMember);
-          class_.constructors.forEach(declareMember);
+          class_.fields.forEach(declareField);
+          class_.procedures.forEach(declareProcedure);
+          class_.constructors.forEach(declareConstructor);
         }
         for (ExtensionTypeDeclaration extensionTypeDeclaration
             in library.extensionTypeDeclarations) {
-          extensionTypeDeclaration.procedures.forEach(declareMember);
+          extensionTypeDeclaration.procedures.forEach(declareProcedure);
         }
       }
       visitChildren(component);
     } finally {
       for (Library library in component.libraries) {
-        library.fields.forEach(undeclareMember);
-        library.procedures.forEach(undeclareMember);
+        library.fields.forEach(undeclareField);
+        library.procedures.forEach(undeclareProcedure);
         for (Class class_ in library.classes) {
-          class_.fields.forEach(undeclareMember);
-          class_.procedures.forEach(undeclareMember);
-          class_.constructors.forEach(undeclareMember);
+          class_.fields.forEach(undeclareField);
+          class_.procedures.forEach(undeclareProcedure);
+          class_.constructors.forEach(undeclareConstructor);
         }
 
         for (ExtensionTypeDeclaration extensionTypeDeclaration
             in library.extensionTypeDeclarations) {
-          extensionTypeDeclaration.procedures.forEach(undeclareMember);
+          extensionTypeDeclaration.procedures.forEach(undeclareProcedure);
         }
       }
       variableStack.forEach(undeclareVariable);

This variant has smaller number of instructions retired than patchset 1.

I'm not really suggesting to use this variant of code: as I mentioned above, a better approach would be to make transformerFlags monomorphic and make Class.fields/procedures/constructors return a single list type to make .forEach monomorphic. This would make all code using transformerFlags and Class.fields/procedures/constructors more efficient and this artificial specialization of declareMember/undeclareMember won't be needed.

alexmarkov avatar Mar 14 '24 15:03 alexmarkov