Performance cost of "unnecessary" casts
We recently discussed the question of how expensive "unnecessary" casts are (i.e. casts that a Wasm engine has to perform because the type system can't express that they are unnecessary, even though the module producer knows it statically), and I promised to do some measurements.
TL;DR: the results are somewhat inconclusive, I can both prove that casts are nothing to worry about, and that it is absolutely crucial to avoid them.
I used a test that's very similar to the test @RossTate used to estimate the cost of such casts in the JVM. In pseudo-code (because the original is expressed in rather unwieldy C++ macros found in V8's internals):
struct $Super { i32 $x }
struct $Sub { i32 $x, i32 $y }
array $List { eqref }
global $List global_list
function Prepare() {
  global_list = new $List(1024);
  for (i32 i = 0; i < 1024; i++) {
    global_list[i] = new $Super(i);
    i++;
    global_list[i] = new $Sub(i, 0);
  }
}

function Main() {
  $List local_list = global_list;
  i32 sum = 0;
  for (i32 i = 0; i < 1_000_000_000; i++) {
    sum += ref.cast<$Super>(local_list[i % 1024]).$x;
  }
  return sum;
}
I've measured three configurations:
[proto]: Current V8 prototype. Type checks are implemented via a call there, which was convenient to implement but is of course not the fastest approach.
[inline]: Replaces the call with an inline instruction sequence in the function's optimized code (in a local patch, not yet landed). Notably, the type check sequence is still a loop that walks the inheritance chain (a rough sketch of such a check is included below); we're not yet making use of the static RTT depth information to implement constant-time subtyping checks. This small benchmark has very short inheritance depths (from the actual object's RTT to the $Super RTT it's either zero or one steps), so if anything I'd expect constant-time checks to be slightly slower in this particular test.
[nochecks]: I've commented out the code responsible for emitting any checks whatsoever (including the i31/null checks) as part of the cast (i.e. turned the ref.cast instruction into a no-op). This is obviously a spec violation as things stand, and is only useful for determining what the performance would theoretically be if we had a type system that's powerful enough to let us skip such checks safely.
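To make the "[inline]" chain walk concrete, here is a rough sketch of that kind of check in C++. This is an illustration only, not V8's actual code, and the RTT layout (just a parent pointer per RTT) is an assumption made purely for the example:

struct Rtt {
  const Rtt* parent;  // immediate supertype's RTT, or nullptr at the root
};

// Linear-time ("chain walk") subtype check: follow parent links from the
// object's actual RTT until we either hit the requested RTT or run out of
// supertypes. When the object's RTT is exactly the requested RTT, this is
// a single comparison.
bool IsSubtypeChainWalk(const Rtt* actual, const Rtt* requested) {
  for (const Rtt* rtt = actual; rtt != nullptr; rtt = rtt->parent) {
    if (rtt == requested) return true;
  }
  return false;
}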
Results, timing only the invocation of the function Main:
[proto]: 5.3 seconds
[inline]: 5.2 seconds
[nochecks]: 5.2 seconds
~2% cost with inefficient checks, no measurable difference with efficient checks, yay! I've verified that the generated assembly is what I would expect, and I've verified that the computed result matches expectations too. But clearly this is a bit too good to be true...
I decided to try to amplify the difference by optimizing the rest of the test. The first thing that stands out is the division i % 1024. The way we currently use our compiler backend for i32.rem_u, that generates an actual div instruction. Replacing that computation with the equivalent i & 1023 gives the following results:
[proto]: 4 seconds
[inline]: 2 seconds
[nochecks]: 1 second
So the preliminary conclusion is: the type checks done by "unnecessary" casts can have significant cost (easily 2x) in tight loops, but if the loop is doing any additional nontrivial work, modern CPUs have rather amazing abilities to hide that cost, sometimes to the point of not being measurable any more.
I'd be happy to test additional scenarios if someone has suggestions for what might be representative/insightful.
For comparison, I've also measured the JavaScript equivalent:
"%" version: 0.97s (smart enough to emit an &1023
instruction!)
"&" version: 0.88s
While it is disheartening that this handily beats even the "nochecks" Wasm result above, keep in mind that this is an apples-to-oranges comparison for several reasons:
(1) V8's JavaScript code is far more thoroughly optimized than our WasmGC prototype. We haven't even started performance work on the latter yet, so surely there is some potential for future improvements.
(2) This small test is a scenario that JavaScript has no trouble with. WasmGC's true appeal as an alternative compilation target lies in situations where JavaScript is struggling (and falls off the fast path).
(3) The WasmGC MVP proposal itself is not optimized for performance yet; I expect that we'll see efficiency-improving spec additions/tweaks in the future.
It's interesting that the delta between the two [proto] tests is 1.3 seconds, but the delta between the two [nochecks] tests is 4.2 seconds. Did the optimised [nochecks] benchmark get auto-vectorised or is it really all down to CPU pipelining?
I've seen many times in discussions (especially from those pushing structural typing more than nominal typing) that predictable performance across underlying HW is a goal.
How much is the statement "modern CPUs have rather amazing abilities to hide that cost, sometimes to the point of not being measurable any more" relevant in this scenario?
If it is quite irrelevant, then I'd argue that favouring a type system that allows eliminating more casts is undoubtedly the way to go, as I find the benchmarked difference rather huge.
@conrad-watt : V8 doesn't do auto-vectorization. AFAICT it's down to pipelining.
@dumblob : "predictable performance" in the literal sense is a myth for this and many other reasons. We shouldn't let that distract us. (What matters is "reliably non-bad performance". How fast exactly something is doesn't matter, as long as it doesn't unpredictably and uncontrollably get pathologically slow. Let's keep that conversation separate though.)
Also, minor update that doesn't change anything: it turns out we're about to turn on the % -> & optimization (and a few others) by default. However, the fact that it is even applicable is an artifact of how exactly the test was written, which is why that doesn't change the overall picture here. Ross' original JVM test used an array length of 1000, where that optimization is not applicable. (I changed it to 1024 precisely so that I could do the & trick by hand.)
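(For what it's worth, the reason the rewrite only applies to lengths like 1024: for non-negative i and a power-of-two modulus m, i % m equals i & (m - 1), because the low bits are exactly the remainder; with m = 1000 there is no single-AND equivalent. A minimal C++ check of the equivalence, purely for illustration:)

#include <cassert>
#include <cstdint>

int main() {
  // For unsigned i and a power-of-two modulus, the remainder is just the
  // low bits, so the modulo can be replaced by a bitwise AND.
  for (uint32_t i = 0; i < 100000; i++) {
    assert(i % 1024 == (i & 1023));
  }
  return 0;
}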
@jakobkummerow how easy would it be to try an array sort example, as @RossTate did in his presentation?
EDIT: one potential complication is that the proposed type system mechanism for eliminating casts when accessing an array still (in general) requires a dynamic check when assigning to the array. The Java benchmark already captures this.
Thanks for putting this together, @jakobkummerow!
Ross' original JVM test used an array length of 1000, where that optimization is not applicable.
There is another optimization that might have a similar effect. The JVM might be eliminating the % by applying loop-induction-variable strength reduction. That is, the following
i32 sum = 0;
for (i32 i = 0; i < 1_000_000_000; i++) {
  sum += ref.cast<$Super>(local_list[i % 1024]).$x;
}
can be transformed into
i32 sum = 0;
i32 livsr = 0;
for (i32 i = 0; i < 1_000_000_000; i++) {
  sum += ref.cast<$Super>(local_list[livsr]).$x;
  livsr++;
  if (livsr >= 1024)
    livsr -= 1024;
}
Unlike the & trick, this works regardless of what the (positive) modulus is. If I had been thinking, this is probably what I should have done in the first place.
So the preliminary conclusion is: the type checks done by "unnecessary" casts can have significant cost (easily 2x) in tight loops, but if the loop is doing any additional nontrivial work, modern CPUs have rather amazing abilities to hide that cost, sometimes to the point of not being measurable any more.
The numbers are indeed surprising. I was trying to get a measurement of the cost of a cast relative to an array access, but it seems something I forgot to anticipate is the potential of on-processor parallelism. In the div version, notably, the div is so slow that the array access and cast are not even on the critical path. So it's not just that the loop is doing nontrivial work; the loop is bottlenecked by non-trivial work on a path separate from the cast, and so they are not even being measured.
From the numbers, it seems that the cast at least doubles the cost of an array access. As with any overhead to an operation, how that affects the performance of the overall program will depend on how the operation is used in the program. For the div case, for at least your CPU, the operation is not on the critical path and so the overhead has no effect on performance. For the & case, the operation is on the critical path and so the overhead halves the performance. However, even then the processor might be using parallelism to stagger array accesses and casts, so this still might not be an accurate measurement of the individual operation's overhead.
Trying to take parallelism into account, I wonder if there is a good way to make sure the array access would be the entirety of the critical path so that we can get an accurate measure of its (local) overhead. The following program comes to mind:
function Prepare() {
  global_list = new $List(1009);
  int index = 859;
  for (i32 i = 0; i < 1024; ) {
    global_list[i] = new $Super(index);
    i++;
    index = (index * 2269) % 1009;
    global_list[i] = new $Sub(index, 0);
    i++;
    index = (index * 2269) % 1009;
  }
}

function Main() {
  $List local_list = global_list;
  i32 index = 0;
  for (i32 i = 0; i < 1_000_000_000; i++) {
    index = ref.cast<$Super>(local_list[index]).$x;
  }
  return index;
}
This should make it so that the critical path is just the array access and cast and the non-critical path is the counting code. Hopefully that would let us measure the overhead for just the specific operation, from which one could then compute how it affects the rest of the program based on how the operation is used.
As a slight variant to see what happens when there's non-trivial work on the same path as the array access and cast, one could take the % 1009 out of Prepare and instead use local_list[index % 1009] in Main.
It's generally a bad idea to use division/modulus in a microbenchmark. The performance of the operation itself is highly variable and typical microarchitectures generally don't have more than one execution unit dedicated to division. It's a wild goose chase to speculate about what optimizations an implementation (e.g. compiler) might or might not do around division, especially division by a constant.
General microbenchmarking caveats apply, including but not limited to loop overhead, alignment, LSD (loop stream detector) and other frontend effects, etc.
As for the original benchmark, I think the known-depth optimization is critical, so I'd be most interested to see V8's performance in that case.
how easy would it be to try an array sort example?
Bubblesort:
[proto]: 5.2s
[inline]: 3.1s
[nochecks]: 2.0s
Benchmark details (types as before):
kListLength = 1<<15; // 32768
kPrime = 977;

function Prepare() { // Untimed.
  global_list = new $List(kListLength);
  int j = kListLength / 2;
  for (int i = 0; i < kListLength; ) {
    global_list[i] = new $Super(j);
    i++;
    j = (j + kPrime) & (kListLength - 1);
    global_list[i] = new $Sub(j, 0);
    i++;
    j = (j + kPrime) & (kListLength - 1);
  }
}

function Main() { // Timed.
  $List local_list = global_list;
  eqref tmp;
  for (int i = kListLength - 1; i > 0; i--) {
    for (int j = 0; j < i; j++) {
      if (ref.cast<$Super>(local_list[j]).$x > ref.cast<$Super>(local_list[j+1]).$x) {
        tmp = local_list[j];
        local_list[j] = local_list[j+1];
        local_list[j+1] = tmp;
      }
    }
  }
}
(The local tmp that's used for swapping two array elements could be optimized away if one manipulated the value stack directly, but I've used convenience macros that don't easily allow that. I don't think it affects the generated code.)
Regarding div and ways to avoid it: I've found the use of a div instruction in my first version a useful accident: it gives us an idea of what we might see in a loop that does expensive additional work. I've included the & version so we also have the other data point; I don't think we'd learn much from other variations?
I agree that the known-depth optimization would be interesting, but I'll have to implement it before I can measure it. (As stated before, in the particular tests run so far, I expect it to be slower, because while it is O(1), it will have to perform a few more instructions than we're currently getting away with: the current O(n) code only needs a single cmp when it gets lucky, which is the case for 50% of the elements tested here.)
More results to follow.
I've found some time to implement constant-time subtyping checks, using the statically known RTT depth information. To my surprise, they are slightly faster than the linear-time checks even when the latter get quite lucky:
| type checks | array sum | array sum (deeper) | array walk | array sort |
|---|---|---|---|---|
| chain walk | 1.9s | 2.7s | 2.5s | 3.1s |
| known depth | 1.8s | 1.8s | 2.5s | 3.0s |
| no checks | 1.0s | 1.0s | 2.5s | 2.0s |
| JavaScript | 1.0s / 0.9s | 1.0s / 0.9s | 2.5s | 1.5s / 1.2s |
Legend:
- "chain walk" checks are what's called "[inline]" in my earlier posts above. I haven't measured the earlier "[proto]" version again, as it's outdated and as such didn't seem relevant.
- "known depth" indicates the new constant-time checks.
- "JavaScript" is, of course, not a type check implementation variant, but the respective test's equivalent as a JavaScript program. The two given results are for the first run and any subsequent runs, respectively.
- "array sum" is the same test as before, the "&" variant from the very first post.
- "array sum (deeper)" is the same, but with two more RTTs inserted in between, i.e. the RTT inheritance chain is Sub → Mid2 → Mid1 → Super. The array is still filled with alternating elements of types Sub and Super.
- "array walk" is approximately the test Ross suggested above, based on the
index = array[index].x
idea. (I've fixed up the "Prepare" function to make sure all indices are emitted exactly once.) - "array sort" is the same bubblesort test as before.
Comments:
- it makes sense that "chain walk" checks get slower for "array sum (deeper)" compared to "array sum", whereas "known depth" a.k.a. constant-time checks don't. JavaScript doesn't care either (it wouldn't even care if there was no subtyping involved at all). It goes without saying that for "no checks", the depth also makes no difference.
- "array walk", besides putting the type check on the critical path, also puts the load onto the critical path, which appears to have all sorts of side effects on performance. (The fact that it visits the array's elements in random order doesn't matter.)
- I've actually re-run all tests for this post. Pleasingly, the results for configurations that aren't new match the outcomes in my earlier posts :-)
- "know depth" checks being faster than "chain walk" might be partially explained by the fact that my current implementation of the constant-time checks also special-cases the case where the object's RTT is exactly the requested RTT; the super-RTT list is only loaded otherwise -- still, I wouldn't have expected that list load, bounds check, element load, and comparison to end up being faster than the looping implementation for shallow inheritence depths.
It's still hard to draw any conclusions. For some of these microbenchmarks, getting a nearly 2x performance increase seems like pretty good motivation for accepting a significantly more complex type system that can avoid these unnecessary checks. On the other hand, we also keep seeing examples where other instructions (whether it's a division or a load) manage to completely hide the cost of the type checks. Most real-world code will likely fall somewhere in between these extremes.
Very interesting! Thanks for all the effort, @jakobkummerow!
Closing this since the design is now stable, so we won't be redesigning anything due to benchmarks. That being said, performance work is ongoing in both tools and engines.