gc Permanent Slow Casting

Permanent Slow Casting

Open RossTate opened this issue 3 years ago • 16 comments

A consequence of the issues raised in #116 is that, unless we somehow manage to solve a longstanding open research problem, we will never be able to eliminate "superfluous" casts even in extensions to the current MVP. For example, despite significant effort, no research team has been able to extend the current MVP's type system to guarantee that a String[] in Java contains only String values, and as such a Java compiler to even future versions of WebAssembly is likely to always have to compile an access to a String[] with a load followed by a cast to String. Similarly, every virtual/interface method in Java/C#/Kotlin/Scala will have to cast the this pointer, and early every closure in OCaml/Haskell will have to cast its environment and inputs and will often have its output cast after the call returns.

A consequence of the issues raised in #118 is that we will likely never be able to add more efficient/specialized casting mechanisms to the current MVP beyond ref.cast_exact as suggested by @tebbi here.

While it's possible these issues will be addressed, they have been open issues (offline) for multiple years, and there's enough evidence that they're insurmountable that we should evaluate the current MVP with the expectation that these issues won't be resolved.

The obvious question then is what is the overhead of these casts that we can't expect to get rid of? Part of that question is hard to answer because it depends on how many superfluous casts there are relative to other computation, and that will vary by benchmark. But part of that question is easier to answer because the casting mechanism in the current MVP is the same as that for the JVM, and the JVM has already been heavily optimized for this casting mechanism. So we can at least get a sense of how much superfluous casts cost individually through a microbenchmark.

For that microbenchmark, I considered the following Java program:

public static int sum(Super[] supers) {
    int sum = 0;
    for (int i = 0; i < max; i++)
        sum += supers[i % 1000].x;
    return sum;
}

where Super has a field int x and two subclasses Sub1 and Sub2 (with no additional fields). (And where supers is an array of 1000 distinct values, and where max is Integer.MAX_VALUE.)

Although the JVM can guarantee that supers contains only Super values and as such the access to x needs no cast, the current proposal's type system cannot guarantee this and so the generated WebAssembly needs to insert a superfluous cast to Super (using its corresponding rtt) in order to type-check. To measure this overhead, I changed the array to an Object[] and inserted a cast to Super and compared the run times (across multiple iterations with warm up and so on). The inserted cast makes the program run 48% slower (with std. dev. of 1%).

Now if Super were instead final, i.e. was guaranteed to have no subclasses, this cast can be optimized to just check that each object's rtt array is the same as that for Super, rather than checking a particular element in this array. This optimization could be added to the MVP via a ref.cast_exact instruction. The JVM indeed supports this optimization. When I cast instead to a final class, the program ran 26% slower (with std. dev. of 1%).

Then, in the spirit of #109, I encoded Super as an int[] (the JVM does not support butterflies) and simply accessed the field as an array element (which the generator would know exists, though the WebAssembly engine will have to perform an array-bounds check). This version of the program ran only 16% slower (with std. dev. of 1%).

I could not think of a way of measuring pointer-tagging casting mechanisms on the JVM.

Although this is not a direct measurement of casting times, it is enough to conclude that an rtt.cast with a statically known rtt likely takes at least 3x the time an assert_bounds 0 1 takes, and likely takes at least 2x the time an rtt.cast_exact takes. Thus the current MVP is likely permanently sticking WebAssembly with the slowest-by-far casting mechanism, one which even the optimized variant of is still substantially slower than an array-bounds check.

The microbenchmark I used can be found here. I ran this on my laptop a few times, after closing foreground and background processes and such. Benchmarking on the JVM is difficult, and this certainly is an imperfect experimental setup, but it did at least give consistent results across runs.

Aug 19 '20 02:08 RossTate

For example, despite significant effort, no research team has been able to extend the current MVP's type system to guarantee that a String[] in Java contains only String values, and as such a Java compiler to even future versions of WebAssembly is likely to always have to compile an access to a String[] with a load followed by a cast to String.

I think it could work as follows: Introduce "readonly" in addition to "var" and "const" mutability indicators in the structural type system. Covariant subtyping for readonly arrays is allowed. Pass Java arrays as readonly arrays of the current static Java type. Covariant subtyping works as expected. Reads to these arrays are for free, without any cast not necessary in Java. For writing, add something like a virtual method to the array that casts the value to be written to the true array element type and then writes the value. This corresponds to the dynamic type check that JVM implementations have to do too.

Aug 19 '20 10:08 tebbi

@tebbi Unfortunately there are a number of assumptions in your suggestion that start to get why this is such a hard problem despite seeming straightforward.

For writing, add something like a virtual method to the array that casts the value to be written to the true array element type and then writes the value.

Obviously this incurs overhead, but lets put that aside. Type-checking a virtual method (or an interface method or a closure call) at a low level like WebAssembly's is known to require existential types or something similar like type fields.

A virtual-method call works by loading the method from the object's fields (or v-table), and then passing that object as the first argument to the method (i.e. as the value of the "this" pointer) along with the surface-level arguments. The issue is that a program using a low-level instruction set like WebAssembly's could pass some other value as the "this" pointer. In your example, a simple type system can only ensure that that value is some array, but not necessarily the array that the virtual method was meant to be paired with, and in particular not necessarily with the same "true array element type". That's why any simply typed MVP will necessarily have to superfluously cast the "this" pointer in all virtual/interface-method implementations - the type system cannot guarantee that the argument for "this" actually has the expected type. Existential types and type members fix this by letting values state that they have some unknown "exact" type, and by letting function pointers stored in objects demand that the "this" pointer have that same unknown "exact" type.

Covariant subtyping works as expected.

Here you're assuming that the Post-MVP's encoding of a type like java.lang.String is a subtype of the Post-MVP's encoding of a type like java.lang.Object. But for the reasons above, the PostMVP's encoding of a class must have bounded a type member, and #116 points out that this extension makes the Post-MVP's subtyping system undecidable. A standard cannot have an undecidable type system because there would be no way for browser's to type check it consistently. So this problem is solved by weakening subtyping and having instructions essentially proving how to coerce from the encoding of java.lang.String to the encoding of java.lang.Object. But these are instructions, not subtyping, and so covariance won't work as you expect here.

Hopefully I managed to illustrate part of way this is so challenging, though be aware that this is still just the tip of the iceberg. That said, there are known techniques for addressing these problems in a way that is both easy to generate and easy to validate, but their first step is to switch to a more tractable nominal system as suggested in #119. And no, the utility of that suggestion does not apply to only surface languages with nominal type systems; the nominality describes the data structures used by the compiler/runtime more so than those used by the surface language, and as such applies to a wide variety of languages.

Aug 19 '20 14:08 RossTate

And if Java (or any other language) were to be compiled to linear memory instead, it could use the full knowledge of any invariants its type system knows about, and thus omit unnecessary casts or range checks entirely (using only the memory range check, which is close to "free" on wasm32 in browser engines currently).

Just needs some stack inspection, and host interop features instead.

Aug 20 '20 17:08 aardappel

Nice test program to show the problem.

Aug 20 '20 20:08 Horcrux7

@aardappel Theoretical such a linear memory access should be very fast. But a self written GC will not be very effizient. And such GC can not run on a different thread like any modern GC. There will breaks. The breaks will be larger if the memory grow.

Aug 20 '20 20:08 Horcrux7

But a self written GC will not be very effizient.

Why not? It can be custom to the language, so it has many ways in which it can be better than a generic one. It just needs a stack inspection feature to be competitive, like I mentioned.

And such GC can not run on a different thread like any modern GC.

Why not? Anything about the threads proposal that would not work for GC?

Aug 20 '20 21:08 aardappel

@aardappel @Horcrux7 The last two comments seem to be on a separate topic. I think understanding casting performance is one of many items that will help inform a discussion on that topic, and I would like to focus on casting here so that we can first examine that item in more depth.

Aug 20 '20 22:08 RossTate

Been mulling the micro-benchmark re type casting. The biggest weakness is drawing a comparison between Java's casting machinery and a putatitive WASM cast. I know that Java's type cast can (does?) involve a search down a list. But, one could imagine crafting a C-style code fragment that includes an operation that would count as a WASM type cast:

typedef struct {
  int rtt;
  int x;
} Super;

int sum(Super[] supers,int rtt) {
    int sum = 0;
    for (int i = 0; i < max; i++){
        Super *el = &supers[i%10000];
        assert(el->rtt==rtt); // replace == with whatever WASM cast involves, e.g., sub-type check.
        sum += el->x;
   }
    return sum;
}

Haven't tried this, but WDYT?

Aug 20 '20 22:08 fgmccabe

That's not how a Java cast to a (non-final) class works nor how an rtt cast is typically implemented. (Sorry to be abrupt.) They both use the same cast mechanism, which at its core is a lookup into an array of rtt identifiers embedded within the object's meta-data. In both cases (assuming the rtt is statically known), the relevant offset within the meta-data is known at compile time, and in both cases there first needs to be a check that the meta-data is large enough to have such an offset. In short, a cast for both does the following: load the meta-data from the object, check that the meta-data is large enough, load the value from the predetermined offset, and compare that the value is equal to the cast target.

Aug 20 '20 22:08 RossTate

Depends. If you cast a Java object to an interface, then a search is required. You do not know at compile time which of an object's interfaces is the right one.

Aug 20 '20 23:08 fgmccabe

That's true, which is why the microbenchmark only ever casts to classes.

Aug 20 '20 23:08 RossTate

In any case, it might be better to have a 'pure' C benchmark that mimics potential behavior of WASM cast instructions than relying on actual behavior of cast in Java (say).

Aug 21 '20 06:08 fgmccabe

I suspect a C benchmark would perform the same or worse. It would not incorporate the decades of engineering the JVM has benefited from (remember that every write to an Object[] involves a cast, and to a dynamically determined target type at that), nor would it have the opportunity to benefit from speculative JITing optimizations like inline caching, which some have suggested employing to mitigate casting overhead. So this microbenchmark captures the overhead that would still be around even after decades of engineering effort and even employing techniques that WebAssembly is ostensibly not supposed to rely on.

Aug 21 '20 10:08 RossTate

Obviously some misunderstanding here. The purpose of the C (not C++) benchmark is not to compare with Java's type casting. Its purpose is to expose the true cost of a WASM cast; without requiring full implementations of WASM-GC.

Aug 21 '20 14:08 fgmccabe

Just wondering what if we apply explicit inline cache mechanism. How this change benchmark results?

Aug 23 '20 10:08 MaxGraey

Not sure. It's possible the JVM is already employing such a technique. But also, my understanding is that WebAssembly is not supposed to rely on inline caching for good performance.

Aug 23 '20 23:08 RossTate

We have reached phase 3 and the design is stable. We also have good performance results in practice, although we're still much slower than the JVM on many benchmarks. Closing this since it is not actionable at this point beyond ongoing optimization work in tools and engines.

Nov 01 '22 20:11 tlively

gc gc copied to clipboard

Permanent Slow Casting

gc
gc copied to clipboard