Polymorphic devirtualization and funcref not being a subtype of eqref?
As mentioned in the last GC CG meeting, devirtualization helps quite a lot on j2cl, something like a 41% speedup. That handles the case of a single call target being possible, that is, we load from a vtable and binaryen can infer that the vtable must contain a particular function reference, and so we replace the load with that reference, which then allows the call to become direct and even inlined.
Looking into the polymorphic case, that is, where there is a small number of possible function references but more than one, I was hoping to do something like this:
(struct.get vtable)
=>
(select
(first possible function)
(second possible function)
(ref.eq (struct.get vtable) (first possible function))
)
Later optimizations can then replace a call_ref of a select of two function references into an if over two possible calls, etc. However, the condition of the select hits a problem, as funcref is not a subtype of eqref - function references cannot be compared for equality.
I couldn't find a detailed discussion of that, but IIRC the motivation was to allow VMs to optimize things like folding two identical functions into a single one, etc. That sounds reasonable, but the devirtualization issue shows that might be an optimization tradeoff which is not obvious?
Gathering some data, if I disable validation in binaryen then allowing 2 functions instead of 1 leads to 14K more places where we can turn a get from a vtable into a constant (well, a select over constants). Allowing 3 raises that to 17K, and 4 to 19K. (At some amount this becomes less useful, though, of course.) For comparison, the total number of call_refs is 42K, so even with 2 functions we are talking about potentially optimizing away a third of indirect call sites, which sounds like it could be very significant.
Alternatives:
- Rewrite the types, replacing the funcref with an
i32index that we can select on. The problem is that we'd need to replace the vtable field in all relevant subtypes and supertypes, which may not be practical in general. Adding an additional field is another option, but would add memory and runtime overhead. - Use
ref.teston the vtable. That might work if the different functions come from different types, which I believe is the general case (but I'd need to check). How fast isref.testexpected to be? - Perform such devirtualization in the VM and not the toolchain. Doing it statically is probably not reasonable (as a large LTO-style optimization), but using runtime profiling data it might be.
I think we need to have a meeting where we develop a more concerted plan to address performance issues. For example, we have preliminary data that suggests implementing call_ref differently can improve the performance of call_ref-intensive programs by 40%. (I'm hoping to have more finalized data on that in the next two weeks.) That change in implementation might not work well with funcref being a subtype of eqref.
This is not to say we should not explore the sort of speculative specialization you're investigating; in fact, speculative specialization based on the class (i.e. the whole v-table, rather than individual methods) is pretty common. From what I can tell, you're essentially implementing the first step of guarded method inlining. But since you're doing this for v-table functions, it sounds like you're only doing this for class methods, whereas there's even more to be gained for interface methods.
@RossTate Interesting, I'd definitely be curious to hear more about call_ref optimization opportunities. A meeting sounds good as well.
I think this might still be worth discussing since we can't really devirtualize bimorphic call sites on the producer side without being able to compare function references for equality.
Can you compare the metaobjects instead?
Yeah, we might be able to do a ref.test to check which of two statically-known-to-be-possible type subtrees a method receiver belongs to and do a direct call depending on the result. @kripken, wdyt?
@titzer @tlively Using the metaobject is a good idea, yes. We do something like that in the GSI pass:
(struct.get $vtable-type i
(..ref..)
)
;; => ;;
(select
(value1)
(value2)
(ref.eq
(..ref..)
(global.get $global1)
)
)
If a location reads from a particular vtable type, and there are just two vtables of that type in the whole program, and they are stored in globals (all assumptions that are true in J2Wasm in some cases), then we can do a select between their values.
Sadly, this didn't give us significant benefits so far. I'll investigate and perhaps expand that pass that at some point.
Note though that even if we could do this perfectly, it would not be as good as a function comparison I think, since in theory comparing to a global is slower than comparing to a function (the global is known at link time but the function at compile time). Not sure if that matters though.
Anyhow, I wouldn't block the MVP on this. It does seem like there are downsides to allowing function comparisons (like not being able to merge all duplicate functions). And in principle we can always extend post-MVP by allowing comparisons if we decide it's worth it, as that would just make more code validate.
I don't have data on this yet, but some notes:
To clarify better the case we can optimize better with comparable function references, that is when there are just a few target functions despite having many vtables/itables:
vtableCat = [A];
vtableDog = [B];
vtableRabbit = [A];
vtableBird = [B];
vtableFish = [A];
In this case
vtable[0]()
can be optimized to
if (vtable[0] == A) A() else B()
Larger, but now we have direct calls which we can inline and further optimize.
I'll gather data for J2Wasm on this, but that data would just be on the current state of things there, and the real question is how common that situation will be in the future, compared to the other situation of just 2 possible tables (where we can compare the tables themselves):
vtableCat = [A];
vtableDog = [B];
vtable[0]()
=>
if (vtable == vtableCat) A() else B()
Another thought, we could replace all function references in vtables and itables with indexes into a big (immutable in practice) table, and then use call_indirect over call_ref. We could then compare those indexes instead of comparing the function references themselves (which would also save loading from the table, and comparing ints in general is faster than operations on function references, I am told). However, this would not work if call_ref supports subtyping but call_indirect does not and the codebase uses function subtyping.
I think we had tentatively started leaning towards function subtyping on call_indirect?
The best path forward on this issue would be to get some experimental performance data to see how much better it would be to optimize based on the function references directly.
I gathered some empirical data today. The J2Wasm binary has 23,387 CallRefs, and I counted which of them the optimizer could infer 2 possible function references for, which was 12,111. That seems high - over half - but looking more in depth this is a lot less promising. I didn't automate the next part, but I looked at 10 random cases in depth to see what happens there, and none of them would clearly benefit from function refs being comparable for equality, due to the following reasons:
- A few function pairs contain one function whose body is just an
unreachable. We can assume that is never called anyhow inwasm-opt's traps-never-happen mode, so we could emit a direct call to the other. - A few function pairs are trivial methods that return 0 or 1. I am fairly sure there are exactly two because functions were merged together, because each appear in multiple vtables and the names look suspect (when we merge, we just use the name of the first function). Merging functions depends on not being able to compare them, so it's not clear we'd benefit from comparisons. Also, even if comparisons would help, I wonder if we could optimize this in another way - perhaps j2wasm could turn these into a boolean property.
- Only one of the 10 cases I looked at manually was "interesting" in that it had two functions that were not trivial in any way. I looked at their vtables, and each appears in exactly one vtable in one itable. So we could do exactly what was mentioned before and use the parent object.
There might still be potential to optimize with function comparisons, but it doesn't seem that promising.
Thanks, @kripken! It sounds like the next step here is to follow-up on the other optimization opportunities you've found so far then measure again to see how many cases are left where comparing function references directly could still help. Does that sound right?
Yes. But given the low amount of promise here I'd be fine with leaving this to after the MVP, as I said before.
(For posterity, here is the hackish code I used to measure this, if I or someone else wants to remeasure later).