
Vector register endianness causing ABI incompatibility

Open uweigand opened this issue 1 year ago • 18 comments

Currently, s390x is the only big-endian platform supported by cranelift, which has already caused problems related to memory accesses in the past, since wasmtime defines all data in its memory to be in little-endian byte order.

I had assumed that register contents are not affected by byte order, and while this is true for scalar types, it turns out this is not actually the case for vector registers. This is because of a combination of two issues:

  1. There are IR instructions that directly refer to vector register lanes by number; and
  2. There are IR instructions that re-interpret a vector register as any other vector type, including one with a different number of lanes.

The combination of these two makes the ISA byte-order properties of vector registers visible to IR code, just as memory byte order becomes visible when (parts of) a value in memory are accessed through a different type than the one it was stored as.

Specifically, in the Intel (or ARM) little-endian ISA, if you load e.g. an I32X4 into a vector register, use raw_bitcast to re-interpret that register as I8X16, and retrieve lane 0 of the resulting value via extractlane, the result will be the least-significant byte of the I32 in lane 0 of the original value. And in fact, wasmtime requires this to be the case.

On the other hand, on s390x the content of a vector register in I32X4 mode will be a sequence of four I32, each in big-endian byte order (or else the arithmetic operations on the register will give wrong results). Therefore, the same sequence of operations as above will return the most-significant byte of the I32 in lane 0. This actually caused many wasmtime tests to fail.
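
To make this concrete, here is the sequence in pseudo-CLIF notation (the constant is purely illustrative):

    v0 := load.i32x4 addr          ;; suppose lane 0 of the loaded value is 0x04030201
    v1 := raw_bitcast.i8x16 v0     ;; re-interpret the same 128 bits as 16 byte lanes
    v2 := extractlane.i8x16 v1, 0  ;; wasmtime expects 0x01 (the LSB of lane 0);
                                   ;; a naive big-endian register layout yields 0x04 (the MSB)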


To fix this, my current implementation of SIMD instructions on s390x uses a trick combining two different aspects:

  • When loading the I32X4 from little-endian memory into a vector register, I'm byte-reversing all 16 bytes of the loaded value. This not only fixes each I32 value to be in the correct big-endian order in the register so subsequent arithmetic will work, it also implicitly swaps the order of elements, i.e. the element in slot 0 in memory will end up in what the ISA considers slot 3 of the register etc.
  • The implementation of all IR instructions that use explicit lane numbers will be aware of this renumbering, and implicitly revert it to get back to the lanes the code intends to access, so e.g. using extractlane for lane 0 of an I32X4 will actually at the ISA level extract lane 3 of the register.

The combination of these two aspects makes accessing SIMD registers work correctly for wasmtime. For example, in the above case, accessing lane 0 of an I8X16 is converted to lane 15 of the register, which holds the least-significant byte of the I32 in lane 3 of the register, which was loaded from lane 0 in memory -- so in the end we return the least-significant byte of the I32 in lane 0 of the original value, as expected by wasmtime.


However, in implementing support for rustc_codegen_cranelift, I noticed that the current implementation actually breaks when handling big-endian vectors in memory - this will be the case for rustc, since that uses native platform byte order everywhere. Specifically, when loading a big-endian vector from memory, I'm just loading the bytes unchanged. This means that e.g. lane 0 of an I32X4 in memory ends up in its original byte order (which is OK since this is already big-endian) in lane 0 of the register - but the latter is a problem if subsequent code wants to extract that lane, since an extractlane 0 will actually access lane 3 in the register as described above!

To work around that problem, I've implemented a patch that will perform big-endian vector loads by swapping the order of the elements, but not the byte order within any single element. This will cause lane 0 from memory to end up in lane 3 in the register, and makes extractlane work as expected again.

With that fix, now both wasmtime and rustc_codegen_cranelift pass all their existing SIMD tests. Unfortunately, I think this is still not quite a complete solution.


Specifically, we can now run into two additional problems with big-endian code, which apparently are just never triggered by the existing rustc_codegen_cranelift tests.

First, I guess it would be possible to re-interpret contents in a vector register in another type even in rustc. Now, as opposed to wasmtime, rustc uses native endianness, presumably also w.r.t. vector contents. Therefore, the semantics of such a re-interpretation would be platform-defined and differ between big- and little-endian platforms (which is probably why it's not frequently used). However, users would expect this platform-specific behavior to be the same between the LLVM and cranelift back ends to rustc - which in the current implementation it would not be.

Even more problematic, carrying vector elements in reverse order in vector registers actually affects the ABI, as vector types are passed in vector registers. Code compiled by rustc using the LLVM back end would expect those to be in the "normal" platform order, while code compiled by rustc using the cranelift back end would expect them to be in the "reverse" order.


One option I'm thinking of would be to actually implement both methods in the cranelift back end. Specifically, the back end could support both a "vector big-endian" and "vector little-endian" mode, where the "big-endian" mode would use lane numbers directly as defined by our ISA, while the "little-endian" mode would use the reverse ordering implemented by the current back end code.

There's a slight complication in that we might have to support both big- and little-endian vector modes in load and store operations accessing any combination of big- and little-endian memory locations. But that should be possible:

vector mode      memory byte order    load/store operation
big-endian       big-endian           direct   
big-endian       little-endian        byte-reverse each element
little-endian    big-endian           reverse order of elements 
little-endian    little-endian        byte-reverse 16-byte input

(Starting with z15 we actually can implement each of these operations using a single instruction, so that should also be efficient.)

The one remaining question is how to select between the "vector big-endian" and "vector little-endian" modes. There are no attributes (like MemFlags) on the extractlane etc. operations, and that wouldn't even make sense: this is a global property; if you used big-endian mode to load the vector, you must also use big-endian mode on all subsequent operations.

So I guess this would have to be some sort of global flag, which wasmtime would set to always little-endian, and rustc would leave at native byte order. Of course, this flag is actually ABI changing, so the same setting must be used for code to remain link-compatible. But I think given the above use cases, that should actually be fine.


FYI @cfallin - this is another problem I ran into while enabling rustc_codegen_cranelift - I'd appreciate any comments or suggestions!

uweigand avatar Jul 31 '22 14:07 uweigand

Thanks for writing this up and for the discussion in today's Cranelift meeting. I'm not sure yet which approach I favor, but I thought it might be useful to write out a few principles that I want to try to stick to:

  • I'd prefer for all behavior to be deterministic and fully specified as CLIF semantics, independently of the target platform. This I think you're already considering above with respect to observability, which is great. In general, I want us to make endianness explicit and remove any "native" or "up to the target" options (as in the currently dormant but still valid issue #3369).
  • I also prefer the idea that behavior is specified on instructions or types rather than as a global or per-function mode, where possible. In general if we define a value and the heap as buckets of bits we can describe big-endian and little-endian loads as operations on those bits, and then where we largely deal with little-endian data (because of Wasm) on a big-endian target (e.g. s390x), the individual operations have byte-swapping where needed.

I had mentioned the idea of an encoding in the vector types in the meeting earlier today and there was some issue that you had raised but I don't remember exactly what it was; so perhaps it's worth going through it again here. The idea is to have variants I32X4LE and I32X4BE (for example), and a model that a 128-bit vector register is otherwise just an array of 16 bytes loaded directly from memory (in Wasm terms, a V128 type).

That does imply the existence of two different ops iadd.i32x4le and iadd.i32x4be, though, and I suppose the issue could be that s390x does not have an efficient single-instruction lowering for iadd.i32x4le (i.e., each lane is byte-swapped from its perspective)? That's likely why you had done the byte-swapping on load originally?

Perhaps another approach is to eliminate the notion of bitcasting being a no-op on SIMD vectors. That seems like the other root of the issue here: the actual bit positions change if we have a forward-lane-order, reverse-byte-order-within-lane version of an i32x4 vs. i64x2.

Maybe combining some of the above ideas, including the types, we can capture the actual format of the data in the type and require conversions like:

    v1 := load.v128 addr+0
    v2 := vec_reinterpret_le.i32x4 v1    ;; this converts an i32x4le into s390x's internal representation: BE lanes, reverse lane order. `le` and `be` variants of instruction; controlling type determines lane width/count
    v3 := extractlane v2, 1 ;; accesses hardware lane 3-1 = 2, because v2's type is i32x4 (4 lanes) and lanes are reversed
    v4 := iadd v2, v2   ;; native vector add instruction, with big-endian lanes
    v5 := vec_reinterpret_le.v128 v4  ;; convert an i32x4 in its internal representation back to a v128 with little-endian ordering
    store v5, addr

So this sketch includes (i) a new v128 type whose semantics are as-if an array of 16 bytes, so it can become a native 128-bit load/store on any platform of any endianness; (ii) vec_reinterpret_le and vec_reinterpret_be (or perhaps separate opcodes for v128-to-typed and typed-to-v128) that can be no-ops or byteswaps depending on platform; and (iii) a definition that types like i32x4 on s390x use your byte-swapped, lane-swapped scheme.

Or a slight variation on the above: i32x4 and i32x4rev types, where i32x4rev means lane 0,1,2,3 are in lane 3,2,1,0. Then vec_reinterpret_le.i32x4rev on a v128 is a 16-byte-swap (lane-agnostic) on s390x, and is used by the Wasm frontend; and vec_reinterpret_be.i32x4 on a v128 is a no-op, and is used by the Rust frontend.

Then finally at ABI boundaries and at loads/stores we only allow v128, and the frontends keep this in the same format (endianness and lane order) as in memory.

Anyway, hopefully one of the ideas above contains a kernel of usefulness -- eager to hear what you think @uweigand.

cfallin avatar Aug 01 '22 19:08 cfallin

I had mentioned the idea of an encoding in the vector types in the meeting earlier today and there was some issue that you had raised but I don't remember exactly what it was; so perhaps it's worth going through it again here

I was probably momentarily confused there. Let me try to systematically go through the various representation options to see what makes sense:

  1. A value of vector type may reside in memory or in a register
  2. (Except for the I8X16 case,) there are two ways to arrange the bytes within each lane ("byte order")
  3. (Except for the I128 case,) there are two ways to arrange the lanes themselves ("lane order")

This would result in 8 different cases to be considered. However, I believe half of those can be quickly eliminated since they serve no useful purpose:

  • For vector values in memory, only one lane order is useful: the one that matches array order (lane 0 at low addresses)
  • For vector values in registers, only one byte order is useful: the one that matches the native machine byte order

As a result, we can conclude:

  • The representation of vector values in memory is determined by specifying a byte order - and we already do this today via the MemFlags carried by all instructions that access memory.
  • The representation of vector values in registers is determined by specifying a lane order - this remains the open question.

The next question is, how to characterize the two "lane order" options for values in registers. First, we notice that if the vector is interpreted as I128, there is actually only one option, because that type only has a single lane. Next, we can define lane order for other types in terms of re-interpreting the bit pattern as I128:

  • If lane 0 of a vector type occupies the least-significant bytes of the re-interpreted I128 value, we call it "little-endian lane order".
  • If lane 0 of a vector type occupies the most-significant bytes of the re-interpreted I128 value, we call it "big-endian lane order".

This definition ensures in particular that we can match WebAssembly vector conversion semantics by always using vector types in little-endian lane order (when in registers). To match the semantics required by native vector code (e.g. for the Rust front end), we need to use whatever lane order is considered the "native" default for the platform. It appears that this matches the native byte order for all platforms I'm aware of (i.e. all platforms with native little-endian byte order seem to be using native little-endian lane order, and all platforms with native big-endian byte order seem to be using native big-endian lane order), but that's actually not necessarily a requirement.
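
As a concrete illustration of the two lane orders (the lane values below are made up):

    ;; an i32x4 whose lanes are [0x00000000, 0x11111111, 0x22222222, 0x33333333]
    ;;
    ;; little-endian lane order, re-interpreted as i128:
    ;;   0x33333333_22222222_11111111_00000000   (lane 0 in the least-significant bytes)
    ;;
    ;; big-endian lane order, re-interpreted as i128:
    ;;   0x00000000_11111111_22222222_33333333   (lane 0 in the most-significant bytes)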

So far, we have not looked at ISA instructions at all. First, we note that most operations are in fact independent of the lane order: an iadd.i32x4 in little-endian lane order vs. an iadd.i32x4 in big-endian lane order are exactly the same operation and can be implemented by exactly the same target instruction. There are only two types of instructions where lane order is relevant:

  • Memory loads and stores.
  • Instructions that explicitly refer to lane numbers (insertlane, extractlane, swizzle, and shuffle).

For the memory operations, we have four versions each, depending on the memory byte-order and the register lane-order. These match the four versions described in the table in the original issue text. For the explicit lane number operations, we'd have two versions each, depending on the lane order.

Finally, we have to look at ABI considerations. Obviously, if a value of vector type crosses an ABI boundary, there are ABI concerns: caller and callee must agree on the representation, which means they must agree on byte order if the value is passed in memory, and they must agree on lane order if the value is passed in register.

For in-memory byte order, that's actually just business as usual since that applies to all other types as well. [ As an aside, the situation is a bit more complex than one might first think: for values explicitly passed as parameters, we always use native byte order (even with the Wasmtime ABI, all values on the stack are always stored in native byte order currently), but for values used more indirectly (e.g. from a global variable or via a pointer), platform assumptions are in play (Wasmtime uses LE for such values, native code uses native byte order). ]

The in-register lane order for vector types at ABI boundaries is a new question to be asked. For native code, we must follow existing precedent, which requires native lane order (whatever that is on any given platform). For Wasmtime, if we're lucky the issue does not occur, since Wasmtime only supports a single v128 type at ABI boundaries, which is effectively I128, and therefore its representation in register does not depend on lane order. (However, there might be a hitch in that the Wasmtime translator currently uses I8X16 to represent the Wasmtime v128 type - and I8X16 does depend on lane order ...)

All that said, we still have to decide how to represent lane order information.

Option 1: Global or per-function setting

I understand you don't like this option, but it would certainly be feasible. In particular, it seems to me now that tying the lane order to the function ABI might be a good option, given that:

  • The lane order at ABI boundaries is in any case a part of the visible ABI.
  • Usually, a function will use the same lane order throughout, so it makes sense to use the lane order at ABI boundaries also as lane order for the whole function.
  • This could actually be implemented as an s390x back-end-only change ("if the current function uses the Wasmtime ABI, it will use little-endian lane order throughout, otherwise big-endian lane order"), affecting nothing else.

Option 2: Types

It appears it would indeed also be possible to use types to encode the lane order information. We'd have e.g. i32x4le and i32x4be (where le and be of course refer to lane order, not byte order), and then implement the above-mentioned instructions according to the lane order implied by the type.
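
For illustration, a rough sketch using hypothetical i32x4le / i32x4be type names (none of this syntax exists today):

    v1 := load.i32x4le little addr    ;; Wasm front end: little-endian memory, LE lane order
    v2 := extractlane.i32x4le v1, 0   ;; lane numbers interpreted in LE lane order
    v3 := load.i32x4be big addr       ;; native s390x front end: big-endian memory, BE lane order
    v4 := extractlane.i32x4be v3, 0   ;; lane numbers interpreted in BE lane order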

This option might have some less-desirable consequences:

  • For swizzle and shuffle, we may run into the case that the two (or three?) vector input registers have types implying different lane orders. Would we have to fully support all combinations here? This might not make much sense ...
  • The le and be types would only affect in-register representation. The in-memory representation would be identical, which might make it a bit surprising to use two different types ...
  • This option implies a large-scale change to CLIF IR and every user of cranelift, since all uses of vector types would have to be rewritten.
  • Also, we'd need a way for native front ends like Rust to specify "native" lane order. Either by querying ISA properties, or else by having a separate "native" set of types (similar to what we do today for byte order).

Option 3: Instructions

Another option would be to use different instructions. This basically means we'd have either two copies of the affected instructions (memory ops, lane number ops), or else they'd all get some flags argument. Either way, the lane order to be used would be hard-coded into the instruction itself.

Some disadvantages of this option:

  • There would be no automatic validation that the inputs to an operation actually are in the lane order that the instruction assumes. (In the case of types, the validator could ensure that.)
  • There's still a need to change all users of cranelift, but possibly to a lesser extent than the type option.
  • Similarly, there's still a need to specify "native" lane order.

Option 4: Explicit conversion

Perhaps another approach is to eliminate the notion of bitcasting being a no-op on SIMD vectors.

Sure, if raw_bitcast is no longer a no-op, much of the above would not be necessary: we'd simply use native lane orders in registers always, and then implement raw_bitcast as permutation.

However, we'd still have to know whether the user intended raw_bitcast to transform between types in little-endian or big-endian lane order, because it would still be a no-op in one case but not the other. So we still have to solve the question of how to specify lane order. (I guess if we chose the per instruction method this might become a bit simpler, since we'd now just need two versions of raw_bitcast, every other instruction could stay the same.)

But I really don't like that option, since it would significantly degrade performance of Wasmtime on BE machines. Some instances of raw_bitcast could probably be optimized away (e.g. folded into an adjacent load or store), but the Wasmtime translator seems to be creating a lot of raw_bitcasts, all over the place.


As to your code example, I don't see how that would solve the full problem. You're doing a vec_reinterpret_le or vec_reinterpret_be adjacent to the loads and stores -- that is effectively the same as having two flavors of load and store.

However, that doesn't solve the problem because we also need two flavors of extractlane! As a somewhat contrived example, let's assume we have code that adds two i32x4 vectors from memory, bitcasts the result to i8x16, and extracts lane 0.

In Wasmtime semantics, this is supposed to return the LSB of the sum of lanes 0 of the two inputs. In native semantics, this is supposed to return the MSB of the sum of lanes 0 of the two inputs.

The code sequences I want to generate to efficiently implement both operations would be, for the native case:

   v1 := load.i32x4 x             ;; 16 byte native load
   v2 := load.i32x4 y             ;; same
   v3 := iadd.i32x4 v1, v2        ;; native SIMD instruction
   v4 := raw_bitcast.i8x16 v3     ;; no-op
   v5 := extractlane.i8x16 v4, 0  ;; accesses hardware lane 0

and for the Wasmtime case:

   v1 := load.i32x4 x             ;; 16 byte fully byte-reversed load
   v2 := load.i32x4 y             ;; same
   v3 := iadd.i32x4 v1, v2        ;; native SIMD instruction
   v4 := raw_bitcast.i8x16 v3     ;; no-op
   v5 := extractlane.i8x16 v4, 0  ;; accesses hardware lane 15

This only works if both the load and the extractlane are affected by the lane order choice.

uweigand avatar Aug 02 '22 18:08 uweigand

Thank you again @uweigand for the detailed space-exploration here; I need to think more about this. Perhaps we can schedule some more time to talk about it in a Cranelift meeting?

I will say now that one concern I have with a function-level or global-level mode is composability: if the meaning of instructions depends on the environment (func or larger scope) and that environment has only one setting, then we can no longer write IR that uses both behaviors, and we can no longer copy instructions between functions. That may not be a common operation now, but it will become a first-order concern when we build an inlining pass. I could imagine a (far) future where we have native runtime code compiled from e.g. a Rust frontend, and Wasm code compiled from the Wasm frontend, and we want to perhaps inline across them. So from a semantics and clean-design point of view, I'd prefer for instructions to encapsulate their entire meaning, without influence from environment.

That leaves either the types or some update to the opcodes; I'm not sure which is the better option at this point. More discussion and thought needed!

cfallin avatar Aug 03 '22 18:08 cfallin

Thanks for looking at this, @cfallin ! Happy to have a discussion in the meeting - I'd be available next Monday (Aug 8).

uweigand avatar Aug 03 '22 18:08 uweigand

The next question is, how to characterize the two "lane order" options for values in registers. First, we notice that if the vector is interpreted as I128, there is actually only one option, because that type only has a single lane. Next, we can define lane order for other types in terms of re-interpreting the bit pattern as I128:

* If lane 0 of a vector type occupies the _least_-significant bytes of the re-interpreted `I128` value, we call it "little-endian lane order".

* If lane 0 of a vector type occupies the _most_-significant bytes of the re-interpreted `I128` value, we call it "big-endian lane order".

This definition ensures in particular that we can match WebAssembly vector conversion semantics by always using vector types in little-endian lane order (when in registers). To match the semantics required by native vector code (e.g. for the Rust front end), we need to use whatever lane order is considered the "native" default for the platform. It appears that this matches the native byte order for all platforms I'm aware of (i.e. all platforms with native little-endian byte order seem to be using native little-endian lane order, and all platforms with native big-endian byte order seem to be using native big-endian lane order), but that's actually not necessarily a requirement.

As I mentioned during the last Cranelift meeting, I think that when data endianness is set to big-endian, the 64-bit Arm architecture does not fit into either of those categories. Consider an indexed vector multiplication, i.e. a variant of the MUL instruction, as an example - neither the instruction description nor the pseudocode mentions endianness, so if we are dealing with 16-bit elements, then element 0 occupies bits 0 to 15 out of the 128-bit vector.

Now, we do not have any plans to implement big-endian support inside Cranelift's AArch64 backend in the near future (big-endian platforms are somewhat exotic and in fact big-endian support is optional in the architecture), but I think that this information might inform whatever design we come up with.

Option 1: Global or per-function setting

I understand you don't like this option, but it would certainly be feasible.

The option I mentioned during the meeting was to introduce an ISA-specific setting in cranelift/codegen/meta/src/isa/s390x.rs, for instance, but it seems that it is not flexible enough to support the use case of mixing Rust and Wasm code that @cfallin is talking about.

akirilov-arm avatar Aug 03 '22 20:08 akirilov-arm

This definition ensures in particular that we can match WebAssembly vector conversion semantics by always using vector types in little-endian lane order (when in registers). To match the semantics required by native vector code (e.g. for the Rust front end), we need to use whatever lane order is considered the "native" default for the platform. It appears that this matches the native byte order for all platforms I'm aware of (i.e. all platforms with native little-endian byte order seem to be using native little-endian lane order, and all platforms with native big-endian byte order seem to be using native big-endian lane order), but that's actually not necessarily a requirement.

As I mentioned during the last Cranelift meeting, I think that when data endianness is set to big-endian, the 64-bit Arm architecture does not fit into either of those categories. Consider an indexed vector multiplication, i.e. a variant of the MUL instruction, as an example - neither the instruction description nor the pseudocode mentions endianness, so if we are dealing with 16-bit elements, then element 0 occupies bits 0 to 15 out of the 128-bit vector.

Thanks for the pointer! From my reading of the document, this simply means that AArch64 always uses a native vector lane ordering (according to my definition above) of little-endian, independent of whether the default byte order is little-endian or big-endian.

Now that I think about it, Power is actually similar: Power supports both big-endian and little-endian default byte order, but the native vector lane ordering is always big-endian.

So my statement quoted above that native lane order seems to always match default byte order is definitely not true :-) But as I said, it's not actually a requirement. This simply means that we need to provide both default byte order and native vector lane order as independent parameters of an ISA.

uweigand avatar Aug 03 '22 22:08 uweigand

Thinking about this some more, I have some further comments on this option:

Option 4: Explicit conversion

Perhaps another approach is to eliminate the notion of bitcasting being a no-op on SIMD vectors.

Sure, if raw_bitcast is no longer a no-op, much of the above would not be necessary: we'd simply use native lane orders in registers always, and then implement raw_bitcast as permutation.

However, we'd still have to know whether the user intended raw_bitcast to transform between types in little-endian or big-endian lane order, because it would still be a no-op in one case but not the other. So we still have to solve the question of how to specify lane order. (I guess if we chose the per instruction method this might become a bit simpler, since we'd now just need two versions of raw_bitcast, every other instruction could stay the same.)

Ignoring efficiency aspects for a moment, this option may actually be the cleanest, semantically, after all. To be specific, the option would consist of replacing raw_bitcast with two new IR instructions, called something along the lines of bitcast_le_lanes and bitcast_be_lanes (better naming suggestions welcome!). This would be the sole change: no changes to any other instructions, no changes to the type system, and no global or implied parameters anywhere.

The semantics of the new instructions would be:

  • bitcast_le_lanes: reinterpret the bit pattern of vector type A, assumed to be represented in little-endian lane order, as vector type B, also represented in little-endian lane order
  • bitcast_be_lanes: same, but using big-endian lane order

The semantics of the two instructions would then actually be fully well-defined, and not dependent on any implicit platform properties (unlike the current raw_bitcast).

If the requested lane order matches the lane order the implementation chose to hold vector values in vector registers, the operation remains a no-op; otherwise, the implementation is a permutation. Specifically, a bitcast_[bl]e_lanes from type X to type Y is either a no-op, or a sequence of "swap lanes of type X" + "no-op reinterpretation" + "swap lanes of type Y".
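
To illustrate on s390x, a rough sketch with the proposed opcodes, assuming the back end keeps vector values in big-endian lane order:

    ;; v0 has type i32x4
    v1 := bitcast_be_lanes.i8x16 v0   ;; BE lane order on both sides: a register no-op
    v2 := bitcast_le_lanes.i8x16 v0   ;; LE lane order semantics: swap the four i32 lanes,
                                      ;; re-interpret, then swap the sixteen i8 lanes --
                                      ;; in practice a single byte permutation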

Now, let's look back at implementation efficiency. As I noted above, a naive direct implementation of the semantics as just defined would be very inefficient, certainly in the case of Wasmtime on BE lane order systems. However, there may be options to provide optimizations to make the implementation more efficient - without any change to the semantics! (And therefore, without affecting the properties you were concerned about, @cfallin.)

Specifically, I can think of options to optimize implementation:

Explicit swap_lanes instructions

We could, as an optimization pass, do the following:

  • Replace the bitcast_[bl]e_lanes instructions with a pair of explicit swap_lanes instructions as per the semantics above.
  • Note that swap_lanes commutes directly over all "normal" vector instructions, and commutes over vector instructions using explicit lane numbers when adding the appropriate lane number compensation operations. Also, two adjacent swap_lanes operations on the same type cancel each other, and a swap_lanes of a single-lane type is a no-op. These are just generally valid, semantics-preserving transformations of the IR (see the sketch after this list).
  • Using a series of such transformations, we can push all swap_lanes up or down all the way to the original sources and sinks of the data flow graph. Those would be arguments and return values, and memory loads and stores. For the latter we can then use appropriate instruction selection to merge the swaps into the memory operation (if the ISA has such instructions). For arguments and return values, the swaps remain (however in the Wasmtime special case, we should always see the single-lane V128 type here, so those swaps would also be no-ops).
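
A minimal sketch of such a rewrite, using a hypothetical swap_lanes instruction:

    v1 := swap_lanes.i8x16 v0         ;; reverse the 16 byte lanes of v0
    v2 := extractlane.i8x16 v1, 0     ;; lane 0 of the swapped value
    ;; commutes to the equivalent:
    v2 := extractlane.i8x16 v0, 15    ;; same value, the swap has been eliminated
    ;; likewise iadd(swap_lanes(x), swap_lanes(y)) == swap_lanes(iadd(x, y)),
    ;; so swaps can be pushed towards loads/stores and folded into them there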

That is something that could possibly be done within the new egraph-based framework. But maybe all that compile-time overhead isn't actually necessary either, because we can just do this instead:

Implementation choice of lane order (semantics preserving)

With the above model, Cranelift IR has well-defined semantics that does not depend on global parameters. However, the code generation back-end is free to choose to emit whatever instruction sequence it wants - assuming it in the end implements that well-defined semantics.

In particular, the implementation is actually always free to choose whether to represent vector values in vector registers in little-endian or big-endian lane order! For example, on a platform like s390x with native big-endian lane order, I can implement Cranelift IR semantics as:

  • Assuming the ABI prescribes vector values are in native (big-endian) lane order at ABI boundaries, just leave them that way.
  • Implement memory operations and lane number related operations assuming big-endian lane order.
  • Implement bitcast_le_lanes as explicit permutation.
  • Implement bitcast_be_lanes as no-op.

However, I also can implement the very same Cranelift IR semantics by doing instead:

  • Lane-swap all vector values at the ABI boundary.
  • Implement memory operations and lane number related operations assuming little-endian lane order.
  • Implement bitcast_le_lanes as no-op.
  • Implement bitcast_be_lanes as explicit permutation.

This just results in a different instruction sequence for the current function, but implementing the exact same semantics.

Now, since this is just an implementation choice, I'm free to use whatever heuristics make most sense to decide which way to go. For example, if a function contains many bitcast_le_lanes and no bitcast_be_lanes, it would likely be beneficial to choose the implementation that makes bitcast_le_lanes a no-op, and vice versa.

In fact, I could even just use the current function's ABI as input to that heuristic, and simply use little-endian lane order for functions using the Wasmtime ABI, and big-endian lane order otherwise. Combined with the fact that in the Wasmtime ABI, we only have the single-lane V128 at ABI boundaries and therefore those swaps are no-ops, I would end up with exactly the same generated code, and basically no compile time overhead either, as I would have in the option described above as "Option 1: Global or per-function setting".

But at the same time, there actually is no global state and Cranelift IR semantics is locally well-defined. While I do look at the ABI to make a decision, the ABI only influences implementation choice heuristics, not semantics.

uweigand avatar Aug 03 '22 23:08 uweigand

Thanks for looking at this, @cfallin ! Happy to have a discussion in the meeting - I'd be available next Monday (Aug 8).

Hi @cfallin, for reference here's the set of charts I shared in today's meetings: Vector-Endian.pdf

uweigand avatar Aug 08 '22 16:08 uweigand

The remaining open part of this issue is the question of how to define the new IR instruction(s) to implement bitcasts with explicitly specified lane order. In preparation for this, I've been reviewing the existing bitcast and raw_bitcast instructions.

First, we have bitcast, which is specified as

        Reinterpret the bits in `x` as a different type.

        The input and output types must be storable to memory and of the same
        size. A bitcast is equivalent to storing one type and loading the other
        type from the same address.

However, the implementation doesn't really match the specification in various aspects:

  • At the parser level, bitcast does accept exactly the set of types storable to memory (i.e. basically everything except bool); however, the size check is only partially enforced: any pair of types with lane_bits(dest) >= lane_bits(src) is accepted by the validator.
  • At the implementation level in all three back-ends, only a very limited set of types is actually supported: bitcasts between I32 and F32, and between I64 and F64 respectively. (These are implemented as cross-register-bank moves.)
  • Reviewing all users of bitcast, I find that it is currently also only used for these particular combinations: to implement the F32ReinterpretI32 etc. WebAssembly instructions, and in cg_clif to implement transmutes between exactly those types.

Then, separately, we have raw_bitcast, which is specified as:

        Cast the bits in `x` as a different type of the same bit width.

        This instruction does not change the data's representation but allows
        data in registers to be used as different types, e.g. an i32x4 as a
        b8x16. The only constraint on the result `a` is that it can be
        `raw_bitcast` back to the original type. Also, in a raw_bitcast between
        vector types with the same number of lanes, the value of each result
        lane is a raw_bitcast of the corresponding operand lane. TODO there is
        currently no mechanism for enforcing the bit width constraint.

Reading the text, it is somewhat unclear what exactly, on the specification level, the difference between bitcast and raw_bitcast is intended to be. The only clearly mentioned difference is that raw_bitcast is explicitly supported on boolean types (or at least boolean vector types), which are not supported with bitcast as they cannot be stored to memory.

Looking at the actual implementation, we see:

  • Boolean types are in fact explicitly supported (but for some reason, dynamic SIMD types are not, so this is not a clear superset of the set of types supported by bitcast either).
  • There is absolutely no check on type sizes (or lane number, for that matter) during validation.
  • All back-ends implement this as a pure no-op. This means bitcast cannot be replaced by raw_bitcast on those combinations where cross-register-bank moves would be required, even though the specification doesn't seem to exclude this directly.
  • raw_bitcast is used in a single place in cg_clif to implement transmutes between (any pair of) vector types. It is frequently used by the wasm front-end to map between the WebAssembly V128 and various CLIF vector types. In particular, it is definitely used for boolean vector types.

In addition, there are two places in cranelift middle-end common code that match and/or generate raw_bitcast:

  • simplify in simple_preopt.rs, when performing the bitselect to vselect optimization.
  • when constructing a vector bitselect during NaN canonicalization.

It seems to me that both of these aren't really necessary: I think the NaN canonicalization can just be done without bitcasts, and the bitselect to vselect optimization is only useful on x64 anyway and should really be done as an ISLE rule in that backend.

In summary, it appears that the distinction between bitcast and raw_bitcast has been driven primarily by the particular implementation (int vs. float register bank moves on the one hand, and no-op re-interpretation of vector types on the other). This doesn't seem an ideal choice for providing well-defined general operation semantics, though ...

The specification in bitcast reading "A bitcast is equivalent to storing one type and loading the other type from the same address." actually does look very reasonable to me. And in particular, this would naturally handle the lane order distinction if the storing and loading operation mentioned there were interpreted as using a particular byte order. From that perspective, it might even make sense to just add MemFlags to the bitcast operation ...
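
In other words (a rough sketch with hypothetical syntax), such a bitcast would be defined as equivalent to a store/load pair through some scratch address p:

    ;; hypothetical: "v2 := bitcast.i8x16 little v1" with v1 of type i32x4
    store little v1, p          ;; store the i32x4 with little-endian byte order
    v2 := load.i8x16 little p   ;; reload the same 16 bytes as an i8x16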

However, the one place this breaks down is that we cannot handle boolean types, since those cannot be stored to or loaded from memory. But that in itself seems a bit of an inconsistency: I understand the reason for this restriction is that the actual bit pattern used to implement true and false values was intended to be platform-specific, and therefore there should be no CLIF IR operation that exposes this bit pattern. But that's no longer true anyway as you can just raw_bitcast them. And in fact, the wasm implementation of vector compare operations is only correct if back-ends use one particular implementation (namely, all 0 bits for false, and all 1 bits for true).

So I think we either should make this an actual requirement, and then allow loading/storing those types, or else, we disallow raw_bitcast on boolean (vector) types and instead provide a new instruction similar to bint (except using 0/-1 instead of 0/1) that the wasm frontend could use to implement WebAssembly semantics. In both cases, there would be no more need to distinguish between bitcast and raw_bitcast on an IR level. Of course, back-ends would still need to implement both the cross-register-bank move and the in-register no-op semantics - but that distinction could be made via the type instead of the operation.

Any comments or suggestions welcome!

uweigand avatar Aug 31 '22 13:08 uweigand

@uweigand Are you aware of PR #4820?

IMHO the only difference between bitcast and raw_bitcast is that the latter is allowed to change the number of lanes (while still keeping the overall size of the type the same). The fact that all backends implement raw_bitcast as a no-op is a bug, precisely because, as you imply, that's wrong for a cast from F64 to I64, for example. Of course, given the main use case of the operation (lane number change), we never hit this issue in practice. I am surprised that a fuzzer hasn't started to, though.

akirilov-arm avatar Aug 31 '22 15:08 akirilov-arm

It looks like cranelift-fuzzgen doesn't generate either variant of bitcast yet, and the other fuzzers generate wasm rather than CLIF. So I'm not surprised that fuzzing hasn't exercised this part of the CLIF specification.

jameysharp avatar Aug 31 '22 15:08 jameysharp

@uweigand Are you aware of PR #4820?

I wasn't - thanks for the pointer!

IMHO the only difference between bitcast and raw_bitcast is that the latter is allowed to change the number of lanes (while still keeping the overall size of the type the same).

Well, that, and of course the boolean type issue I mentioned.

uweigand avatar Aug 31 '22 16:08 uweigand

Ah, I see, my memory is fuzzy, and I feel that this was discussed before, but is the boolean thing basically the only real reason for the distinction between bitcast and raw_bitcast? The latter is strictly more general than the former, so I suppose having bitcast just helps with type checking (oh, and I am intentionally ignoring what the backend implementations look like).

P.S. IMHO the actual implementations of both bitcast and raw_bitcast should be exactly the same, at least in the AArch64 case, and PR #4820 contains the code (though it doesn't touch the raw_bitcast lowering, since @dheaton-arm doesn't necessarily share my opinion).

P.P.S. I just noticed the difference with respect to dynamic vector type support - @sparker-arm, was this just an accidental omission?

akirilov-arm avatar Aug 31 '22 17:08 akirilov-arm

I never thought bitcasts could be so complicated until I read the clif docs :)

It sounds like the standardization of booleans, hopefully enabling storage, is the only thing preventing one cast to rule them all?

RE: dynamic vectors, the fact that bitcasts are currently supported and raw_bitcasts are not is purely coincidental. Load/store to stack slots were implemented, so the Mem type set had dynamic vectors built in, and that is the same type set we use for bitcast.

I don't believe any casts should even apply to dynamic vectors, as at the IR level we can't tell whether two dynamic types have equivalent sizes.

sparker-arm avatar Sep 01 '22 09:09 sparker-arm

@uweigand In case you aren't aware, there was some discussion on Zulip about bitcast/raw_bitcast as well, which had some conclusions on the differences between bitcast and raw_bitcast.

dheaton-arm avatar Sep 01 '22 14:09 dheaton-arm

From that perspective, it might even make sense to just add MemFlags to the bitcast operation ...

This makes sense to me.

However, the one place this breaks down is that we cannot handle boolean types, since those cannot be stored to or loaded from memory. But that in itself seems a bit of an inconsistency: I understand the reason for this restriction is that the actual bit pattern used to implement true and false values was intended to be platform-specific, and therefore there should be no CLIF IR operation that exposes this bit pattern. But that's no longer true anyway as you can just raw_bitcast them. And in fact, the wasm implementation of vector compare operations is only correct if back-ends use one particular implementation (namely, all 0 bits for false, and all 1 bits for true).

I would actually be in favor of investigating whether we can completely remove boolean types. They are always the special, odd case. I'm not convinced that we couldn't just get by with i8 instead, as the other scalar booleans don't seem particularly useful, and the vector booleans even less so. It would certainly let us clean up a ton of things.

I would also be in favor of investigating whether we can remove i8x16 et al and just have v128 like Wasm has. I don't think the types are actually used by anything and we just have to insert these raw_bitcasts all over the place. What are they buying us? Although maybe this isn't possible because of lane ordering. I clearly haven't thought too deeply about this, but if it is practical, then I'm all for it.

fitzgen avatar Sep 01 '22 20:09 fitzgen

The types are used for determining what kind of operation (lane size, lane count) to perform when you have e.g. an iadd instruction. Also, cg_clif doesn't need bitcasting for vectors nearly as much, since Rust uses unique types for each vector lane count and width.

bjorn3 avatar Sep 01 '22 20:09 bjorn3

Weighing in a bit late here, sorry (and thank you @uweigand for the detailed analysis as always!) -- I think I'm converging toward the following thoughts:

  • Indeed, booleans are best addressed by removing them altogether. The BNxM vector-bool types and Bn wide-bool types for n > 1 never sat well with me: the intent is clearly to have some sort of mask value, but these types are n-bit containers where all but two of the 2^n possible values are illegal. As we explored in #3205 (an issue which just celebrated its first birthday -- clearly this stuff is not easy), that implies either some sort of validation -- on load? on conversion into a wide bool type? -- or undefined behavior in some cases.

    The bitmask-like cases for wide bools can IMHO be satisfied by integer-typed SIMD values, and the cases where we need to communicate a value from an icmp to a "bool" consumer, like select, can pattern-match the (select (icmp ...)) as they do today and work from a reified zero/nonzero value otherwise (any nonzero -> true).

    I also suspect that our support for bool types is patchy at best -- there are lowerings in some places tested by tests, but bools have never been as well-exercised as ints. Simplifying our set of types gets us closer to full coverage in one big step.

  • If we remove boolean types altogether, then we no longer need the distinction between the two bitcasts. At that point I favor calling it bitcast; the raw_ is somewhat confusing (isn't a bit-level pun operation at some level "raw" no matter what?).

  • Adding a MemFlags to the bitcast instruction is a pretty novel idea, and almost follows from the "store-then-load" equivalent sequence, I think, except that the underlying store and load could have different flags. Similarly some of the flags wouldn't matter (e.g., notrap and aligned -- the operation would be defined in terms of a hypothetical memory location that is always valid and aligned), and I generally prefer not to have representable variants that don't have semantic meaning. Instead could we make endianness part of the cast? This I think subsumes @uweigand's proposed ops above, with the LE / BE variants. The endianness is irrelevant for scalar casts in practice (and let's say by definition if we want: I don't think there are machines that use one endianness for I64 and another for F64?) but matters for vector lane order (see rest of this issue!).

If this seems reasonable, then I think the order of operations is (i) remove bools, (ii) remove raw_bitcast and move vector use-cases over to bitcast, merging the lowerings for the different cases in the backends, then (iii) add the notion of endianness to bitcast.
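
To make the end state tangible, here is a rough sketch of what the Wasm front end might emit after steps (i)-(iii), again with hypothetical syntax:

    v1 := load.i32x4 little heap_addr   ;; Wasm heap access, little-endian memory
    v2 := bitcast.i8x16 little v1       ;; former raw_bitcast, now carrying an explicit
                                        ;; endianness: a no-op on little-endian targets,
                                        ;; a lane permutation on s390x
    store little v2, heap_addr          ;; stored back as little-endian bytes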

Thoughts?

cfallin avatar Sep 01 '22 21:09 cfallin