Inefficient x64 codegen for bitselect
I guess the bit-select semantics were chosen because ARM does it that way? Sure would be nice if this mapped to a single blendv instruction on x86. Have we considered the following semantics which would allow using both VBSL on ARM and blendv on x86?
The input to bitselect should be a "mask" in which every lane is either all zeros or all ones. That's typically true anyway (it's the result of a comparison instruction), and anyone who has a high-bit-only mask (0x8..0) can arithmetic right-shift to get the all-bits mask.
Given that requirement, x86 blendv does the right thing - checking the upper bit is sufficient. The operation can lower to a single instruction.
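To make the contrast concrete, here's a minimal sketch with SSE4.1 intrinsics (illustrative only; the function names are made up, not engine code):

```c
#include <smmintrin.h>  /* SSE4.1 for _mm_blendv_epi8 */

/* Current wasm semantics: select individual bits, valid for any mask.
 * Lowers to three logic ops: (a & mask) | (b & ~mask). */
static __m128i bitselect_any_mask(__m128i a, __m128i b, __m128i mask) {
    return _mm_or_si128(_mm_and_si128(a, mask),
                        _mm_andnot_si128(mask, b));
}

/* Proposed semantics: if every byte of the mask is 0x00 or 0xFF, blendv
 * computes the same result in one instruction, because it only inspects
 * the top bit of each byte. */
static __m128i bitselect_allbits_mask(__m128i a, __m128i b, __m128i mask) {
    return _mm_blendv_epi8(b, a, mask);  /* top bit set -> take a */
}
```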
@jan-wassenberg If I understand correctly, the proposed semantics is the same as the current bitselect semantics except that it requires that all mask bits for a particular lane have the same value. Is that correct? If so, would a malformed mask result in a trap? I expect having to validate the mask would lead to similarly suboptimal codegen.
@tlively Right. I agree validating inside bitselect would be expensive. Is undefined behavior generally unacceptable? If so, we could define a Mask type guaranteed by the compiler to have that property. Comparisons would return that type, and we could provide mask<=>simd128 conversions that do validation.
Let's see if I understand:
- if the mask is static (e.g. from `i32x4.const` or a special `mask` instruction?), then a JIT compiler could validate that it is correct before emitting a single `blendv`
- if the mask is dynamic,
  - it is the output of a comparison (e.g. `i32x4.eq`), in which case the JIT compiler could guarantee the mask is well-formed and emit a single `blendv`
  - or it is coming from some other runtime value, in which case the user would need to convert it to a mask using some new instruction (`i32x4.to_mask`) with an added cost of ? instructions to validate it (what exactly is an invalid mask? a combination of 0s and 1s in a lane for ARM?)
Yes, WebAssembly cannot have undefined or implementation-defined behavior. Introducing a new mask type could solve the problem, but it would be a large amount of complexity to add to the current proposal. I propose that we keep the current bitselect instruction and defer discussion of a more complex version to a post-MVP SIMD proposal.
@abrown Yes, that's what I was thinking. Valid could mean `all_true((mask == 0) | mask)`.
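Concretely, a sketch of such a check in SSE2 intrinsics, taking "valid" to mean each 32-bit lane is exactly 0 or all ones (a hypothetical helper, not proposed codegen):

```c
#include <emmintrin.h>  /* SSE2 */

/* A mask is well-formed when every 32-bit lane is 0 or all-ones. */
static int mask_is_valid_i32x4(__m128i mask) {
    __m128i zero  = _mm_setzero_si128();
    __m128i ones  = _mm_cmpeq_epi32(mask, mask);  /* all-ones constant */
    __m128i is0   = _mm_cmpeq_epi32(mask, zero);
    __m128i is1s  = _mm_cmpeq_epi32(mask, ones);
    __m128i valid = _mm_or_si128(is0, is1s);      /* per-lane verdict */
    return _mm_movemask_epi8(valid) == 0xFFFF;    /* all lanes pass */
}
```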
@tlively I totally understand that this would be a large change with relatively low ROI. It's still worth keeping in mind that masks would ideally be a separate type, if long vectors should ever be extended to AVX-512. (There, masks are single-bit, and it would be inefficient to expand them to vectors after every comparison.)
I totally agree that having a separate mask type would make sense in a follow up SIMD proposal 👍
I understand the complexity argument, but I think this would be far easier to do now than later. Once the current semantics for `bitselect` are baked in, we would have to create a new instruction for the semantics @jan-wassenberg described, right? And the accompanying confusion from users of this instruction won't help: "we have two instructions doing almost the same thing, but one is more efficient?" I don't see how the masked bitselect can be backwards compatible in any way that doesn't involve a lot more work than we would have to do now. Why not just do this now, even if it does involve some work?
Because it would be a lot of work and we are trying to get SIMD into the hands of our users soon. Creating a new bitselect instruction in the future will be fine. A precedent for that kind of addition is the nontrapping float-to-int instructions, which were added post-MVP when we realized the MVP trapping float-to-int instructions were not as useful. To be clear, I like this idea and would be happy to do it in the future, but I don't think we should delay shipping MVP SIMD to make this change.
@abrown Having different semantics here that map to one instruction for bitselect would mean either introducing a separate mask type or validating the mask. The code sequence here for x64 is fairly straightforward and in both cases uses 1-cycle instructions (we could optimize V8 codegen here to get rid of the `mov`), whereas the `blend*` instructions on many processors take 2 cycles. A non-1:1 mapping on Intel architectures doesn't always mean suboptimal performance. I'm going to agree with @tlively here that I don't think a change is warranted for the ROI. But if you think it would be, please follow up with a concrete proposal of what the operation semantics would be, and preferably performance data to warrant the additional work for the change of semantics.
Very often the mask in bitselect comes from a comparison instruction. In this case the x64 code generator can elide the extra instructions and just use blendv directly. On SKX, blendv is 1 uop with a throughput of 1, vs. the and/andnot/or sequence, which is 3 uops but where each instruction has a throughput of 3/clock, so it's unclear to me which one is faster. Could be worth an experiment on the engine side though, or is it too much work to trace the "origin" of the value that feeds into bitselect? This is loosely equivalent to a separate mask type, but doesn't require spec changes.
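Roughly, the fusion would look like this (an illustrative intrinsics sketch, not V8's actual lowering):

```c
#include <smmintrin.h>  /* SSE4.1 */

/* A comparison always produces an all-bits mask (all-ones or all-zeros
 * per lane), so a bitselect fed by it can lower straight to blendv:
 * the top bit of each byte reflects the whole lane. */
static __m128i select_eq(__m128i a, __m128i b, __m128i x, __m128i y) {
    __m128i mask = _mm_cmpeq_epi32(x, y);
    return _mm_blendv_epi8(b, a, mask);  /* a where x == y, else b */
}
```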
Even if the instructions dispatch in the same cycle with and/andnot/or, they will not retire in the same cycle due to the dependencies. Depending on the surrounding code, it's possible that they will reorder and the extra latency will not be measurable but they do end up using more registers as well and they do make the function longer. Both of these factors can impact whether a function will be inlined or not.
And/andnot/or often can execute on more pipes than blendv although that isn't always the case. For example, on Ryzen logical operations can use pipe 0, 1, 2, or 3 while blendv can use pipe 0 or 1.
I think any benchmark we can cook up to measure this will be synthetic in nature and might not reflect the real world. There might not be an always-optimal choice. IMO either way is fine, but if I had to pick one, I would opt for blendv. It's what I use in my math lib as well.
@zeux In V8 we could trace the origin, but most Wasm engines are by design more simplistic than that, i.e. the operations are expected to map directly to generated code without further optimization. Though we could argue that we already pattern-match for shuffles, so we could justify adding more optimizations here. I'm just not sure that this particular case warrants that optimization.
@nfrechette Synthetic benchmarks don't represent real-world data, but what I was proposing is that if there is a need to change the semantics, a concrete proposal to prototype/benchmark is required. I do agree that in this particular case performance either way is very application-dependent, and neither option significantly outweighs the other, so I'm unwilling to put too much effort into this particular operation.
@dtig I guess my point was that I don't think we need to change the wasm spec here since the opportunity for better codegen is left open. If an engine does any sort of stack->SSA conversion then this tracking is probably simple; if an engine doesn't do that it's going to miss out on a lot of other opportunities such as shifts.
@zeux I agree that we don't need a spec change here, but not necessarily for the same reasons. My earlier comment was trying to clarify that there were some speculative spec changes discussed earlier with different performance tradeoffs, and none of those seemed significantly better than the other. It was more of a suggestion for @abrown to have a concrete proposal so we can discuss actual tradeoffs or prototype if necessary.
Re. tracking, I think that we could do it in the future as an additional optimization, but wanted to point out that for wasm in general it's been an explicit design choice not to do so, and most engines make the same choice.
@dtig

> point out that for wasm in general it's been an explicit design choice not to do so, and most engines make the same choice.
I do not believe the current shift situation is truly viable without tracking constantness of arguments. On both ARM and x64 shifts take a lot of instructions, up to 9-10, if the engine doesn’t know these are constants. Does this mean we need to introduce constant versions of all shifts, or is that different?
Raising this because I was under the impression that all engines will do this but I am unsure now.
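For example, for i16x8 shifts the gap looks roughly like this in intrinsics (an illustrative sketch; i8x16 is far worse since x64 has no byte-granularity shifts):

```c
#include <emmintrin.h>  /* SSE2 */

/* Shift count known constant: a single psllw with an immediate. */
static __m128i shl_i16x8_const(__m128i v) {
    return _mm_slli_epi16(v, 3);
}

/* Dynamic count: wasm takes the count mod the lane width, so the engine
 * must mask it and move it into a vector register first. */
static __m128i shl_i16x8_dynamic(__m128i v, int count) {
    return _mm_sll_epi16(v, _mm_cvtsi32_si128(count & 15));
}
```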
I think I misunderstood the point you're trying to make with tracking the origin: tracking constantness and tracking the origin mean different things to me. Deducing that a value is a constant, and optimizing on that constantness, is a necessary optimization that most engines will do (32/64-bit constants are obviously an easier case). Separately tracing the origin of the value that feeds into the bitselect, which is also possible when the origin is a compare, and merging codegen for the two operations, is what I meant in my previous comments when I said this would be an additional optimization.
I guess I thought that these are the same because for both you need to identify the instruction that pushed the value to the stack slot - in one case it’s a constant integer, and in another case it’s a comparison instruction. Which is why I thought that either it’s reasonable to expect to be able to do both, or neither.
@dtig, I don't feel there is much of a need for making Jan's proposal above more concrete; after clarifying above, it is clear enough to me what would need to happen for us to make this instruction more performant. What isn't clear to me is how these semantics were chosen: is it really as mentioned above "because ARM does it that way?" I think that's the type of thing that needs to be made more concrete.
@zeux, I added a cranelift issue along the lines of what you proposed. Thanks! I think that is the way forward in the short term and I can piece together a vision of how that could work in the current cranelift infrastructure. I don't think we should completely abandon @jan-wassenberg's mask idea (it's a great idea, BTW!) because it will be necessary for applying this optimization through function calls.
And @jan-wassenberg, do you see this "new type" approach being applicable anywhere else?
@zeux I think you could do both, it's a higher level optimization that we disable for Wasm codegen because in general Wasm operations should map very closely to native instructions. Constants at least in V8 are special, and that knowledge is built into input nodes at all levels - so deducing that it's a constant is easier than deducing which operation generates it. This is more an engine implementation detail, but I see how they can both be interchangeable.
@abrown The original semantics came from the time when the SIMD.js proposal was transitioned over to the one for SIMD in Wasm, and this was before my time. My speculative answer, though, is that these were chosen as a good starting point, with application feedback expected to influence tweaking the semantics. AFAIK, this hasn't been problematic for applications, with the prototype implementation in V8 being in use for some time. If you are looking for more specific reasons for the initial semantics, I would redirect you to @PeterJensen and @arunetm, who were involved in the SIMD.js specification.
Thanks @abrown :) I suppose a type could also be useful for swizzle() inputs that are known to be >= 0x80, which would allow using PSHUFB directly. Probably small potatoes, though - much less common/important than 0xFF..FF masks.
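Roughly (an illustrative sketch; the saturating 0x70 bias is one known way to satisfy wasm's out-of-range-gives-zero rule on x64):

```c
#include <tmmintrin.h>  /* SSSE3 for _mm_shuffle_epi8 */

/* Generic i8x16.swizzle lowering: saturating-add 0x70 pushes any index
 * >= 16 to >= 0x80; pshufb then zeroes those lanes, while in-range
 * indices are unchanged modulo 16. */
static __m128i swizzle_generic(__m128i v, __m128i idx) {
    __m128i bias = _mm_set1_epi8(0x70);
    return _mm_shuffle_epi8(v, _mm_adds_epu8(idx, bias));
}

/* If out-of-range indices are already known to be >= 0x80 (the property
 * suggested above), pshufb alone suffices. */
static __m128i swizzle_presanitized(__m128i v, __m128i idx) {
    return _mm_shuffle_epi8(v, idx);
}
```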
What about NaNs? I'm still trying to think this through and it still is very hazy, but would a new type guaranteeing that a float isn't a NaN be possible? If I saw that type I could probably avoid some extra instructions.
Brief summary of the history of bitselect: SIMD.js started with a bit-select, later added an element-wise select, and later removed bit-select to simplify the initial proposal (with the expectation of re-adding it in the future). With the switch to Wasm SIMD, SIMD.js's types were removed in favor of a single `v128` type, but there wasn't consensus on how to handle booleans, so element-wise select was removed and bit-select was reintroduced to fill the gap.
@abrown Guaranteeing that a float isn't a NaN is hard, because every arithmetic operator has a case where a NaN is produced from non-NaN inputs: `Infinity + -Infinity` => NaN, `0 * Infinity` => NaN, `0 / 0` => NaN, `sqrt(-1)` => NaN. So you'd need a lot of dynamic checks.
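(For reference, the usual dynamic check exploits the fact that NaN never compares equal to itself; an illustrative sketch:)

```c
#include <xmmintrin.h>  /* SSE */

/* cmpeq(x, x) yields all-ones only in non-NaN lanes, so any lane
 * failing x == x holds a NaN. */
static int f32x4_has_nan(__m128 x) {
    __m128 eq = _mm_cmpeq_ps(x, x);
    return _mm_movemask_ps(eq) != 0xF;
}
```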
@abrown Is there more that you were looking for here, or can this be closed?
I think the closure for this should be documentation in an "implementation guide."