relaxed-simd SIMD subgroup meeting on 2022-09-30

This is a joint meeting with flexible-vectors and any SIMD related presentations are also welcome, e.g. new architectures or toolchains, longer term discussion that don't fit into flexible-vectors or relaxed-simd.

The meeting will be on Friday, September 30 at 9:00AM - 10:00AM PDT/ 5:00PM - 6:00PM CET.

If this meeting doesn't already appear on your calendar, or you are a new attendee, please fill out this form to attend.

Sep 01 '22 22:09 ngzhian

@penzn fyi

Sep 01 '22 22:09 ngzhian

@penzn I will be OOO on this day, can you please run it?

Sep 16 '22 23:09 ngzhian

Yes, I will.

Sep 17 '22 03:09 penzn

Seems like there are no agenda items, should we cancel? Keep in mind that I can't send cancellation notices.

Sep 29 '22 17:09 penzn

Given that this is the last meeting before the in-person CG meeting and a phase 4 poll for relaxed SIMD, should we wrap up the discussion on https://github.com/WebAssembly/relaxed-simd/discussions/85? We could include a poll ratifying the final instruction set at that meeting, and continue async discussion too. I just wanted to make a note here that we won't have another subgroup meeting before the final vote.

Sep 29 '22 18:09 dtig

Seems like a good idea, let's meet.

Sep 29 '22 18:09 penzn

Sounds good, as @ngzhian is out, let's make sure that @Maratyszcza is able to attend so if we do take any votes one of the champions will be present.

Sep 29 '22 18:09 dtig

On that note, given that we are approaching phase 4, I am curious what is the resolution of #71 (encoding non-determinism), though this might be a question to @ngzhian.

Sep 29 '22 18:09 penzn

I will attend tomorrow

Sep 29 '22 18:09 Maratyszcza

I'd also like to attend tomorrow to discuss the non-determinism point that @penzn mentioned in advance of the phase 4 vote. I've just filled out the form linked above.

EDIT: commented here regarding a relevant point I'd like to discuss.

Sep 29 '22 21:09 conrad-watt

@conrad-watt Sent a meeting invite to the email in your GitHub profile.

Sep 30 '22 05:09 Maratyszcza

I'll post a better summation comment later, but for now here's a raw link to my slides.

Sep 30 '22 17:09 conrad-watt

PR to add notes is open WebAssembly/meetings#1122

I suggest we have one more meeting before the in-person meeting. @dtig, @ngzhian what do you think?

Sep 30 '22 17:09 penzn

This comment is to sum up my position and attempt to distill some conversations and points that were raised at the previous meeting.

I'm focussing on the fma discussion here since I believe it's the only example in this proposal where we're relying on the precision of the current "list" non-determinism in comparison to the more permissive "set" non-determinism. If this belief is wrong, please correct me!

big picture

At the big picture level, there seem to be three values in tension here - one might want to...

ensure that "reasonable" programs and compilation schemes are semantically well-behaved (that is, staying within the envelope of guarantees provided by the specification).
avoid the need for a new version of non-determinism in the specification which requires persistent implementation/platform "knowledge".
avoid instructions with significantly inconsistent performance across platforms.

There seem to be two reasonable ways forward:

(a). the incumbent "list" non-determinism, which compromises on (2)
(b). my suggested "set" non-determinism (i.e. like Wasm's existing floating point NaN non-determinism) with single-rounding fma, which compromises on (3)

To be clear, while I'm advocating for solution (b), I think solution (a) is still ok. There's also a space of solutions I (currently) wouldn't be happy with, such as relaxing to set non-determinism without adding single-rounding fma, which AFAIU would completely torpedo value (1) above.

fma uses

There are three main scenarios for fma use that we care about, as noted by @ngzhian in this comment.

In the first scenario, a source program assumes the existence of a single-rounding fma intrinsic/primitive. To implement this primitive, solution (a) must compile two modules and dynamically determine which one to instantiate based on the rounding behaviour of qfma. If qfma is not single-rounding, the single-rounding fma primitive must be software-emulated at the Wasm level. In solution (b), the compilation scheme can just use the deterministic fma instruction, essentially deferring potential software emulation to the engine.

In the second scenario, a source program has two paths - one where the fma primitive is single-rounding and fast, and one where it is slow/absent. In this scenario, both solution (a) and solution (b) test qfma to see which path to take.

In the third "don't care" scenario, both solutions can just use qfma.

points during discussion

list non-determinism makes testing easier

If I understand correctly, the point is that code using qfma can be currently compared against an oracle which checks both the "always single-rounding" and "always double-rounding" outcomes, instead of needed to worry about the spec's allowance of arbitrary choice. Pragmatically, such an oracle could still be used even if the spec allows set non-determinism - the oracle would technically be formally imprecise but any discovered inconsistencies would still be valuable as implementations would be expected to consistently apply one rounding choice (i.e. the inconsistency is likely a program bug). It's also worth noting that this approach doesn't scale well to multiple list non-determinism instructions, where all combinations of list choice across the instructions would need to be considered.

we have already given up on consistent performance to some extent (e.g. popcnt)
counterpoint - software-emulated single-rounding fma may be so slow that you would never want it to actually execute

It would be good to know the extent to which this latter point is true. If toolchains could find value in targetting single-rounding Wasm-level fma even if it may be software-emulated, that would be a point in favour of solution (b). It's also worth noting that solution (a) must also software-emulate single-rounding fma in the first scenario at the Wasm level, given @ngzhian's point that we want to support such code (i.e. instead of instruction-level performance inconsistency, we have the same performance inconsistency inserted at the compilation scheme level).

single-rounding fma isn't vector-specific and would represent scope-creep of the "relaxed SIMD" proposal

I view this instruction as helping the relaxed SIMD proposal to require a less ambitious extension to the specification's concept of non-determinism, since I don't think a design where we generalise to set non-determinism is acceptable without it. There is a small aesthetic wrinkle in lacking scalar fma, but we can always consider it as a separate proposal (if ever), and I'd be happy to make this argument to the wider group during the phase 4 vote if anyone complains.

other points regarding set non-determinism

it may help support code migration more efficiently
it is a conservative choice (modulo the new fma instruction requirement), in the sense that refining the specification to list non-determinism later is web-compatible

Oct 03 '22 15:10 conrad-watt

Thanks for the excellent write up, @conrad-watt!

In solution (b), the compilation scheme can just use the deterministic fma instruction, essentially deferring potential software emulation to the engine.

My understanding (and @Maratyszcza, please correct me if I'm wrong) is that in all cases there is a better option than emulating single-rounding vector fma, so no program would actually want to use this proposed fma instruction.

In hypothetical code migration scenarios, code that starts on a fast-fma machine and goes into the code path using the fma instruction but then gets migrated to a slow-fma machine would fall off such a catastrophic performance cliff that the migration could not be seen as anything other than a bug. List non-determinism usefully formalizes this as a bug.

The reasonable fixes for that bug (no matter what the spec says) would be for the engine provider to 1) limit migration to pools of machines offering the same native semantics or 2) to always use the reference implementation of (q)fma. Both of these options at least provide "predictable" performance (i.e. no performance cliffs) and both of them make qfma an acceptable alternative to fma.

I do agree that it would be nice if we didn't have to spec list non-determinism, but I also see it as a useful part of the proposal that carries its own weight. In contrast, I would consider a deterministic fma instruction to be a large footgun that is not useful in practice.

Oct 03 '22 22:10 tlively

My understanding (and @Maratyszcza, please correct me if I'm wrong) is that in all cases there is a better option than emulating single-rounding vector fma, so no program would actually want to use this proposed fma instruction.

I can believe this, but there seems to be some dissonance with scenario (1) in @ngzhian's comment - such code would either have to fail to compile in the current solution, or have a software emulation path predicated on testing qfma that's at least as slow as engine-emulated fma. So either we're already ok with such code having worse performance characteristics on architectures where qfma is double-rounding, or we're actually not intending to support such code (because the slow path is slow enough to be a "bug").

In hypothetical code migration scenarios, code that starts on a fast-fma machine and goes into the code path using the fma instruction but then gets migrated to a slow-fma machine would fall off such a catastrophic performance cliff that the migration could not be seen as anything other than a bug. List non-determinism usefully formalizes this as a bug.

With list non-determinism, if the code starts on a machine which has single-rounding qfma, any migrations must preserve this even if the migrated-to machine can't implement it efficiently, and even if the code doesn't care - so I think the performance cliffs in code migration are more hazardous with the current solution, which can't distinguish "care" vs "don't care" uses of qfma. With solution (b) qfma can be migrated and switch from single- to double-rounding, and uses of fma explicitly mean "care", at the risk of a performance cliff.

Oct 03 '22 23:10 conrad-watt

With list non-determinism, if the code starts on a machine which has single-rounding qfma, any migrations must preserve this even if the migrated-to machine can't implement it efficiently, and even if the code doesn't care - so I think the performance cliffs in code migration are more hazardous with the current solution, which can't distinguish "care" vs "don't care" uses of qfma.

This would be allowed according to the spec, but is user-hostile enough that I expect code-migrating engine providers would provide one of the two solutions I described instead, so I would be surprised if this becomes a concern in practice (assuming we ever get code migrating engines that would make any of this a concern in practice).

I don't want to get too deep in the weeds here since it's all hypothetical, but I do think it's reasonable to assume as a starting point that engine providers and their users are aligned in trying to prevent performance cliffs.

Oct 04 '22 03:10 tlively

tlively:

I do agree that it would be nice if we didn't have to spec list non-determinism, but I also see it as a useful part of the proposal that carries its own weight.

I have to wholeheartedly disagree. It's not a useful but a particularly harmful part of this proposal. The technical term "list non-determinism" obscures what it really is: the introduction of implementation-dependent behaviour.

This destroys one of the basic value propositions of Wasm, that also happens to be one of its most popular features, namely that it's 100% portable, i.e., the (guaranteed) semantics is the same everywhere. As such, this extension has much broader (negative) impact beyond just SIMD, and I'm still highly concerned about it. If there is an even half-way decent alternative, then we should take that.

Oct 04 '22 07:10 rossberg

The entire point of this proposal (no matter how we spec it) is to introduce implementation-defined behavior in situations where the performance win is worth it, and embracing that in the spec better captures both its intent and how it will be used in practice.

Applications and engines for which this is a bad trade off can continue to not use or provide relaxed SIMD.

@rossberg, can you explain how it’s any better to specify this implementation-defined behavior as set non-determinism rather than list non-determinism if that doesn’t change the assumptions programmers make or the way the proposal is used? AFAIU, we were discussing switching to set non-determinism only because it would be cleaner and more convenient to have only one kind of non-determinism in the spec, not because it somehow avoided having implementation-defined behavior.

Oct 04 '22 13:10 tlively

I think it'll be useful to have specifics to look at from the application perspective, @ngzhian /@Maratyszcza I remember there were code links previously shared for how applications would use this? Would you be able to link it here?

This proposal does expose some implementation defined behavior regardless of how we decide to specify the proposal, so the core of the discussion seems to be what exactly do we want the spec to reflect? I have a mild preference for keeping the Spec flexible, i.e. set non-determinism as that seems to be more flexible of the two approaches, and we can always go towards option (a), but not the other way around. Concretely I'd like to make sure we understand what's involved to have only option (b), with option (a) being the status quo, which means that we won't need additional work apart from what's already in the tools/engines/spec (would probably be useful to have data points from @ngzhian about where we landed on post-phase 3 feedback as well). The intent is to understand how much work is involved across the layers, and if it would be worth supporting it vs. pushing for option(a). To only support option (b),

Include a deterministic FMA
Instructions that lower to baseline FMA implementations, will need a deterministic lowering
Engines need to support both FMA(possibly by calling out to the runtime)/QFMA
The tools portion of this is less clear to me. Aside from an additional instruction being supported @conrad-watt mentioned something about different code in the tools that would target which FMA instruction to generate, perhaps I missed or misunderstood something?

In the ideal case, applications know how to handle FMA/QFMA which this proposal already assumes they would need to. In the case where these instructions aren't used correctly, my opinion is that working with terrible performance is still better than erroring out, or unexpected results.

Oct 04 '22 17:10 dtig

something about different code in the tools that would target which FMA instruction to generate, perhaps I missed or misunderstood something?

For a toolchain currently built around option (a), it should be fairly simple to switch to option (b). Currently it's already necessary for the toolchain to condition on the behaviour of qfma both to implement a "guaranteed single-rounding fma" primitive, and a "single-rounding fma is fast" check, and the structure of these checks don't change (or get eliminated) with option (b). If the toolchain is currently not implementing these primitives and just doing a "don't care" compilation to qfma, it can continue to do so.

I also think it's worth emphasising again that option (a) delivers no more performance to code that relies on a single-rounding fma primitive than option (b). Instead it's a case of whether the performance cliff is inserted as separate Wasm-level paths by the toolchain (option (a), testing qfma and falling back to software emulation), or by the engine/language level (option (b), implementing fma though emulation, but still countenanced by the toolchain emitting fma instead of qfma). This gets back to the "value 2" vs "value 3" tension I outlined above. If option (b) is too slow here, then so is option (a) (at least at a program level).

Oct 04 '22 18:10 conrad-watt

I will note that currently no toolchain reasons about any of this and that all relaxed SIMD usage (that I'm aware of) comes from users hand writing either intrinsics that are opaque to the compiler or raw assembly. In my experience, SIMD programmers care deeply about what particular engines do in practice, so I'm skeptical of any solution that requires them to follow certain patterns to guarantee some level portability to other engines. The more scalable solution is to get the spec to meet SIMD programmers where they are and make better portability guarantees about the code they naturally want to write.

Oct 04 '22 18:10 tlively

I will note that currently no toolchain reasons about any of this and that all relaxed SIMD usage (that I'm aware of) comes from users hand writing either intrinsics that are opaque to the compiler or raw assembly ... I'm skeptical of any solution that requires them to follow certain patterns to guarantee some level portability to other engines. The more scalable solution is to get the spec to meet SIMD programmers where they are and make better portability guarantees about the code they naturally want to write.

I agree - I'm assuming that the main relevant intrinsics are a "guaranteed single-rounding fma" primitive, a "don't care fma" primitive, and possibly an "is single-rounding fma supported/fast?" primitive. Both solution (a) and (b) need to do roughly the same things to support these, with roughly the same performance characteristics.

EDIT: if the point is that SIMD programmers may want to work with intrinsics that have a 1-to-1 correspondence with Wasm operations, then the difference between solution (a) and (b) is that in (a) the programmer must perform a qfma test to predict semantic behaviour (single-rounding), while in (b) the programmer must perform a qfma test to predict performance of the single-rounding fma primitive. One doesn't seem strictly preferable to the other.

Oct 04 '22 19:10 conrad-watt

I think it'll be useful to have specifics to look at from the application perspective, @ngzhian /@Maratyszcza I remember there were code links previously shared for how applications would use this? Would you be able to link it here?

https://hal.inria.fr/inria-00000895/document look for Fast2Mult (for posterity, mentioned in #71)

Oct 12 '22 19:10 ngzhian

(b). my suggested "set" non-determinism (i.e. like Wasm's existing floating point NaN non-determinism) with single-rounding fma, which compromises on (3)

(Going to repeat some stuff here to make sure I understand things correctly.)

The purpose of "list" non-determinism is to enforce that 1 and 2 perform the same number of rounding:

if (qfma(x, y ,z)) // 1
  qfma(x, y, z)    // 2
else
  // fall back to some other code path

"set" non-determinism does not guarantee this.

Reviewing your slides, i think this is covered in slide 15 "Scenario (2)", where there is a test for qfma single rounding, if the test succeeds, the assumption is that deterministic fma is fast.

bool b = <test if qfma is single-rounding>

if (b)
  instantiate module A - wants fma to be fast // assumption here that single rounding qfma == fast fma
else
  instantiate module B - avoids use of  fma

This could be an unexpected performance cliff, as this assumption is not guaranteed. But I guess that with the original "set" non-deterministic qfma, we also have the assumption that single rounding qmfa is a fast fma. It would be nice if we can make this connection between speed and rounding explicit (but I know the spec doesn't say anything about performance, so perhaps this is not possible).

Follow-up question will then be, if we add a fma instruction, how will engines implement it? There is no fast emulation path for an fma, will it be via math.h?

Re: fma is neither relaxed or SIMD. It does feel weird adding them to this proposal, but my goal is to advance this proposal, and so I will do what it takes to reach this goal :) This might be an opportunity to try a "fast-track" proposal as well...

Oct 12 '22 20:10 ngzhian

@ngzhian

(Going to repeat some stuff here to make sure I understand things correctly.)

I think your summary is exactly right!

... [t]his could be an unexpected performance cliff, as this assumption is not guaranteed. But I guess that with the original "set" non-deterministic qfma, we also have the assumption that single rounding qmfa is a fast fma. It would be nice if we can make this connection between speed and rounding explicit (but I know the spec doesn't say anything about performance, so perhaps this is not possible).

(I'm assuming you mean 'original "list"' here) Yes, in both solutions the performance expectations are not normatively guaranteed by the spec abstract machine. It would be appropriate to add a fairly prominent note in the spec saying "if qfma is single-rounding, expect [q]fma to be fast".

Follow-up question will then be, if we add a fma instruction, how will engines implement it? There is no fast emulation path for an fma, will it be via math.h?

Yes, there would be some slow emulation path. A real performance cliff at the instruction level, but my belief (to be challenged!) is that anywhere this performance cliff is hit, an analogous program-level performance cliff (or correctness bug) would be hit in solution (a) (due to lack of single-rounding qfma). I don't have much of an informed opinion on emulation strategies, I'm just keying off the two previous comments (and associated links) here.

Re: fma is neither relaxed or SIMD. It does feel weird adding them to this proposal, but my goal is to advance this proposal, and so I will do what it takes to reach this goal :) This might be an opportunity to try a "fast-track" proposal as well...

At least with respect to "SIMD-ness", presumably the same arguments apply to qfma in that there could also be scalar versions? fma is very linked to qfma in terms of performance expectations etc, so I'm personally comfortable with it fitting thematically into the "relaxed" proposal (also, single-rounding fma is "relaxing" a design principle of Wasm with respect to predicable performance, existing examples like popcnt notwithstanding).

Oct 13 '22 00:10 conrad-watt

It would be appropriate to add a fairly prominent note in the spec saying "if qfma is single-rounding, expect [q]fma to be fast".

Yes, we should definitely do this if we go down the add-fma path.

(I'm assuming you mean 'original "list"' here)

Yes, oops!

At least with respect to "SIMD-ness", presumably the same arguments apply to qfma in that there could also be scalar versions? fma is very linked to qfma in terms of performance expectations etc, so I'm personally comfortable with it fitting thematically into the "relaxed" proposal (also, single-rounding fma is "relaxing" a design principle of Wasm with respect to predicable performance, existing examples like popcnt notwithstanding).

Interesting thought. We can take this chance to get scalar fma and scalar qfma in as well. It's not a lot more work, represents a bit of scope creep, but with good reasons. We have a SIMD sync tomorrow, I will bring this up for discussion (if you won't be there). Thanks!

Oct 13 '22 16:10 ngzhian

Interesting thought. We can take this chance to get scalar fma and scalar qfma in as well. It's not a lot more work, represents a bit of scope creep, but with good reasons. We have a SIMD sync tomorrow, I will bring this up for discussion (if you won't be there). Thanks!

My inclination is that scalar fma and qfma could be "quick" proposals following on from relaxed SIMD - the incumbent proposal was only going to have vector qfma anyway so I think the argument is the same whether we go with solution (a) or (b). Vector fma is just the minimum addition to facilitate set non-determinism, given that vector qfma is part of this proposal.

In any case, I'll be at the meeting so happy to discuss this further.

Oct 13 '22 16:10 conrad-watt

I want to clarify the meaning and the implication of the non-determinism classes highlighted here as I couldn't find them defined anywhere.

"set" non-determinism says "when the spec provides a set of possible outputs for a given instruction, an implementation must consistently choose the same index of the output regardless of the input value/instruction placement"
"list" non-determinism says "when the spec provides a set of possible outputs for a given instruction, an implementation is free to generate any output in every particular instruction / for any class of inputs"
As a consequence, in "list" non-determinism, any program that uses quasi-fma (fma that might do single or double rounding), may get different results for the same inputs even within the single module
As a consequence of that, in any use of qfma the program that does two passes over the input data may compute internally inconsistent results, whereas in "set" non-determinism this can never happen
As an additional consequence of that, "list" non-determinism makes it impossible to reliably detect if the qfma is doing single or double rounding for the purpose of selecting a different algorithm at runtime

Is this reasonably accurate?

Oct 14 '22 15:10 zeux

@zeux your "set" and "list" non-determinism should be reversed (although I should be clear that this is terminology I made up for this discussion!).

Wasm's existing non-determinism is "set" non-determism - the spec provides a (mathematical, unordered) set of possible outputs for each operation, and each instance of the operation is free to pick a different member of the set.

This proposal's non-determinism is "list" non-determinism - the spec provides an "ordered" list of possible outputs, and each instance of the operation must pick the same index of that list.

I'm proposing a modification to this proposal with the aim of eliminating the need for the new kind of non-determinism.

Oct 14 '22 15:10 conrad-watt