
We need persistent floating-point rounding mode control

KloudKoder opened this issue 3 years ago · 59 comments

There's some history to this. Bear with me.

Typical floating-point rounding policies are: chop, round-negative, round-positive, and round-nearest-or-even. Intel made the dumb decision of encoding this policy in the floating-point control word, which means that if you want to change it frequently, you need to manipulate the control word (FLDCW) all the time, which is expensive.

nVidia realized that this was a nightmare for interval arithmetic, which has risen in prominence for scientific computing in the past several years, mainly because it requires constant changes in rounding policy (because, usually, the lower limit goes down while the upper limit goes up after every floating-point instruction). So nVidia did things the right way: they encoded the rounding policy inside the instruction itself.

Looking at your spec, it seems like round-nearest-or-even is the only supported mode in most cases. That makes interval math effectively impossible, which will only limit the utility of WASM for scientific applications.

Please allow us to prefix or otherwise alter floating-point instructions so as to specify the rounding mode. On Intel/AMD, you can just optimize the prefix away if it's the same one as before (and no intervening branch targets exist). If the prefix is never used, then everything is backward-compatible to round-nearest-or-even, so the control word never needs to be written; this is easily checked during code validation. But on nVidia, the prefix gets sucked into a bitfield in the instruction.

Please implement this. It will enable more applications, and more useful ones, than you might currently imagine.

KloudKoder avatar Oct 11 '20 02:10 KloudKoder

The best way to fit new rounding behavior into WebAssembly would be as new instructions (rather than an instruction prefix or a persistent rounding mode or something like that). Could you list the new rounding instructions you would like to see?

tlively avatar Oct 12 '20 19:10 tlively

@tlively Fair enough. I've never been a huge fan of prefixes, either.

To be clear, there's rounding in the sense of "round an existing float to an integer", and then rounding in the sense of "nudge the unit in the last place (ULP) according to the infinitely precise result".

In the first case, we can get by with the usual suspects:

a. Round nearest-or-even.
b. Round toward negative infinity.
c. Round toward positive infinity.
d. Chop toward zero (truncate).
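For illustration, C's math.h already exposes all four of these for the float-to-integer case (a minimal sketch, nothing WASM-specific about it):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = -2.5;
    /* nearbyint uses the current rounding mode; the default is
       nearest-or-even, so the -2.5 tie goes to the even neighbor. */
    printf("nearest-or-even: %g\n", nearbyint(x)); /* -2 */
    printf("toward -inf:     %g\n", floor(x));     /* -3 */
    printf("toward +inf:     %g\n", ceil(x));      /* -2 */
    printf("chop (truncate): %g\n", trunc(x));     /* -2 */
    return 0;
}
```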

In the second case, we can apply the same rounding policies, but to the ULP instead of the integer part. Implicitly, we first have to know the maximum error bounds on the infinitely precise result, because if they're too big, then the ULP is still undetermined. For arithmetic operations, this is straightforward and already done right in hardware. For operations which may produce irrational results, it's first necessary to know the error bounds on the truncated portion (due to series truncation). One hopes that the CPU designers did this right. I have serious doubts about that, but it might be correct for fsqrt, which AFAIK is your only such operation.

You probably can't ever support other transcendental functions in a well-defined and cross-platform way, because the CPU designers probably didn't really nail the error terms. Said functions would need to be implemented by the programmer in terms of your existing arithmetic operations. Slow, but well-defined, cross-platform, and futureproof. Lots of cores, lots of free cycles, no worries. The only other option would be to hope that all CPU vendors implement the same transcendental functions in exactly the same way. Things get really complicated if that happens to be true because they're all wrong in the same way, which is why I think the software emulation approach is a necessary evil.

Obviously, the first case can be done after the fact because we just need to dispose of the fractional part. But in the second case, everything needs to happen in one instruction because once an instruction completes, any extra bits of precision have been forgotten. So if you want individual instructions, you need stuff like "multiply and round the ULP toward negative infinity" and "square root and truncate everything after the ULP". The good news is that, if you do it this way, then it will fit like a glove with nVidia, which is probably more relevant going forward than AMD64. The bad news is that you'll have to employ some smarts to avoid touching the control word too often on the latter. And by the way, all your existing arithmetic instructions would now be suffixed with "and round nearest-or-even".
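To make the cost concrete, here's roughly what one such instruction, "multiply and round toward negative infinity", has to look like today on control-word hardware (a hedged C sketch using fenv.h; assumes the compiler honors FENV_ACCESS):

```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Two rounding-mode writes bracket a single multiply. This is exactly
   the per-instruction control-word traffic the proposal wants to avoid. */
static double mul_round_down(double a, double b) {
    int old = fegetround();
    fesetround(FE_DOWNWARD);
    double r = a * b;   /* rounded toward -inf */
    fesetround(old);    /* restore: another pipeline flush */
    return r;
}
```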

I'll pause here. Are we on the same page?

KloudKoder avatar Oct 14 '20 00:10 KloudKoder

Yes, the philosophy here sounds good. We've had similar discussions over in the SIMD proposal. The next part is making sure that the proposed instructions have (reasonably) efficient lowerings on relevant architectures (at least x86/x86-64/ARM). I'm not sure how nVidia fits into the picture, though. WebAssembly doesn't typically run on GPUs. It would also be good to mention what kinds of languages or applications would benefit from having these instructions, in order to motivate working on them.

tlively avatar Oct 14 '20 00:10 tlively

"WebAssembly doesn't typically run on GPUs." -- It certainly does, in 2029.

But allow me to suggest a potentially better solution for the second case. It seems to me that the only practical application for ULP rounding is interval math, full stop. Therefore, you could cut down the explosion of instructions and improve performance by providing native interval support. So "divide 64-bit floats" becomes "divide 64-bit float intervals". Yes, there are lots of corner cases, but by doing it this way, you would solve all those problems once and for all, allowing higher-level languages to translate interval types directly into WASM. I am entirely sympathetic to the need to avoid CISC nonsense, but intervals are so powerful and so universal that I think it's worth being a little CISCy in this case.

The other reason to implement native intervals is that higher-level languages do not, to my knowledge, have primitives for ULP manipulation. In many languages, it wouldn't even mean anything because floats are undefined garbage anyway. But it's easy to imagine adding macros to support basic interval functions in popular languages, which then filter straight down to the WASM layer. No chance of creating super-subtle bugs due to misguided ULP manipulation.

If I recall, an Ariane 5 rocket blew up back in the '90s due to a floating-point failure which would have been prevented with intervals. And if I'm not mistaken, the safety of the Large Hadron Collider was proven with interval analysis as well. But it's not just about safety. Computation can be exponentially accelerated, in some cases, using intervals, simply because lower precision is often acceptable, and the time savings propagate to many levels of the function tree. This facilitates much faster raytracing, for instance. Fluid dynamics stands to benefit as well. (This is all abstractly related to the Shrinking Bullseye algorithm.)

I'm intrigued by your comments regarding SIMD, but I'll leave that for another thread.

KloudKoder avatar Oct 14 '20 01:10 KloudKoder

It's not clear to me why interval arithmetic should be a part of the WebAssembly ISA as opposed to a software library that compiles to existing WebAssembly instructions. Do native architectures support interval arithmetic directly? If not, there's probably no benefit to building it into WebAssembly.

tlively avatar Oct 14 '20 01:10 tlively

These are the reasons that intervals should be supported in WASM:

  1. It's almost impossible to support them in higher-level languages due to the lack of ULP manipulation primitives. What support does exist tends to be extremely slow: the correct result can still be obtained, but only through convoluted means involving type casting.

  2. WASM is the missing link we need in order to support widespread computing-on-demand, precisely because it's well-defined down to the bit level, so it's easy to validate that outputs from untrusted nodes are identical and thus credible. Without well-defined behavior, we end up with approaches which require the programmer to do manual result comparison in a code-dependent way. It just doesn't scale.

  3. Good science and engineering needs interval math, but it's not being exploited much because it's been bolted onto higher-level languages instead of built in at the bottom. nVidia solved this problem, but there's a lack of decent IR code to access it from higher-level languages. WASM could give us that missing IR.

  4. The future of distributed computing is WASM, but only if you can make it scalable. I realize that most people think of it as a browser extension just for running faster web pages, but it could rule the distributed computing world by 2030 if you play your cards right. I can't emphasize this enough. Without intervals, we get scalable ambiguity rather than scalable clarity.

If the only acceptable solution is 4 different rounding flavors of your existing primitives, then I'll take it, but the path to fewer bugs is just to implement intervals at the bottom. This could be extended to SIMD at some point, as well.

KloudKoder avatar Oct 14 '20 02:10 KloudKoder

Wasm is generally designed as a minimal abstraction over hardware, so the design pressure is to expose (standardized) versions of what is in hardware, and not build the next higher-level concept unless it is impossible not to. In this case it would seem that rounding modes would be both closer to hardware and the more minimal addition. It's my understanding that the rounding mode control design in IEEE 754 is meant for this purpose, and has other purposes too, outside of interval arithmetic.

titzer avatar Oct 14 '20 07:10 titzer

If you do it that way, then at least the compilation process would be straightforward, along with extension to SIMD. This also means that interval implementation will need to be done by porting these rounding-aware arithmetic primitives to higher-level languages, which is more of an uphill battle. But OK, what's the next step?

KloudKoder avatar Oct 14 '20 13:10 KloudKoder

If I understand correctly, you have the same problem with high-level languages for native compilation, so I don't think that's a Wasm-specific problem that Wasm needs to solve.

bvibber avatar Oct 14 '20 15:10 bvibber

@brion Well, as I said above, I'm happy enough with the idea of just implementing rounding-aware arithmetic primitives in WASM.

The general problem with ULP manipulation, whether you solve it with such primitives or full blown interval support, is that WASM is likely to become a method whereby distributed computing is facilitated for HLLs. So in effect, it is the hardware. You can't go directly from HLL to CPU/GPU machine code and have it run on anonymous targets all over the world. (Imagine trying to run a distributed computing network which mandates that you need nVidia XYZ GPU in order to even participate.) So you have to go to middleware first, of which the obvious (and perhaps only) choice is WASM. So if the HLL programmer wants a feature that is efficiently implementable in hardware, but WASM doesn't support it, then he's stuck having to do nonperformant emulation.

KloudKoder avatar Oct 14 '20 17:10 KloudKoder

@KloudKoder , can f64.fma help with the performance of interval arithmetic libraries at least for f64?

Yaffle avatar Nov 13 '20 10:11 Yaffle

@Yaffle Can you explain your reasoning as to why multiply-add would help? For instance, how would this accelerate the process of adding 2 intervals, given that presumably it's still going to do nearest-or-even?

KloudKoder avatar Nov 16 '20 01:11 KloudKoder

It helps to find the error of the addition, multiplication, or division of two f64 numbers:

sum = a + b; error = -sum + absmax(a, b) + absmin(a, b)
product = a * b; error = fma(a, b, -product)
quotient = a / b; error = -fma(quotient, b, -a)

If you know the error, you can round the value by the smallest amount, incrementing it (for example, when error > 0 and rounding toward positive infinity) or decrementing it (for example, when error < 0 and rounding toward negative infinity). This way, the resulting interval will be tightest. If you don't need the tightest interval, you could just increment/decrement both ends... Btw, there is a website with some benchmarks: http://verifiedby.me/kv/rounding/index-e.html#:~:text=Benchmark
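Spelled out in C (a sketch: C's library fma stands in for the wasm f64.fma discussed below, and the usual Fast2Sum/TwoProduct caveats about overflow/underflow of the error term apply):

```c
#include <math.h>

/* Exact error of a rounded sum (Fast2Sum: operands ordered by magnitude). */
static double sum_error(double a, double b, double sum) {
    double hi = fabs(a) >= fabs(b) ? a : b;
    double lo = fabs(a) >= fabs(b) ? b : a;
    return (hi - sum) + lo;            /* sum + error == a + b exactly */
}

/* Exact error of a rounded product; fma rounds only once, so the
   subtraction inside it is exact. */
static double product_error(double a, double b, double product) {
    return fma(a, b, -product);
}

/* Residual of a rounded quotient: its sign tells which way a/b rounded. */
static double quotient_error(double a, double b, double quotient) {
    return -fma(quotient, b, -a);      /* = a - quotient*b, exactly */
}
```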

Yaffle avatar Nov 16 '20 09:11 Yaffle

@Yaffle Wait a second, are you talking about fma (which I just realized isn't actually in the spec, if the search function is to be believed) or fmax/fmin?

Just starting with addition, we have:

(a, b) + (c, d) = (RoundNegative(a + c), RoundPositive(b + d))
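For reference, here's that formula under today's global-rounding-mode machinery (a C sketch; real code would also have to keep the compiler from constant-folding or reordering these adds):

```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* (a, b) + (c, d): two control-word writes, plus a restore,
   for every single interval addition. */
static void interval_add(double a, double b, double c, double d,
                         double *lo, double *hi) {
    int old = fegetround();
    fesetround(FE_DOWNWARD);
    *lo = a + c;              /* RoundNegative(a + c) */
    fesetround(FE_UPWARD);
    *hi = b + d;              /* RoundPositive(b + d) */
    fesetround(old);
}
```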

How that would involve a multiplication, I don't presently see. And multiply-add takes 3 parameters, not 1. Clearly, I'm misinterpreting your notation.

KloudKoder avatar Nov 17 '20 02:11 KloudKoder

@KloudKoder, fma, that is, fused multiply-add. I have updated my notation.

Yaffle avatar Nov 17 '20 07:11 Yaffle

@Yaffle I think I finally understand your point. What you mean is that Intel's various FMA instructions compute:

a * b + c

to infinite precision, and then round nearest-or-even. (They don't round after just computing the product.)

This means, for one thing, that:

product = FMUL(a, b)
error = FMA(a, b, -product)

is equivalent to:

error = NearestOrEven(a*b - FMUL(a, b))

which by definition is off by no more than half a ULP, so you need only nudge the result by 1 ULP in order to obtain an upper bound on the error.

This is a valid approach, but ULP nudging carries serious overhead: you need to first treat the result like an integer, then add your ULP (as an unsigned 1), then see if you wrapped into the exponent field, or perhaps created a NaN, etc., then potentially do a rollback and update the result to infinity or whatever. There are other ways to do this, but AFAIK all involving painful overhead. And do note that I haven't demanded the tightest interval -- just a "tight enough" one for practical precision requirements. So do you have a fast way to nudge? If not, then unfortunately it's back to my original point, which is that intervals are nonperformant under the current instruction set, relative to what's actually possible on real hardware.
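To make that overhead concrete, here's just the happy path of such a nudge in C (a sketch covering only the finite, positive case):

```c
#include <stdint.h>
#include <string.h>

/* One-ULP upward nudge via integer reinterpretation, for a finite,
   positive double only. A full version also needs the zero, negative,
   overflow-to-infinity, and NaN checks described above -- that's the
   serialized overhead in question. */
static double nudge_up(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* cross from FP to integer domain */
    bits += 1;                       /* may carry into the exponent field */
    memcpy(&x, &bits, sizeof bits);  /* and cross back */
    return x;                        /* DBL_MAX becomes +infinity here */
}
```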

Also, note that I've written FMA above, which alludes to the Intel instruction set. WebAssembly's fma, by contrast, doesn't seem to exist. Do you have a link? It doesn't come up in the doc search on the website.

Finally, from your link:

"addition/subtraction/multiplication/division/sqrt including the rounding mode specification to the command itself"

That's precisely what this proposal is all about. It's great to have that capability, but we need a direct way to exploit it.

I'm happy to have my pessimism disproven...

KloudKoder avatar Nov 18 '20 02:11 KloudKoder

@KloudKoder , https://github.com/mskashi/kv/blob/2ec8bac6216d25b226c09e1740487837a4fc5fd5/kv/rdouble-nohwround.hpp#L97

  • This is another variant of a "nudge" function; it seems it only does x + ABS(x) * (2**-53 + 2**-105) when abs(x) > 2**-969, which covers most of the f64 values...

f64.fma is mentioned at https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#:~:text=fma but if you don't need the "tightest interval", you could just nudge both ends by 1 ULP... although, this could be bad (?).

Anyway, you are right that the overhead could be big.

Yaffle avatar Nov 18 '20 10:11 Yaffle

@Yaffle I now see f64.fma under "Additional floating point operators". I'm willing to assume that it will be adopted as an instruction. And definitely I wouldn't demand the tightest interval, which is unlikely to offer any practical advantages. Now, looking at your alternative nudge:

x + ABS(x) * (2**-53+2**-105)

it's clear that the point of this is to add the result (x) to itself shifted down so that its leading 1 lands in the ULP. I'm not entirely sure why we need the 2**(-105) -- maybe to break the tie case of nearest-or-even in favor of a nonzero nudge -- but in any event, here's what we would need to do:

  1. Given f64 result x.

  2. If x is +/- 0.0, then nudge it up/down to the smallest positive/negative number and quit.

  3. Subtract 53 from the exponent field. Call the result y. (In some architectures, this might require transport between float and integer registers. Let's just assume there's no cost for that.)

  4. Use a conditional set (flag-to-integer) instruction to remember if #3 resulted in a wrap of the exponent (usually not). This involves several integer instructions, however you do it, so far as I can see.

  5. OR the result of #4 to a something-failed flag, f, which is extremely likely to end up as 0.

  6. Let z, the final result, be the sum of x and y. Plain nearest-or-even rounding will ensure that z is a valid nudge of x.

  7. At some future checkpoint prior to commit, verify that f is 0, and fail out if not.

This does in fact work. It also disrupts the pipeline badly, resulting in a long series of essentially serialized instructions. This could be partially overcome by weaving together a whole bunch of interval operations which can occur in parallel.
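A C sketch of these steps, using the multiplicative form from the kv code above (the multiply stands in for the exponent-field subtraction of steps 3-5, trading the wrap check for a restriction to |x| > 2**-969; and the 2**-105 term is indeed the tie-breaker):

```c
#include <math.h>

/* kv-style outward nudge. The 2^-105 term breaks the nearest-or-even
   tie so exact powers of two still move by a full ULP. Only valid for
   |x| > 2^-969; kv handles smaller magnitudes on a separate path,
   omitted here. Zero is step 2 above. */
static double nudge_outward(double x) {
    if (x == 0.0)
        return copysign(0x1p-1074, x);     /* smallest denormal, same sign */
    return x + x * (0x1p-53 + 0x1p-105);   /* steps 3-6, nearest-or-even */
}
```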

Thus it still seems that supporting instructions with builtin rounding modes is far cheaper. One could also just ignore the trend toward interval arithmetic, and thereby make WebAssembly less useful in scientific computing. That would be a shame, though.

Thanks for helping to demonstrate the scope of this problem so thoroughly. Where to go from here?

KloudKoder avatar Nov 19 '20 06:11 KloudKoder

What if I write the Intel/AMD support for you? Would that offer you maintainers enough motivation to port it to other platforms and make it part of the spec? We can argue over the specifics and corner cases, but there's no point in even discussing them if my proposal here isn't sufficiently compelling.

KloudKoder avatar Mar 07 '21 05:03 KloudKoder

@KloudKoder thanks for offering to help implement :) Unfortunately there's a lot more than implementation that goes into getting a new proposal standardized, so this is unlikely to advance without more of the community showing interest. See one of my previous comments on why nice-to-have features often don't make the cut, even if they don't seem to have any technical downsides.

tlively avatar Mar 19 '21 22:03 tlively

@tlively I read your previous comment and I understand why new opcodes are a hard sell.

In this case, that's particularly vexing because intervals have suffered this chicken-and-egg adoption problem since the day they were first proposed for use with IEEE floats. Basically there's a great temptation to avoid using them in high level languages because they're so slow, but no one wants to do the dirty work of optimizing their performance by creating intrinsic support because (surprise!) no one finds them appealing to work with (unless they're desperate for error containment). At the end of the day, making intervals efficient and trivially accessible is a matter of ensuring data security -- just in the analog instead of the digital realm. We're wasting nVidia's (and perhaps others') great contribution of interval-ready floating-point instructions.

One glimmer of hope is that there are already public specs on the expected behavior, available both as a comprehensive body of axioms and as a "diet" version for basic arithmetic. So at least some aspects of their expected behavior are already defined for us.

KloudKoder avatar Mar 20 '21 16:03 KloudKoder

In this case adding interval arithmetic to WebAssembly wouldn't help that chicken-and-egg problem because it would not be any faster than an interval arithmetic library compiled to WebAssembly unless there is underlying hardware support. I recommend going the easier route of writing such a library for now, and if common hardware natively supports interval arithmetic in the future, it will make sense to revisit adding it to WebAssembly as well.

tlively avatar Mar 22 '21 18:03 tlively

@tlively I get that, but it's not accurate. nVidia has hardware interval primitives, in the sense of floating-point instructions with builtin rounding modes. They aren't accessible via WebAssembly.

The Intel way involves setting a control word first, which means you need to do a bunch of high limits in series, then a bunch of low limits, in order to minimize control word manipulation. Writing a library doesn't help because this is something that needs to occur on the compiler level. For example, I could create macros in a library which do interval operations in AVX2 or X87, but unless I control the compiler, I can't get it to do those reorderings so as to avoid all the crippling control word manipulation.
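Here's the reordering in question as a C sketch (a library can batch its own internals like this, but only a compiler can discover such groupings across arbitrary user code):

```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Batch all lower limits under one rounding mode, then all upper
   limits under the other: two control-word writes per batch instead
   of two per operation. */
static void interval_add_batch(const double *alo, const double *ahi,
                               const double *blo, const double *bhi,
                               double *rlo, double *rhi, int n) {
    int old = fegetround();
    fesetround(FE_DOWNWARD);
    for (int i = 0; i < n; i++) rlo[i] = alo[i] + blo[i];
    fesetround(FE_UPWARD);
    for (int i = 0; i < n; i++) rhi[i] = ahi[i] + bhi[i];
    fesetround(old);
}
```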

You may see WebAssembly as a way to accelerate websites. While I don't disagree, I also see it as a global distributed computing platform just waiting to happen. Golem, for instance, has already enabled that in some hacky way, but it could be so much more if it ensured analog security by including this type of support. This is particularly important when attempting to ensure that results obtained via different platforms (including those not using WebAssembly) are consistent with one another, as floating-point results of complicated computations are seldom exactly equal. No intervals means this effectively can't happen and WebAssembly never reaches its true potential. So there's a lot more to this than just "please make this fast".

KloudKoder avatar Mar 23 '21 03:03 KloudKoder

Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?

tlively avatar Mar 23 '21 21:03 tlively

It is pretty much ubiquitous on CPUs, not only on x86, it is what fegetround and fesetround library functions are for. The tricky part is that rounding mode is a "global" (per-thread) state which affects FP operations.

penzn avatar Mar 23 '21 21:03 penzn

@tlively

Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?

@penzn Is correct. This type of rounding support has been present on Intel for decades. As he indicated, it's a terrible design because changing the rounding mode requires a floating-point pipeline stall. nVidia had the foresight to integrate rounding mode directly into its instructions, so for instance you can "add and round toward positive infinity" or "multiply and chop toward zero".

What I'm advocating here -- if not comprehensive interval primitives -- is at least a set of opcodes and/or prefixes which will allow efficient access to the relevant nVidia instructions. Those opcodes and/or prefixes can then be at least somewhat efficiently mapped to Intel by grouping same-rounding-type instructions into dense sequences, so as to minimize rounding mode manipulation. However, this could be done as a performance enhancement; the first release could just change rounding modes literally as specified by the WASM code.

KloudKoder avatar Mar 25 '21 02:03 KloudKoder

As he indicated, it's a terrible design because changing the rounding mode requires a floating-point pipeline stall.

I don't think I indicated that ;) It is the reality of the CPU FP environment, though maybe it could have been designed better.

This type of rounding support has been present on Intel for decades.

If I understand things correctly, it exists on Arm as well, at least the same fenv functions work there. As usual, there might be discrepancies between the two on how different instruction groups interact with rounding mode.

There is a discussion about better IEEE 754-2019 support in Future Features:

To support exceptions and alternate rounding modes, one option is to define an alternate form for each of add, sub, mul, div, sqrt, and fma. These alternate forms would have extra operands for rounding mode, masked traps, and old flags, and an extra result for a new flags value.

This is similar to the original idea for "relaxed SIMD" fpenv: https://github.com/WebAssembly/design/issues/1401#issuecomment-789073168.
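For what it's worth, here is a C model of what the quoted alternate form would compute (shape and names are hypothetical; trap masking is omitted because portable C has no way to enable FP traps):

```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Rounding mode in, flags threaded through explicitly, no persistent
   state visible to the caller. Note this clobbers the ambient flag
   state, which a real implementation would have to save/restore. */
static double f64_add_alt(double a, double b, int rounding_mode,
                          int old_flags, int *new_flags) {
    int old_mode = fegetround();
    fesetround(rounding_mode);      /* e.g. FE_DOWNWARD */
    feclearexcept(FE_ALL_EXCEPT);
    double r = a + b;
    *new_flags = old_flags | fetestexcept(FE_ALL_EXCEPT);
    fesetround(old_mode);
    return r;
}
```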

The disadvantage of first-class interval support is that it would require runtime optimization on the CPU, while we don't yet have a viable GPU implementation where it could be organically supported. However, the idea in "Future Features" suffers from the same issue.

The advantage of interval math is that it can reduce instruction count (two for the price of one), but to be useful it might need to be relatively high-level. For example, a usable interval math library requires more operations than listed above, and additional operations might not be easy to implement without them also becoming instructions. Another issue is scheduling: to do a few low or high limits in a row, parts of different interval instructions would need to be scheduled together, which requires even more optimizations in the runtime or a new execution model.

penzn avatar Mar 25 '21 16:03 penzn

@penzn Yeah my wording wasn't good. I meant that you mentioned the control word architecture, which in my opinion is a terrible design. The bottom line is that a modest percentage increase in native Intel instruction count due to changing the rounding mode results in a multifold explosion in execution time due to the pipeline stalls. As you clearly understand, the only partial mitigation for that would be to group instructions by rounding mode to the extent possible.

As much as I'd be thrilled to see complete interval support in WASM, it's unnecessary. The nVidia building blocks would be enough for practicality. (I wonder how Apple M1 does it, for that matter.) I understand that there's no GPU support yet, but I would expect that to be in WASM's future evolution. Why do you think that interval support would need to be more high-level? Even transcendental functions (e.g. log) can be straightforwardly computed from basic interval arithmetic.

As to the Future Features proposal, it would be a great help toward efficient interval implementation. I am, however, opposed to anything that would set result flags (e.g. zero, infinity, NaN, etc.) unless that step can be bypassed somehow, such as by writing the flags result to a write-only trash register. The reason is that it adds major overhead to evaluate those flags, even on a hardware level. I suspect that modern FPU pipelines know when they can skip this step, such as when 2 floating-point instructions occur in a row, eliminating the need to set flag bits based on the results of the first one.

KloudKoder avatar Mar 29 '21 01:03 KloudKoder

What do you think is the minimal set of interval operations which would be implementable and beneficial on CPU as well?

penzn avatar Mar 30 '21 23:03 penzn

@penzn I can hardly think of a more relevant question for this thread.

First of all, there are really 2 basic architecture choices:

  1. Look like Intel. In other words, have a "set rounding mode" instruction. ("Get rounding mode" isn't necessary because competent software should know what it already set. But no harm if people want that.)

  2. Look like nVidia. In other words, have 4 forms of each instruction: (a) round nearest-or-even (which is what we already have in WASM), (b) chop (inwards toward zero), (c) round toward negative infinity, and (d) round toward positive infinity.

I would gravitate strongly toward option #1 because (a) it will be simpler to optimize for Intel (even though theoretical performance will be identical in both cases) and (b) I don't think anyone wants a massive zoo of similar animals with slight differences in their stripes.

So assuming that we go with this option, then I would strongly caution against the use of any "set floating-point environment" (SFPE) instruction which simultaneously affects exception masking. This will burden rounding mode changes with lots of other logic. Separate these functions across different instructions. (SFPE instructions are OK for the sake of compatibility, but they should be clearly marked as nonperformant.) Critically, neither SFPE nor set-rounding-mode should accept variable inputs. (Imagine having to ask "What's the rounding mode?" prior to every instruction so you know what to execute on nVidia!)

Then the question becomes: what is the minimal set of instructions to be worth discussing here? Probably:

  1. Set round nearest-or-even.
  2. Set chop (round inward).
  3. Set round negative infinity.
  4. Set round positive infinity.
  5. Load ULP.
  6. Nudge inward.
  7. Nudge outward.
  8. Get sign bit (slightly different than "get sign").
  9. Xor sign bits, output to integer or flag (or maybe just "branch if sign bits (don't) differ").
  10. Copy sign bit to another floating-point value.

Just for sake of consistency, we might want the first 4 opcodes above to be assigned in a manner consistent with Intel's RC bits in the FPU control word.

#5 takes a floating-point (FP) value, then returns the smallest-magnitude FP value which, when added to the input with chop rounding, would change its value. (Implicitly, ULP(X) has the same sign as X.) If the input was either sign of zero, then the output shall be FLOAT_MIN (the smallest denormal) of the same sign. If the input was either sign of infinity, then the output shall be unchanged. ULP comes up all over the place in interval arithmetic, so there's a fundamental justification for this instruction. For example, it would be helpful in case one needs to multiply ULP by another FP value in order to compute a compound error term, as would occur when efficiently computing (Y log X) via (truncated) Taylor series. It's also used all the time in order to know when to stop computing such a series. (A C sketch of #5 and #8-#10 appears at the end of this comment.)

#6 Decrements the mantissa (without changing the sign). Decrements the exponent as well if the mantissa wraps. May result in a denormal value. If the input was either sign of FLOAT_MIN, then the output shall be zero of the same sign. If the input was either sign of zero or infinity, then the result shall be unchanged.

#7 Increments the mantissa (without changing the sign). Increments the exponent as well if the mantissa wraps. If the input was either sign of zero, then the result shall be FLOAT_MIN of the same sign. If the input was either sign of infinity, then the result shall be the same. If the input was either sign of FLOAT_MAX, then the result shall be infinity of the same sign.

#8 Returns the sign bit, which is important if there's a semantic difference between negative and positive zero, which would not be captured by a traditional "get sign of FP operand" instruction. Note that this deals with the sign bit, which in the aforementioned case actually differs from the (arithmetic) sign.

#9 is important for interval division. For example, the reciprocal of [2, 3] is [(1/3), (1/2)] (at least (1/3) and at most (1/2)). But the reciprocal of [-2, 3] is [(1/3), (-1/2)], meaning at most (-1/2) OR at least (1/3). (Inverted limits imply an open interval, wherein a finite segment is missing from the real line.) The reason is that [-2, 3] contains zero. Moreover, it contains points both to the left and to the right of zero, so the reciprocal approaches infinity of either sign. We need to know if the signs are opposite so we can determine whether or not to exchange the limits. Without this instruction, it's lots of branching with sign extraction just for the basic arithmetic function of division.

#10 This comes up all the time, although good examples escape me at the moment. Without this instruction, it's a lot of pipeline stalls.

I don't mind if the above instructions are undefined for NaN operands, or if they behave as though the NaN is a normal FP value. No point in accelerating broken math. Of course, any undefined operation will potentially leak information about the underlying hardware, and will defeat the possibility of result verification through comparison.
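To pin down the intended semantics of #5 and #8-#10, a C sketch (function names invented here; NaN handling deliberately loose, per the previous paragraph):

```c
#include <math.h>

/* #5 Load ULP: smallest same-sign value that, added under chop rounding,
   changes x. Zeros yield the smallest denormal of the same sign;
   infinities pass through. (At +/-FLOAT_MAX this returns infinity --
   a corner the spec would need to pin down.) */
static double load_ulp(double x) {
    if (x == 0.0)  return copysign(0x1p-1074, x);
    if (isinf(x))  return x;
    return nextafter(x, copysign(INFINITY, x)) - x;
}

/* #8 Get sign bit: distinguishes -0.0 from +0.0, unlike (x < 0). */
static int sign_bit(double x) { return signbit(x) != 0; }

/* #9 Xor sign bits: drives the limit swap in interval division. */
static int sign_bits_differ(double a, double b) {
    return signbit(a) != signbit(b);
}

/* #10 Copy sign bit: C's copysign is exactly this. */
static double copy_sign_bit(double x, double s) { return copysign(x, s); }
```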

KloudKoder avatar Mar 31 '21 01:03 KloudKoder