
Floating-point rounding mode control prototyping

Open KloudKoder opened this issue 2 years ago • 57 comments

This is the issue for testing out approaches to floating-point rounding mode control, which is required for performant interval arithmetic. The original informal discussion is found here.

For an understanding of the practical ramifications of interval arithmetic including some use cases, I suggest this YouTube presentation by G. Bill Walster.

For starters, Thomas Lively identified the Boost library, boost::interval. An example of its rounding control approach is here.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test. Conrad Watt has suggested a switch-based approach (for testing, not actual implementation) for every floating-point instruction, wherein the subject of the switch is a rounding control (RC) global variable (nearest-or-even, round toward negative infinity, round toward positive infinity, or chop toward zero). His hypothesis is that the WASM compiler will eliminate many of the switch statements because the rounding mode (RM) at any given point is usually known at compile time. This might at least allow us to do some crude benchmarking.
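
For a concrete picture, here is a minimal C-level sketch of that switch-based emulation, with <fenv.h> rounding control standing in for whatever an engine would actually emit; the enum and function names are placeholders, not part of any proposal:

```c
#include <fenv.h>

/* Virtual rounding-control (RC) variable that every lowered float op
   dispatches on. Strict C would also need #pragma STDC FENV_ACCESS ON. */
enum rc_mode { RC_NEAREST, RC_NEG, RC_POS, RC_CHOP };
static enum rc_mode rc = RC_NEAREST;

/* One emulated instruction, e.g. f64.add, under the current virtual RC. */
static double f64_add_rc(double a, double b) {
    double r;
    switch (rc) {
    case RC_NEG:  fesetround(FE_DOWNWARD);   r = a + b; fesetround(FE_TONEAREST); break;
    case RC_POS:  fesetround(FE_UPWARD);     r = a + b; fesetround(FE_TONEAREST); break;
    case RC_CHOP: fesetround(FE_TOWARDZERO); r = a + b; fesetround(FE_TONEAREST); break;
    default:      r = a + b; break;          /* nearest-or-even: hardware default */
    }
    return r;
}
```

Wherever rc is known at the point of use, the switch folds away, which is the compile-time elimination the hypothesis relies on.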

A more painful but performant approach would be to have the same virtualized RC register, but to reorder instructions where possible in order to minimize the frequency of touching the actual hardware RC. This would also compile seamlessly in architectures in which the RM is embedded in the floating-point instruction opcode itself (in which case the reordering would be redundant).

At this point, we're only concerned with RISC, but as Deepti Gandluri pointed out, any mature implementation would need to address the same issues with SIMD, wherein each parallel operand adheres to a common but custom RM.

@conrad-watt @tlively @jakobkummerow @dschuff @dtig

KloudKoder avatar Aug 16 '22 17:08 KloudKoder

Since the actual proposal is to add more control of rounding modes to Wasm, it would probably make sense to broaden the scope from interval arithmetic to any use of different rounding modes. Casting a wider net will help find more motivating use cases.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test.

I'm not as concerned with trying to do more performance tests at this point, since you convincingly explained how slow emulating different rounding modes would be. I'm more concerned right now about having concrete motivating use cases. Ideally we would find someone with something (an application, a library, a language, etc) they would like to port to Wasm for whom the lack of rounding mode control is a significant hurdle. Even with a perfect design and convincing benchmarks, it would be hard to motivate engine and tool implementers to work on this (or anything else) without such a concrete use case.

tlively avatar Aug 16 '22 22:08 tlively

I hear that.

For now, some history of the Boost interval package going back to 1999 or so (beware, not HTTPS):

http://www.mscs.mu.edu/~georgec/IFAQ/maurer1.html

"Originally, I only intended the package to assess rounding errors in lengthy computations. Since then, I've learned that there are other application domains where interval arithmetic is useful" <- And with this quaint understatement, the analog security era was born!

The point being simply that this is hardly a new innovation (not that anyone here seems to think that). Note that G. Bill Walster was deeply involved. More to come by way of current examples.

KloudKoder avatar Aug 17 '22 01:08 KloudKoder

To address your specific comment about broadening the scope, I certainly don't oppose that. I just don't personally know of any practical uses for RM tweaking apart from interval arithmetic. The one sort of exception would be integer rounding. On X86, for one: "The FRNDINT instruction returns a floating-point value that is the integral value closest to the source value in the direction of the rounding mode specified in the RC field". Not sure how float-to-int is implemented in WASM, but in theory it would be amenable to RC flavors just like any other floating-point instruction, while having this very different effect (nudging by up to 1.0 as opposed to one unit in the last place (ULP)).
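
As a rough illustration of that effect, C's rint (which, like FRNDINT, rounds to an integral value in the current rounding direction) gives:

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile double x = 2.5;  /* volatile: keep the compiler from constant-folding */
    fesetround(FE_TONEAREST);  printf("%g\n", rint(x));  /* 2 (ties to even) */
    fesetround(FE_DOWNWARD);   printf("%g\n", rint(x));  /* 2 */
    fesetround(FE_UPWARD);     printf("%g\n", rint(x));  /* 3 */
    fesetround(FE_TOWARDZERO); printf("%g\n", rint(x));  /* 2 */
    return 0;
}
```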

KloudKoder avatar Aug 17 '22 02:08 KloudKoder

TL;DR: Some algorithms are polynomially slower without interval arithmetic; others simply don't work.

Alright, I promised to list some practical examples. On the one hand, you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM because, like, what purpose would that serve without going through the entire process that we're literally going through here and now? From the public discourse I've read on various forums, it's clear that people just assume they need a specialized library to run intervals in a performant manner on a particular piece of hardware. Pending any change to WASM, that seems to be just how it is.

Therefore, in the absence of any published rant about this, I'll attempt to provide the next best thing, which is a list of compelling applications that I've gathered from public sources. Here goes:

  • Adaptive exploration. I've touched on this previously, but its significance for performance can hardly be overstated. We're not talking hundreds of percent to be gained here (as would potentially be the case with an efficient implementation of rounding flavors). We're talking about changes to the computational complexity of an algorithm. And why? Because adaptive exploration means that you can say "Search around this space until you know that the global optimum is at least X and at most Y" instead of "Search this space exhaustively and return whatever optimum you find". Think about what that implies: you can quit early if you learn that the minimum "around here" is too big to be the global minimum (or maximum). This can result in actual polynomial acceleration of a variety of exploratory algorithms (and especially recursive ones). For example, you can write a raytracer that says "Trace this light ray back from this pixel to all its reflections in the scene until you know the color to within the limits of the color encoding used on the display". That's revolutionary and extremely fast compared to the traditional and rather arbitrary "Trace this ray back through the first N reflections".

  • Cognitive safety. Imagine you have a neural network which is classifying objects in front of a moving car. Suddenly, an object appears which seems to be a man crossing the street. Do you slam on the brakes, risking injury to the driver? Or do you ignore the classification as a false positive, and risk injuring the man? For safety's sake, you need to know the interval on which the classification rests. At minimum, a wide confidence interval should be fed back to the manufacturer in order to alert it to the poor performance of the classifier in this situation.

  • Physical performance guarantees. You're designing an engine that needs to be at least 21% energy efficient. You run millions of engine simulations in parallel, using genetic optimization to discover the best combination of hyperparameters. Finally, you discover one that's actually 23% efficient. But how do you know that's not just numerical error? Unless you're using intervals, you don't.

  • Nonlinear equation solvers. Suppose you have a set of differential equations, such as Navier-Stokes for fluid dynamics. Most of these systems, such as flames dancing on a candle, end up being subject to the butterfly effect. However, chaos doesn't evolve instantly, so there's a finite time window during which one can actually predict the evolution of the system with an acceptable degree of uncertainty. Without interval arithmetic, you won't have any idea about the extent of that window. But with its help, the future state of the system will become increasingly blurry, such that you can literally see the butterfly effect taking hold at each subsequent timestep.

  • Matrix operations. Complicated operations on large matrices can cause numerical errors to compound significantly, for example when determining an eigenvalue or a determinant. Without interval arithmetic, you don't know what the error is, which could potentially be huge in the case that large matrix elements cancel but for some tiny residue. Also, even ambiguity in the sign of a determinant which is close to zero can create havoc in predicate-based exploration ("If negative, move to this halfspace, else move to that halfspace...").

  • Path planning. This comes up in video games, among other applications. You have an object, say a spaceship, that needs to fly through a 3D maze. But it needs to do so without overlapping the solid walls of the maze. The spaceship is a complicated object but it can be bounded by a 3D interval for the sake of expeditious collision detection. Take away interval arithmetic, and you don't know if an overlap actually occurs or not, which could cause the code to take the wrong path relative to what the user expects to occur, for example causing the spaceship to explode even though it wasn't moving fast enough to hit the wall.

  • Root searches. One famous example is the Riemann Hypothesis, which states that the nontrivial zeroes of a particular complex function all have real part 0.5. While this can't be proven numerically (because there are an infinite number of them), searches for counterexamples are possible (for example, Zeta Grid). One can start anywhere in the relevant domain, converge on a root, and in theory demonstrate that the real part is something other than 0.5. (There is literally a $1M prize on the line.) But ordinary floats can't give you this. You need intervals to prove that the real part of your root is, say, between 0.6 and 0.7.

  • Deterministic convex optimization. Convex optimization is the process of finding the "bottom of the bowl", so to speak, in N dimensions. Conceptually, it's straightforward because convex functions have, by definition, just one global minimum (with some trivial exception cases). So you can chase the derivative using Newton's method and end up there pretty quickly. But in reality, because of accumulating numerical error and various quantization effects, it could happen that your estimate of the global minimum (or maximum) suddenly starts getting less and less accurate as you continue attempts to refine it. But you don't know that unless you're using interval arithmetic. With its help, you can watch as the interval on which the minimum lives shrinks with each iteration, until you hit a timestep at which it starts to expand, so you need to stop the algorithm. Otherwise it's just "iterate N times and hope for the best".

Is that helpful?

KloudKoder avatar Aug 21 '22 23:08 KloudKoder

Is that helpful?

It's a list of situations where interval arithmetic may be useful. Nobody argued that interval arithmetic is never useful. (I think some of the specific entries on this list are quite debatable, but that's beside the point.)

What folks were asking for is a concrete example of an application/library/etc that would like to adopt Wasm but currently can't because of lack of control over rounding modes.

To state it very clearly, there are two issues: (1) People across the ecosystem are busy. There are many possible things to work on (just look at the list of in-flight proposals!). In order to convince anyone to work on this, some evidence is needed why this is impactful. "I think it could be useful for X, Y, Z" does not qualify as sufficient evidence, neither does "I think this will be useful once Wasm runs on GPUs". (2) Assuming someone (maybe yourself?) starts working on this, to make a proposal advance through the phases, feedback will be required: is it working correctly? Is it as helpful as expected? Is it performing well enough? If all you have is a proposed specification and a prototype implementation, you can't really answer these questions. There'll have to be a project that says "we previously couldn't adopt Wasm, but now we can thanks to this" or "our previous Wasm port was slow, but adopting this feature yielded great improvements", or similar. (Of course, it could also happen that the feedback is "we thought this would unblock us, but then we experimented with the prototype and it turned out that it doesn't actually solve our problem", which would be very useful feedback as well, and could cause either redesign or abandoning of the proposal.)

To reiterate: nobody has said that we should never add support for control over rounding modes to Wasm; on the contrary, people were generally supportive of this feature. But to actually start working on it, there needs to be evidence of concrete usefulness. And to decide what the details should look like, there needs to be a collaboration with actual users of the feature.

If your statement:

you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM

is another way of saying "yeah, admittedly, nobody is asking for this", then realistically that means that nobody is going to work on it either, unless and until someone does ask for it.

jakobkummerow avatar Aug 22 '22 10:08 jakobkummerow

I have a few progress updates on this. Basically I've been trying to contact the team leaders or authors of various interval applications across a variety of languages. Based on the feedback I've received, the common themes are as follows:

  • Some have made it clear that they won't even discuss this with the WASM CG because they see it as hopeless, based on similar discussions in the past. For example, there was apparently a proposal long ago to add rounding mode support to C++, but it got into the same chicken-and-egg problem of no developers because of no compiler support and no compiler support because of no developers.

  • However, this position isn't universal. I've had a number of papers pushed my way with recommendations of "just show them this and they'll see why this support is so important". When I've explained that it's actually examples of uncompilable/nonperformant code that the team wants, the developers I contacted don't seem to want to discuss their designs. I guess I shouldn't be surprised at that.

  • 2 teams in particular ended up having to abandon floating-point entirely due to various limitations. One of them ended up writing critical analog logic in custom fixed-point code, and the other ended up just using integer intervals instead (which only works efficiently for a subset of interval applications). Needless to say, neither deployed a WASM application using floating-point interval support.

For my own part, there are a few interval-dependent projects that I've wanted to deploy in WASM for a long time, but of course I realize that the point here is to surface third parties with a similar problem. (I haven't complained about this on any forum, so my own problems in this regard are no more searchable than anyone else's.)

It's a tough slog but I haven't given up.

KloudKoder avatar Sep 06 '22 01:09 KloudKoder

I'd like to chime in to support this proposal. I'm a collaborator on inari, a Rust interval arithmetic library. We would definitely like to have this library work on WASM. I use inari to do computer-assisted proofs and it would be great to be able to make some of them available to colleagues and students as web apps. I am not the only one in need of this: someone ported inari to WASM but, due to the lack of rounding primitives, the author of that port chose to forgo the inclusion guarantees that are essential to any valid interval arithmetic library. 😬

The author of inari agrees to port the library to WASM once the required primitives are in core::arch::wasm32, and I agree to perform some testing. That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives.

I hope that directed rounding primitives will materialize in WASM. That will be good for the web, but also for sending executables to colleagues who don't program, so they can play with them safely whatever their operating system (executing them with Wasmtime).

Chris00 avatar Oct 07 '22 11:10 Chris00

Thanks for that info @Chris00.

I support the addition of a rounding immediate to all floating point bytecodes that do rounding. It turns out to also be useful in situations short of full-blown interval arithmetic, e.g. comparing a floating point number to an integer that cannot be represented exactly: if we compare f < i, then we want to round i towards positive infinity before doing a float comparison against f, and if we have i < f, we want to round i towards negative infinity before the comparison.
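
A small C sketch of that failure mode and the directed-conversion fix (illustrative values; it assumes the platform's int64-to-double conversion honors the dynamic rounding mode, as SSE2's cvtsi2sd does):

```c
#include <fenv.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    volatile int64_t i = (INT64_C(1) << 53) + 1;  /* not exactly representable as f64 */
    volatile double  f = 9007199254740992.0;      /* 2^53, so f < i is true */

    printf("%d\n", f < (double)i);  /* 0: nearest-even rounds i down to 2^53 -- wrong */

    fesetround(FE_UPWARD);
    double i_up = (double)i;        /* 2^53 + 2 */
    fesetround(FE_TONEAREST);
    printf("%d\n", f < i_up);       /* 1: rounding i toward +infinity preserves f < i */
    return 0;
}
```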

Unfortunately because of the binary encoding, that means encoding rounding modes in immediates requires new, prefixed bytecodes. (in hindsight, I wish we had reserved a 0 byte for floating point opcodes).

titzer avatar Oct 07 '22 13:10 titzer

@Chris00 "That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives." It's possible to port Inari to WASM even without Rust support (although Rust would be the next obvious target for the inclusion of rounding primitives). You could do this by manually coding a few lines of WASM instructions in order to obtain the desired behavior (once primitives are supported), then calling the resulting WASM module from Rust as needed with some sort of wrapper that works on the library or LLVM level. Then at some point if Rust agrees to nativize support, you can remove the wrapper. Granted, this approach is bit awkward, but you could have your library available on the Web before the Rust team even decides what they want to do. Right?

@titzer Yes indeed. Totally forgot about that. It happens if, say, your 64-bit integer won't fit into a 64-bit float because the 53-bit mantissa is too small to hold it. (This didn't happen on the 8087 math coprocessor from 1982 because it converted everything internally from 64-bit to 80-bit precision, unless you manually downshifted to 64-bit precision through the PC bits. But in SSE2, which I assume is how WASM implements its floating-point, it's really truly just 64 bits, so you have the problem. Ditto on any other modern architecture I'm aware of.) At that point, the comparison result is sort of random because you can't control the rounding policy. Speaking of which, if this feature request gets implemented, then we should look at the existing integer-to-float and float-to-integer functionality in WASM and ensure that it, too, ends up with the same 4 modes as all the other floating-point instructions.

KloudKoder avatar Oct 08 '22 20:10 KloudKoder

Anyone still not satisfied with the use cases on the table?

@conrad-watt Can we get something off the ground here so that the Inari team can prototype their library, probably along the lines of the switch statement macro you proposed? Let me know what you need.

KloudKoder avatar Oct 11 '22 22:10 KloudKoder

I think we have enough motivation here to spin up a phase 1 proposal for per-instruction rounding mode control. That's just a matter of writing down the proposed instructions and putting a presentation and phase 1 poll on a CG meeting agenda. The tricky part will be after that when you have to find folks (possibly including yourself) with the time and motivation to work on implementations.

tlively avatar Oct 12 '22 02:10 tlively

@tlively Sorry for the slow reply. If you have any link to a previous phase 1 proposal that you think is well written, please share it. I'll try to make mine conform to the same format so as to avoid confusion. Once it's done I'll schedule a review and poll at a future CG meeting.

I'm willing to help on the implementation (with the caveat that I'm hardly a WASM expert).

KloudKoder avatar Oct 14 '22 22:10 KloudKoder

There's no specific format you need to match, and the exact content will depend on the proposal. Here's an example of a phase 1 proposal presentation for a change to validation: https://docs.google.com/presentation/d/1-ajjGZpjAiGYOJlwswij9Mq6YGltmuELg3tbhu30VrI. That presentation was given at this meeting: https://github.com/WebAssembly/meetings/blob/main/main/2020/CG-09-29.md.

Basically you want to cover 1) the motivation(s) for the proposal, 2) an initial design for what would be implemented, and 3) an explanation of how it would be used and by whom.

tlively avatar Oct 17 '22 15:10 tlively

@Chris00 @tlively (and all concerned). Working on this. I'm trying to get the required instructions down to as little complexity as possible. The 4 standard RMs are a must, but there also needs to be sign inspection (mainly for multiplication and division of intervals). I notice that WASM already has f32/64.copysign, but when I click on the link on this page, it redirects to a page about float instructions which doesn't discuss copysign itself. (The whole WASM site seems to have this weird CSS setting where you can't search for text within a page because the hits are not highlighted, making it very difficult to search for specific definitions.)

In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

The other issue concerns -0.0 vs +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world, they're identical because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.

One small piece of good news is that I now suspect we can avoid (1) any sort of "Get ULP" instruction and (2) the need for outward rounding (antichop). This is because you need to inspect the signs of the bounds of each interval that you're multiplying before doing so, and therefore you already know how to round each of the resultant bounds in light of their signs.
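
To make the sign-inspection point concrete, here is a rough C sketch of directed-rounding interval multiplication (not inari's or Boost's actual code; it ignores NaN and empty intervals and assumes fesetround really reaches the multiplies):

```c
#include <fenv.h>
#include <math.h>

typedef struct { double lo, hi; } interval;

static interval imul(interval a, interval b) {
    interval r;
    if (a.lo >= 0.0 && b.lo >= 0.0) {
        /* Both intervals nonnegative: each result bound is known in advance,
           so it takes exactly one multiply under one directed mode. */
        fesetround(FE_DOWNWARD); r.lo = a.lo * b.lo;
        fesetround(FE_UPWARD);   r.hi = a.hi * b.hi;
    } else {
        /* Other sign combinations: fall back to min/max of the four candidate
           products, each set computed under the appropriate directed mode. */
        fesetround(FE_DOWNWARD);
        r.lo = fmin(fmin(a.lo * b.lo, a.lo * b.hi),
                    fmin(a.hi * b.lo, a.hi * b.hi));
        fesetround(FE_UPWARD);
        r.hi = fmax(fmax(a.lo * b.lo, a.lo * b.hi),
                    fmax(a.hi * b.lo, a.hi * b.hi));
    }
    fesetround(FE_TONEAREST);
    return r;
}
```

The first branch shows the payoff: once the bound signs are known, each resultant bound needs only one multiply in one known direction, which is why no outward-rounding or ULP-nudging instruction appears necessary.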

Any thoughts on this are welcome. Barring that, I'll just partition my proposal into "necessary" vs. "nice-to-have".

KloudKoder avatar Oct 30 '22 01:10 KloudKoder

In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

Another option is to copysign the sign bit onto a 1.0 value, and then compare that. That said, I wouldn't necessarily object to a "read the sign bit into an integer" instruction. I'd guess that comparing as less-than-zero using the lt instruction suffices for the "arithmetic sign".

The other issue concerns -0.0 vs +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world, they're identical because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.

An efficient way to unify zeros is to add 0.0 to a number, since -0.0 + 0.0 evaluates to 0.0.
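
A tiny demonstration (with one caveat: IEEE 754 specifies the +0.0 result under every rounding direction except roundTowardNegative, where an exact-zero sum is -0.0, so the trick assumes the default mode):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double z = -0.0;
    printf("%d\n", signbit(z) != 0);        /* 1: sign bit set */
    printf("%d\n", signbit(z + 0.0) != 0);  /* 0: -0.0 + +0.0 == +0.0 */
    return 0;
}
```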

That said, I'm not very familiar with interval arithmetic, but this approach to interpreting negative zero surprises me. In the reciprocal function f(x) = 1/x, the limit as x approaches zero isn't just positive infinity. It's positive infinity or negative infinity, depending on which direction you approach zero from. I would have guessed that interval arithmetic would want to represent a conservative bound, which would mean you wouldn't want to ignore the possibility that a value with an interval at zero might represent a zero approached from the negative side, or even just a finite negative value which got rounded to zero.

sunfishcode avatar Oct 31 '22 14:10 sunfishcode

@sunfishcode You raised some important considerations.

First of all, at least in AMD64, you can't just do comparisons on floats. There needs to be a signal propagation to the RFLAGS register (in the integer unit) in order to conduct a subsequent branch. There are lots of ways to do this, but the point is that it takes a number of instructions and, unfortunately, some heavy pipeline synchronization. (This is hidden to some extent by branch prediction, but it's still costly.) That's why I think we're stuck with setting an integer based on the result of the float comparison. Unless I'm mistaken, WASM doesn't have any equivalent of RFLAGS, so producing an integer result is the next most efficient approach:

"Get Logical Sign" means: "Copy sign bit from float into integer, and clear all but the LSB of the integer".

"Get Arithmetic Sign" means: "If float is -0.0 or +0.0, then set integer to 0. Otherwise, if float is negative (anything, even NaN), then set integer to -1. Otherwise set integer to 1."

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows then I'll have to dig in.)

As to the 1/x limits thing, yeah, it's a serious problem. (Namely, math doesn't have a notion of -0.0, but IEEE754 does because it's used idiomatically as (-epsilon).) The fundamental question is whether or not you want to retain sign information. In other words, should the reciprocals of (-0.0, -0.0), (-0.0, +0.0), and (+0.0, +0.0) all be one and the same, or unique? We could just say "It's whatever the RM says it is under IEEE754". And we'll have to. But does that then leave the library providers with cases that they can't efficiently handle? If so, then we may need another instruction. (I'm leaning towards that being unnecessary, but I'm not super confident that I haven't overlooked something.)

I guess one way to resolve this impasse is to say that if you do want those intervals to be handled as one and the same, then you're going to have to first zero-unify the limits by adding +0.0 to both of them.

What do you think?

Ping @Chris00

KloudKoder avatar Oct 31 '22 21:10 KloudKoder

"Get Logical Sign" means: "Copy sign bit from float into integer, and clear all but the LSB of the integer".

f64.copysign(1.0, x) < 0 implements this. But I wouldn't necessarily be opposed to adding a dedicated opcode for this if there's a performance case for it.

"Get Arithmetic Sign" means: "If float is -0.0 or +0.0, then set integer to 0. Otherwise, if float is negative (anything, even NaN), then set integer to -1. Otherwise set integer to 1."

Ah, ok. This looks a little more involved, but it probably also comes down to how much a dedicated opcode would help performance in realistic use cases.

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows then I'll have to dig in.)

It is standard in IEEE 754, and it's in the testsuite.

Also, I figured out the error in my question about negative zeros. If there's a non-zero negative value which rounds up to -0, interval arithmetic would already have a non-zero lower bound that is less than it. So whenever you'd have a -0 lower bound, it really is the same as +0. There still are some tricky situations with approaching zero from the negative direction, but I expect these are just part of the trickiness that comes with treating "infinity" like a number.

sunfishcode avatar Nov 01 '22 19:11 sunfishcode

First of all, I guess you've really found a way to avoid having "Unify Zeroes". Kudos!

Secondly, what about your f64.copysign(1.0, x)? According to the list of instructions, wouldn't that only work if x were a float? I see only one definition for f32/64.copysign, which is "[f32/64 f32/64] -> [f32/64]" (which really doesn't make sense because it has 2 inputs). But when I put "copysign" in the search dialog I get a whole bunch of nonspecific links to nothing in particular about copysign. Someone will hopefully make the website easier to navigate at some point, but for now I guess I'm stuck unless I can dig out a more navigable spec somewhere (or even a lousy PDF). Anyways, let's sync this issue first so we understand what value a new opcode would or would not offer.

Third, back to -0.0. Consider this:

(-1.0, -0.9)/(+infinity) = (-0.0, -0.0)
(-1.0, +0.9)/(+infinity) = (-0.0, +0.0)

But as you effectively pointed out, both of these intervals are just points, namely zeroes. This is literally correct interval arithmetic even after rounding (and IEEE754 behaves as such, I believe). The problem is that now your code thinks the outputs are different because, hey look, the bits don't match (even if you f32/64.reinterpret)! So you branch off the wrong way and it's all downhill from there. However, I now think the solution is simply to use your +0.0 addition trick whenever and wherever it matters; we don't need any special instruction to fix this, even if the code has to worry about it all the time.

Unless maybe @unageek (or anyone else lurking) has some problem with it.

KloudKoder avatar Nov 01 '22 23:11 KloudKoder

The semantics of copysign are given here: https://webassembly.github.io/spec/core/exec/numerics.html#op-fcopysign

copysign_n(z1, z2)

  • if z1 and z2 have the same sign, then return z1
  • else return z1 with negated sign.

tlively avatar Nov 01 '22 23:11 tlively

It's also worthwhile noting that fN.copysign is not typically a hardware instruction; at least not on any of the ISAs that V8 supports. It's done with a couple of instructions of bitwise logic.

titzer avatar Nov 02 '22 14:11 titzer

@tlively @titzer Thanks. Definitely copysign isn't going to fit the bill, unfortunately. It's necessary to branch based on the state of signs prior to interval multiplication, for instance. Because this is going to be done as often as interval multiplication itself, this would support the creation of the aforementioned pair of "get sign" type of instructions.

I'll be creating a proposal for CG review soon. Meanwhile, anyone with further concerns is invited to speak up.

KloudKoder avatar Nov 03 '22 13:11 KloudKoder

I've been reading through the WASM instruction set reference (thanks @tlively) so as to try and not mess this up. Here's what I'm thinking:

There are a few ways to do this, but ultimately the simplest is probably with "metaops", i.e. instructions that tell the compiler how to compile. (This could be used for lots of other stuff in the future like branch hints, cache proximity hints, reentrant compilation, atomic transactions, secure enclave implementation, etc.) We might as well get started with a generic approach that could be repurposed for all those other potential uses down the road.

Rather than have a single opcode for all metaops, which would just result in another level of indirection, we can assign one for each specific purpose. By definition, metaops don't do anything visible in and of themselves; they have side effects which may manifest on subsequent instruction boundaries, or only in the analog world, e.g. power consumption or latency effects.

In this case, we need 4 opcodes: round nearest-or-even (implied in effect for backward compatibility), round negativeward, round positiveward, and chop. In the interest of simplicity, it doesn't matter if you use these as prefixes (i.e. immediately prior to a float instruction) or not. You could even be silly and put lots of them in a long chain after a return instruction, if you like. (Why tax the validator with such innocuous concerns?) In any case, their effect should persist until either a different mode is selected, or one of the following instructions is encountered: br, br_if, br_table, return, call, or call_indirect. Those instructions are guaranteed to reset back to nearest-or-even because we don't want to deal with weird cases where a branch target ends up with a superposition of RMs, or a call instruction ends up bouncing off the OS and suffers some resulting reset of the float pipes (can that even happen?). Ultimately, as I've said before, the optimizer should rearrange instructions so as to minimize the number of RM control register touches (but this is not required).

And BTW the RMs affect not only arithmetic instructions but also the sort of conversions mentioned above, wherein the target lacks sufficient precision to fully represent the source. It's even possible that one RM will produce a finite approximation while another RM will generate an infinity in response to the same input.

There are also the aforementioned sign extraction instructions, but those are straightforward.

Thoughts?

KloudKoder avatar Nov 07 '22 03:11 KloudKoder

@KloudKoder, the direction sounds good! As far as terminology goes, I would describe these new instructions as normal instructions that modify new rounding mode state made explicit in the store rather than as "metaops." We want to avoid the impression that the behavior of these instructions is meant to be outside the formally specified semantics.

For features that actually are outside the specified semantics, there is a separate effort being led in the branch hinting proposal to develop a framework for code annotations. I want to emphasize that rounding mode control should not be using that framework because it does affect semantics.

I believe @titzer previously expressed a preference for adding individual new arithmetic and conversion instructions with explicit rounding behavior rather than introducing new rounding mode state and instructions to modify it and I share that preference. At least for optimizing engine tiers, I wouldn't expect this to have a performance difference, especially since you envision the rounding mode being reset at the end of blocks anyway. It would be great if you could comment on the trade offs you see between these two designs as part of your presentation.

tlively avatar Nov 07 '22 15:11 tlively

@tlively It sounds like you're advocating for 2 different schemes. (Sorry if I'm just misinterpreting your comment.) To clarify, do you actually want many "new arithmetic and conversion instructions with explicit rounding behavior" or 4 new instructions with "the rounding mode being reset at the end of blocks"? (One of the foregoing plus the 2 sign extractors, actually.)

Yep, I got your point about metaops, which is fine.

KloudKoder avatar Nov 08 '22 11:11 KloudKoder

It's an unfortunate encoding mistake, in retrospect, that we didn't reserve a 0 byte for floating point ops. But as it is, I think we could carve out a prefix page for floating point ops that include an explicit rounding mode, and then put (actually, repeat) all ops that use a rounding mode under that prefix, with the rounding mode explicit in them.

My preference against the implicit rounding mode state option is that it makes otherwise pure instructions dependent on that global state. I get it that most hardware works this way, but my impression is that hardware architects don't like that either. RISC-V actually has both: the rounding mode field is large enough to encode 4 rounding modes plus a dynamic rounding mode. AFAICT they did this because of the software dependency on managing the rounding mode state inherited from previous architectures.

titzer avatar Nov 08 '22 13:11 titzer

Strongly agree with what @titzer and others have said: an ambient mode state should be avoided, not least because it has terrible composability, i.e., gets in the way of transformations like inlining.

rossberg avatar Nov 09 '22 12:11 rossberg

@titzer @rossberg Fine by me, with the understanding that your validator now has to worry about the following:

  1. No control transfers allowed to the byte after the RM prefix. So you can't jump into the middle of "chop multiply" in order to just "multiply" for instance.
  2. No redundant prefixes, e.g. "nearest-or-even chop divide".
  3. No prefixes to nonfloat instructions, e.g. "round-positive return".
  4. No prefixes to float instructions with no RM sensitivity, e.g. "round-negative f32.min".
  5. Functions (and even just scopes) cannot terminate with prefixes.

Implicitly, prefixes have no lifetime beyond the prefixed instruction itself; it all reverts to nearest-or-even thereafter.

OK?

KloudKoder avatar Nov 10 '22 02:11 KloudKoder

@KloudKoder, technically, none of this affects validation, these are solely questions of binary format parsing, i.e., syntax. In particular, note that (1) isn't an issue at all, since branch targets are given by block labels, not random code positions.

Also, when @titzer says prefix, I think he doesn't mean mode prefixes, but the likes of group prefixes 0xfc, 0xfd, 0xfe, which we already use for all instructions related to GC or SIMD etc. Not sure if new float instructions would even need their own prefix, they could easily go under the existing 0xfc, which already hosts saturating truncations, for example.

In short, this is neither new nor a complication. :)

rossberg avatar Nov 10 '22 07:11 rossberg

@rossberg Thanks, I get what you mean. So in fact just 3 additional flavors of each RM-sensitive instruction, all probably living under the 0xFC opcode group, along with the 2 sign extractors wherever we can find the space. Alright, then, I'll compose a proposal and report back.

KloudKoder avatar Nov 10 '22 21:11 KloudKoder

Alright, I think I've identified every single instruction whose behavior would be altered by a change in RM. They're as follows:

91 f32.sqrt
92 f32.add
93 f32.sub
94 f32.mul
95 f32.div
9F f64.sqrt
A0 f64.add
A1 f64.sub
A2 f64.mul
A3 f64.div
B2 f32.convert_i32_s
B3 f32.convert_i32_u
B4 f32.convert_i64_s
B5 f32.convert_i64_u
B6 f32.demote_f64
B7 f64.convert_i32_s
B8 f64.convert_i32_u
B9 f64.convert_i64_s
BA f64.convert_i64_u
FD 5E f32x4.demote_f64x2_zero
FD E3 f32x4.sqrt
FD E4 f32x4.add
FD E5 f32x4.sub
FD E6 f32x4.mul
FD E7 f32x4.div
FD EF f64x2.sqrt
FD F0 f64x2.add
FD F1 f64x2.sub
FD F2 f64x2.mul
FD F3 f64x2.div
FD FA f32x4.convert_i32x4_s
FD FB f32x4.convert_i32x4_u
FD FE f64x2.convert_low_i32x4_s
FD FF f64x2.convert_low_i32x4_u

I couldn't find any documentation on the last 2, so I need to confirm that they're actually impacted by RM. (Are they taking 2 products of 2 i32s? Or concatenating them or what?) If I've neglected or wrongly included something, please speak up.

In keeping with the KISS principle, and considering that we're in no desperate shortage of root-level opcodes, I'd like to propose the creation of 3 additional prefixes (tentatively F9, FA, and FB), which serve as RM indicators for round-positive, round-negative, and chop, respectively. This would match the existing opcode order of the "ceil", "floor", "trunc", and "nearest" instructions provided in the spec (but FC would be skipped). This would, in turn, result in minimal complexity growth for upstream compilers. It would also minimize the pain of adding additional float or SIMD instructions in the future. The rules would be simple:

  1. If you're using nearest-or-even with any instruction in the list, then nothing changes.

  2. If you're using round-positive, round-negative, or chop; then use prefix opcodes F9, FA, or FB, respectively. Follow this by the equivalent nearest-or-even encoding, but skip the leading FD if it's present.

In the interest of simplicity, I'll post about sign extraction separately.

KloudKoder avatar Nov 20 '22 14:11 KloudKoder