
Why not return quotient and remainder in one instruction?

Open KloudKoder opened this issue 3 years ago • 24 comments

Perhaps this is due to instruction formatting constraints, but it seems silly not to provide both when computing one implies computing the other. I guess you could leave it to compiler optimization, but that seems painful. Intel's DIV and IDIV do this natively, although of course they only do it if the quotient fits in half the space available to the dividend.

Alternatively, I suppose you could provide performance guidance along the lines of: "please make sure there are no branch instructions or branch targets between the divide and remainder instructions (and have the divide first, maybe) so that the compiler can optimize them into a unified instruction under the hood".
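For what it's worth, here is a minimal C sketch of the pattern in question; the helper name is mine, and the fusion of the adjacent `/` and `%` into one hardware divide is a common compiler optimization rather than something guaranteed:

    #include <stdint.h>

    /* Hypothetical helper: compute quotient and remainder together.
       Keeping the two operations adjacent and branch-free gives the
       compiler its best chance to emit a single DIV under the hood. */
    typedef struct { uint64_t quot, rem; } divmod64;

    static divmod64 divmod_u64(uint64_t n, uint64_t d) {
        divmod64 out;
        out.quot = n / d;  /* quotient */
        out.rem  = n % d;  /* remainder from the same divide */
        return out;
    }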

KloudKoder avatar Oct 11 '20 02:10 KloudKoder

The ability to return multiple values from one instruction is a fairly recent development in WebAssembly, so it simply wasn't an option when the MVP arithmetic was designed. This would be fair game for a new proposal, though. It might be good to bundle it with other newly possible instructions like add-with-carry, etc.

tlively avatar Oct 12 '20 19:10 tlively

@tlively I'm pleased to hear that. I don't know why most languages don't seem to provide this. And I think add-with-carry is another must as well, now that you mention it.

KloudKoder avatar Oct 13 '20 23:10 KloudKoder

And I want a multiplication that provides the high and low as well, for many of the same applications.

wbl avatar Oct 23 '20 03:10 wbl

@wbl I second that. The high should be the true high including complete carry propagation, as opposed to an approximation.

KloudKoder avatar Oct 24 '20 01:10 KloudKoder

There are several issues that make me want to keep it simple. One is CPU support. A doubling-width multiply is very common, while variants that can add a register into the product and carry out the result are much rarer: UMAL is the only one I can think of. Secondly, the cost of properly propagating the carry is two add-with-carries, because adding into a product doesn't carry out, so it's not that bad to have three instructions in a row, which is what you'd do in assembler anyway.

The bigger problem is the add-with-carry instruction: not all CPUs have a carry flag, and it's frequently clobbered in uncomfortable ways. I've got to think hard about this one: ideally a compiler can do a good job without too much work, while making the usual usage simple and efficient.
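For reference, a portable C sketch (mine, not from the comment above) of the full 64x64 -> 128 product built from 32-bit halves, showing the carry propagation under discussion; all names are illustrative:

    #include <stdint.h>

    /* Exact 128-bit product of two 64-bit values, without a widening
       multiply instruction: four partial products plus carry folding. */
    static void mul_64x64_128(uint64_t a, uint64_t b,
                              uint64_t *hi, uint64_t *lo) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t p0 = a_lo * b_lo;   /* bits  0..63  */
        uint64_t p1 = a_lo * b_hi;   /* bits 32..95  */
        uint64_t p2 = a_hi * b_lo;   /* bits 32..95  */
        uint64_t p3 = a_hi * b_hi;   /* bits 64..127 */

        /* Middle column: cannot overflow 64 bits (at most 3 * (2^32 - 1)). */
        uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

        *lo = (mid << 32) | (uint32_t)p0;
        *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    }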

wbl avatar Oct 24 '20 06:10 wbl

@wbl Addition with carry (and subtraction with borrow) could be implemented efficiently on non-Intel CPUs by introducing a virtual carry bit and a set of add and subtract instructions which set it according to the result, so that it can be propagated along. On Intel, this register comes directly from the carry flag. On other architectures, it would need to come from nonbranching predicates which implement set-on-wrap (slightly different logic for add and subtract, but you get the idea). Also, saving away the carry flag on Intel is easy: SETC AL. (Please don't PUSHF.)

For the sake of architectural elegance, I guess it would make sense to have a whole bunch of bit registers, any one of which might be used for the carry input or output in any given instruction, so you could have "Add X to Y to C0 and produce sum Z with carry C1".
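A rough C sketch of that set-on-wrap idea, with the virtual carry bit held in an ordinary integer; the function name is hypothetical and nothing here is tied to a particular ISA:

    #include <stdint.h>

    /* Branch-free add-with-carry without a flags register: the carry out
       is recovered by testing whether either addition wrapped. */
    static uint64_t adc64(uint64_t x, uint64_t y,
                          uint64_t carry_in, uint64_t *carry_out) {
        uint64_t s = x + y;
        uint64_t c = (s < x);        /* wrapped: carry out of x + y */
        uint64_t r = s + carry_in;
        c |= (r < s);                /* wrapped: carry from adding carry_in */
        *carry_out = c;              /* always 0 or 1 */
        return r;
    }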

KloudKoder avatar Oct 25 '20 03:10 KloudKoder

I think this still matters.

I would also like to point out one corner case to consider, which is already relevant to the existing remainder computation in WASM: having zero as the dividend. (When the divisor is zero, the result is undefined. This makes sense (as long as the actual behavior is secure) because we shouldn't be wasting overhead checking for it. But I'm talking about the numerator.)

Some languages say that:

0 mod 5 = 5

while others say that:

0 mod 5 = 0

Both are true; it depends on exactly what you mean by "mod(ulo)". If you mean "the remainder when 0 is divided by 5", then it's the former. If you mean "0 expressed in base 5", then it's the latter.

I just bring this up because the spec indicates the former, but it's not clear to me that all supported silicon behaves identically in this regard.

Probably it's all fine, but someone might want to check.

KloudKoder avatar Apr 17 '21 16:04 KloudKoder

I have proposed a discussion of this issue at the May 11, 2021 community group meeting:

https://github.com/WebAssembly/meetings/compare/master...KloudKoder:patch-2

Probably we should have:

  1. Divide to produce quotient and remainder (perhaps with the limitation that neither may exceed half the dividend's width, as with classic X86). Consider carefully what happens when 0 is either dividend or divisor. Signed and unsigned results will differ, so we need separate instructions.
  2. Multiply 2 factors to obtain a product at double the width, including proper carry propagation. Signed or unsigned.
  3. Add-with-carry. (This always struck me as a silly name for an instruction. Of course you want to use a carry when you add! What it refers to is the addition of the carry bit as well. The carry bit would need to be provided in a register if in fact there is no carry flag on a given architecture, in which case it's "add 2 integers along with bit 0 of some other register".)
  4. Subtract-with-borrow. Same as #3 but backwards.
  5. Double-precision shift, allowing 2 integers to behave as one of double the size. By constant or variable distance.
  6. Double-precision rotate by constant or variable distance. (Not critical but nice to have.)

KloudKoder avatar Apr 23 '21 02:04 KloudKoder

I don't think you have proposed this yet (the link is a fork comparison, rather than a PR). Also, what would those 6 instructions be for?

penzn avatar Apr 23 '21 17:04 penzn

@penzn Thanks for pointing that out. PR submitted.

"Also, what would those 6 instructions be for?"

  • Divide and multiply from int(X) to or from int(2X) are used all the time in large integer libraries. They also come up constantly in practical programming, for example if you have an array where the length and width are both 32-bit, but the area might need 64 bits. For example, "allocate X objects of size Y" requires determining whether X*Y overflows the address width (sketched in C at the end of this comment).

  • Add-with-carry and subtract-with-borrow are important in large integer libraries as well. Even just dealing with 128-bit integers is a major problem in their absence. (I'm surprised that WASM has neither integer nor float support for them, but that's another discussion.)

  • Double-precision shifts are important for data (de)compression performance. Because it's often rather serialized, (de)compression tends to constitute a major source of overhead and is thus worthy of optimization.

  • Double-precision rotation is debatable. My intuition tells me that it's useful in gaming and crypto, but I admit that I don't have a compelling example.

There are native accelerating instructions for all of the above, other than the last one, in X86 and AMD64. Not sure about other architectures.
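Regarding the first bullet, a small C sketch (names are mine) of the overflow check that a widening multiply makes cheap:

    #include <stdint.h>

    /* "Allocate X objects of size Y": widen to 64 bits, multiply once,
       and test whether the product still fits the 32-bit address width. */
    static int alloc_size_ok(uint32_t count, uint32_t size, uint32_t *total) {
        uint64_t product = (uint64_t)count * (uint64_t)size;  /* int(X) -> int(2X) multiply */
        if (product > UINT32_MAX)
            return 0;                 /* would overflow the address space */
        *total = (uint32_t)product;
        return 1;
    }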

KloudKoder avatar Apr 24 '21 07:04 KloudKoder

Previous discussion: https://github.com/WebAssembly/design/issues/1021

lars-t-hansen avatar May 11 '21 07:05 lars-t-hansen

Here's my design for today's CG meeting. No doubt it will be debated and revised.

ADD-WITH-CARRY (ADC) & SUBTRACT-WITH-BORROW (SBB):

Having read the thread from @lars-t-hansen (thanks), and based on my experience writing multiprecision functions, I note the following constraints:

  1. We don't want to have multivalue inputs / outputs in use unless absolutely necessary. While I generally have no problem with them, in this case, they might be problematic because we don't want to burn an entire 64-bit register on a stupid carry flag -- at least not for transient carry propagation in, say, 128-bit addition.

  2. We don't want to touch the actual flags register on any ISA, if we can avoid it. In other words, we don't want to define it as "the" virtual carry flag, and leave it there forever. It's far too fluid to dedicate for that purpose.

  3. While it would be nice to have alternative flavors which only consume, or only produce, carries, they're not worth paying any complexity for. (The existing addition and subtraction instructions neither produce nor consume.)

  4. There is a tradeoff here: On the one hand, we don't want to burn a whole register and create movement overhead between the flags register and the virtual carry register in order to just propagate a carry from some low part to its corresponding high part. On the other, we absolutely need a place to dump the carry in case it needs to be used sometime far later. What to do?

I would suggest:

  1. You need to use produce-and-consume adds and subtracts if you want to do anything with the carry at all. In deference to efficiency, I'm actually in favor of supporting consume-only and produce-only, but it's not a big deal for me.

  2. There is a virtual carry register. It starts as zero and is only altered by carry-aware instructions. At that point, it survives only until the next branch (including returns). The first carry-aware instruction following a jump or return target is defined to consume the carry as zero. This allows us to be lazy up front (just clear the carry prior to every branch, but only in the currently nonexistent situation that a given binary contains any carry-awareness at all) but efficient in the future (be aware of branch targets and only clear the carry when necessary). It's convenient, at least on X86, that logic instructions automatically clear the carry for us.

  3. If the carry needs to survive long term, then we need to use CarryToInteger and CarryFromInteger. The former sets bit zero of an i32/i64 variable to the value of the carry flag. The latter does the opposite. It's overkill to destroy the rest of the register contents, which the programmer might not want to do. But this is debatable because SETC AL, to cite an X86 example, would be a faster way to do this than manipulating bit zero.

  4. Parity, sign, and zero flags could be copied to or from the carry flag or an integer register via dedicated instructions. Parity is expensive and would benefit from optimization, but the other 2 are probably not worth the trouble. That said, I'm not sure where parity is really a bottleneck these days.

DOUBLE-PRECISION SHIFT LEFT (DSL) & DOUBLE-PRECISION SHIFT RIGHT (DSR):

This "just works" if we use multivalue instructions. Let's do it that way. We shouldn't support shift-through-carry because it's not enough bang for the buck.
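For concreteness, a C model (my own; the helper name is illustrative) of what a double-precision left shift computes over a pair of 64-bit words:

    #include <stdint.h>

    /* Shift the 128-bit value hi:lo left by n (0 <= n <= 63).
       n == 0 is special-cased because shifting a 64-bit word by 64
       is undefined in C. */
    static void dshl128(uint64_t hi, uint64_t lo, unsigned n,
                        uint64_t *out_hi, uint64_t *out_lo) {
        if (n == 0) {
            *out_hi = hi;
            *out_lo = lo;
            return;
        }
        *out_hi = (hi << n) | (lo >> (64 - n));  /* bits promoted from the low word */
        *out_lo = lo << n;
    }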

DOUBLE-PRECISION ROTATE LEFT (DRL) & DOUBLE-PRECISION ROTATE RIGHT (DRR):

My intuition tells me this should exist, and it's minimal trouble if we implement DSL and DSR. The only caveat is that we need one additional scratch register.

DIVIDE-TO-SMALLER (D2S) & MULTIPLY-TO-BIGGER (M2B) (see the C sketch after this list):

  1. D2S64 divides an i64 by an i32 to produce an i32 quotient and an i32 remainder. This will facilitate more optimization in future compilers. It can even afford performance gains retroactively, if someone detects the use of both quotient and remainder within the same scope. (WASM could also do so if it wanted to.) Division by zero should produce a defined result, even though it's inevitably nonsensical. If this is already moot by virtue of an exception, so be it, but merely defined behavior would be nice.

  2. D2S32 divides an i32 by the low 16 bits of another i32 in order to produce an i32 quotient and an i32 remainder (with 16 high zeroes in both cases). Optionally, we could leave the result as a packed i16:i16.

  3. D2S16 divides the low 16 bits of an i32 by the low 8 bits of another i32 in order to produce an i32 quotient and an i32 remainder (with 24 high zeroes in both cases). Optionally, we could leave the result as a packed i8:i8.

  4. D2S128 divides an i128 encoded in a pair of i64 variables by an i64 in order to produce an i64 quotient and an i64 remainder. Optional but very helpful in bigint contexts.

  5. M2B64 inputs a pair of i32 registers and outputs an i64 product (which is exactly correct including proper carry propagation).

  6. M2B32 inputs the low 16 bits of a pair of i32 registers and outputs an i32 product. This is needed because it enables performance gains by explicitly utilizing a cheaper multiply.

  7. M2B16 inputs the low 8 bits of a pair of i32 registers and outputs an i32 product.

  8. M2B128 inputs a pair of i64 registers and outputs the product as an i128 encoded in another such pair. Optional but very helpful in bigint contexts.
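A C sketch of the 64-bit variants as I read them (the names echo the labels above, but the code is only illustrative; a zero divisor and an oversized quotient are left unhandled, per the caveats in the list):

    #include <stdint.h>

    /* D2S64: i64 dividend, i32 divisor, i32 quotient and i32 remainder.
       As with x86 DIV, the caller is assumed to guarantee that the
       quotient fits in 32 bits and that the divisor is nonzero. */
    static void d2s64(uint64_t n, uint32_t d, uint32_t *q, uint32_t *r) {
        *q = (uint32_t)(n / d);
        *r = (uint32_t)(n % d);
    }

    /* M2B64: two i32 inputs, exact i64 product. */
    static uint64_t m2b64(uint32_t x, uint32_t y) {
        return (uint64_t)x * (uint64_t)y;
    }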

KloudKoder avatar May 11 '21 12:05 KloudKoder

For the record, the presentation of this issue to the community group is being rescheduled, tentatively for June 8. This is due to cascading technical failures on my part which prevented me from participating today, despite a valiant attempt by the hosts to assist. I know how to prevent this from occurring next time.

KloudKoder avatar May 11 '21 17:05 KloudKoder

I'm having a hard time seeing that a "virtual carry register" fixes any actual problems. In an optimizing JIT, we'll build an SSA graph from the wasm input and then optimize that graph before allocating registers and pasting up the machine code. These optimization passes may merge and split and move blocks and may therefore change the semantics of the virtual carry unless they are careful to track that flag exactly; that is, with a virtual carry register the graph builder will just have to reconstruct the information about the carry flag that the program could have been providing in the first place. Representing the carry explicitly as an input and output of the carry-aware instructions seems much cleaner. This does not commit the JIT to "dedicating a 64-bit register" to the flag except when the flag is live at the end of the instruction that produces it (live in the sense that there is another instruction later that reads the output value) and cannot be kept in the flags register until it is consumed. Representing it explicitly has the nice side effect of getting rid of several instructions that manipulate that flag.

I was noodling around with some wasm bytecode sequences for MP arithmetic and it seems like there are several competing styles for instructions that take a carry in (due to the stack machine and no let to manipulate the stack yet): what is the order of inputs and results? An unrolled carry-aware loop that likes to push all the input operands first probably wants the carry as the last input argument:

   i32.load offset=0 (local.get $p)
   i32.load offset=0 (local.get $q)
   i32.load offset=4 (local.get $p)
   i32.load offset=4 (local.get $q)
   i32.load offset=8 (local.get $p)
   i32.load offset=8 (local.get $q)
   i32.const 0  ;; carry in
   i32.adc      ;; (carry-out, result) <- add-with-carry(x, y, carry-in)
   i32.store offset=0 (local.get $r)
   i32.adc
   i32.store offset=4 (local.get $r)
   i32.adc
   i32.store offset=8 (local.get $r)
   drop         ;; carry out

while a loop that likes to load operands as they come along would prefer the carry as the first input argument:

   i32.const 0   ;; carry in
   i32.load offset=0 (local.get $p)
   i32.load offset=0 (local.get $q)
   i32.adc       ;; (carry-out, result) <- add-with-carry(carry-in, x, y)
   i32.store offset=0 (local.get $r)
   i32.load offset=4 (local.get $p)
   i32.load offset=4 (local.get $q)
   i32.adc
   i32.store offset=4 (local.get $r)
   i32.load offset=8 (local.get $p)
   i32.load offset=8 (local.get $q)
   i32.adc
   i32.store offset=8 (local.get $r)
   drop         ;; carry out

and really the same question is there for the outputs: placing the carry-out on top of the stack forces a style where the results are processed after all the operations are done; placing it underneath (as above) forces the processing of the result after each operation.

Loading the inputs late and storing the results early is kinder to register-poor architectures. It may be that multiple styles of instruction are desirable.

An enumeration of use cases for add-with-carry and subtract-with-borrow would be useful. If MP arithmetic is the main use case, perhaps MP arithmetic instructions would be a better choice.

lars-t-hansen avatar Jun 10 '21 09:06 lars-t-hansen

@lars-t-hansen I guess the main use case for carry propagation is large int libraries (and sometimes encryption or hashing), which, due to this very issue, tend to require manual coding in assembly language for efficient carry propagation. However, without much loss of performance, one could implement them with WASM's ints: just use the low 31 (or 63) bits for storage and the high bit for carry propagation. The only weird side effect is the need for slightly more storage (or at worst double, e.g. a 64-bit int would now take 2 64-bit WASM ints). After results have been finalized, they could be recompressed in order to conform to standard representations.
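A rough C model of that expansion idea, assuming 63-bit limbs with the top bit reserved for the carry; names and limb count are illustrative:

    #include <stdint.h>

    #define LIMB_BITS 63
    #define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

    /* Add two expanded bigints, least-significant limb first.  Because each
       limb holds only 63 payload bits, the 64-bit add never wraps and the
       carry simply falls out of bit 63 -- no flag needed. */
    static void add_expanded(uint64_t r[], const uint64_t a[],
                             const uint64_t b[], int limbs) {
        uint64_t carry = 0;
        for (int i = 0; i < limbs; i++) {
            uint64_t s = a[i] + b[i] + carry;  /* at most 2^64 - 1 */
            r[i]  = s & LIMB_MASK;
            carry = s >> LIMB_BITS;            /* 0 or 1 */
        }
    }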

But if that's too ugly and you want an explicit carry register, then you would need multiple inputs and outputs to ADC/SBB. But then I have to wonder whether the overhead incurred in producing and consuming that extra register, on top of the fact that it wastes a whole register to begin with, is ultimately any cheaper than the foregoing expansion method.

With the expansion method, ADC/SBB would look like this on Intel:

    shl r0, 1
    adc/sbb r1, r2
    shr r0, 1

where "shr" is an unsigned right shift that also works for signed large ints because only the top word has signed behavior. Note that there would be 3 inputs because we need to know where to get the carry/borrow. At least in this case, you're only wasting a bit per register instead of an entire register.

While I don't like expansion, either, I'm offended by the idea of a global carry bit such as we have on Intel. It's just a bottleneck that constrains the entire system. (I don't think you're proposing that, though.)

KloudKoder avatar Jun 12 '21 04:06 KloudKoder

On second thought, with expansion, you have to pay the cost of expansion and recompression, which is probably higher than just burning a whole register for a carry. So in the latter case, "ADC r0, r1, r2" would look like this on Intel, where r0 is the destination, r1 is the source, and r2 is the carry (both in and out):

    xor r3, r3
    add r0, r2
    adc r3, r3
    xor r2, r2
    add r0, r1
    adc r2, r3

and "SBB r0, r1, r2" is then:

    xor r3, r3
    sub r0, r2
    adc r3, r3
    xor r2, r2
    sub r0, r1
    adc r2, r3

In theory, at least, an optimizer could then go through this code and realize that it could be replaced with equivalent native ADC/SBB instructions. But even if not, it's pretty quick. It's just rather serialized, so hopefully that would be hidden by memory latency in most cases.

KloudKoder avatar Jun 12 '21 04:06 KloudKoder

I'm not sure what instructions you would provide for MP arithmetic other than the core doubleword ops and some form of carry propagation. MP arithmetic has a sizeable number of algorithms that are all accelerated by offering the foundational operations and word representation of numbers, and new ones occasionally get discovered. Of course it's the case that WebAssembly is Turing complete, so any computation can be expressed in it, but what we want is to express the algorithms efficiently without complicating the underlying translation to machine code too much.

What Javascript did with native bignums was provide a few arithmetic operators, with abysmal performance in some applications because of the inability to access the representation of the numbers. One can't avoid an expensive multiprecision division at every step of modular exponentiation, as there is no way to express Montgomery reduction. One has to use Euclid's algorithm, rather than a half-gcd algorithm. That's the disadvantage of providing special-purpose operations.

wbl avatar Jun 12 '21 04:06 wbl

@wbl Yeah I agree. This issue is really just proposing that: carry propagation, double-precision shift/rotate, and expanding/contracting multiplication/division.

KloudKoder avatar Jun 12 '21 04:06 KloudKoder

But if that's too ugly and you want an explicit carry register, then you would need multiple inputs and outputs to ADC/SBB. But then I have to wonder whether the overhead incurred in producing and consuming that extra register, on top of the fact that it wastes a whole register to begin with, is ultimately any cheaper than the foregoing expansion method.

@KloudKoder, I still don't understand what you mean about "that extra register". Under reasonable assumptions the compiler should be able to keep the carry bit in the flags register for stretches of code. Performance-sensitive wasm programs would unroll their loops to take advantage of that capability. All I'm arguing is that this dependency (the carry value) should be explicit in the wasm code, not implicit as hidden state, since the latter confers no advantage and is generally a mess.

@wbl, what I had in mind (loosely) was CISCy instructions of the form

   <carry-out> <== i32.adc <result address>, <source address 1>, <source address 2>, <count>, <carry-in>

which are easier to optimize than individual instructions where we have to optimize carry propagation between them to keep the carry in the flags register. The compiler would unroll the loop implicit in the instruction, and if the count is constant it could do so extremely well; we already do this for memory.copy, for example. Here the representation is not hidden. But there is an assumption about memory-to-memory operation, and perhaps that is a showstopper for some.
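To make the intended semantics concrete, here is a C model of the memory-to-memory form as I read it; the argument order mirrors the sketch above, and the rest (names, the choice of i32 words) is illustrative:

    #include <stdint.h>
    #include <stddef.h>

    /* Word-at-a-time model of the proposed memory-to-memory i32.adc:
       result[i] = src1[i] + src2[i] + carry, for i = 0 .. count-1,
       with each result stored before the next operand words are read. */
    static uint32_t adc_mem(uint32_t *result, const uint32_t *src1,
                            const uint32_t *src2, size_t count,
                            uint32_t carry_in) {
        uint64_t carry = carry_in & 1;
        for (size_t i = 0; i < count; i++) {
            uint64_t s = (uint64_t)src1[i] + src2[i] + carry;
            result[i] = (uint32_t)s;
            carry = s >> 32;
        }
        return (uint32_t)carry;   /* final carry out */
    }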

lars-t-hansen avatar Jun 14 '21 12:06 lars-t-hansen

@lars-t-hansen I see, so what you want is an efficient instruction to say "ADC/SBB this memory region to that memory region and store the result over there", with an input for the initial carry and an output for the final carry. Yes? And would there also be a count-less version that just does it for a single i32/i64, in order to avoid all the overhead of determining whether "result" overlaps "source" or "destination" in such a way as to make forward walking impossible? In a memory copy, we can just copy backwards if necessary. But we can't add backwards. Maybe if there's any overlap other than all 3 regions being the same, we should just define the whole thing as a no-op, because it's just bad.

The "extra register" I was referring to was the register required to store the carry. Using the carry flag internally does help in your proposed memory-to-memory version, but you're also forcing a carry in and out, which will consume an extra register (and a few clocks) relative to not requiring that. In my original proposal, the carry is indeed implicit, but only survives until the next branch unless you explicitly extract it out to a register beforehand. I'm not sure if there's a way to say "don't generate this output" or "input as zero" in WASM (which would allow your carry in and out to be shunted in cases where they're unnecessary). In many RISC architectures, this would just be reads or writes from or to r0.

KloudKoder avatar Jun 15 '21 13:06 KloudKoder

@KloudKoder, regarding overlap for a memory-to-memory operation, that's a good point, but if the spec is written appropriately I think a separate count-less version is probably not necessary. If the spec were to say, for example, that the behavior of the memory-to-memory ADC instruction must be as-if operand words are read, operated upon, and the result stored before the next operand words are read, then there would be no ambiguity if the values overlap, nor any particular need to check for overlap, I think. (The compiler's hands would be tied a bit as to streaming the operations and e.g. using 64-bit operations when available, but this doesn't seem deeply critical.) And if the count is a constant 1 there might not be a need to check for overlap under any circumstances, as the result would never be written until after the operands have been read.

I emphasise that the memory-to-memory op is just an alternative that I thought ought to be discussed as to how it addresses use cases; it may still be that a simple register-to-register primitive would be better for more use cases.

Re the extra register: If the carry flag is initially set from the value zero (something the compiler will usually see) then no register will hold that zero; the compiler will just emit CLC. If the final carry-out flag has no consumer then the compiler should generate no code to capture it in a register. Of course, if there is a reason to propagate the carry across regions where it can't be kept in the carry flag then it must be held in a register, but I don't think that's what you're getting at.

lars-t-hansen avatar Jun 21 '21 13:06 lars-t-hansen

@lars-t-hansen You covered it brilliantly. Yes, your "as if" method kills streaming (for example converting i32 operations to i64 operations) but frankly if you're dumb enough to do bigints with i32 then you get what you deserve. And SIMD is of no help because it doesn't propagate carries, so no opportunity cost there. The only real loss is maybe the ability to do i128 under the hood, but I don't know of any architecture that currently supports an ADC like that.

I'm leaning toward just doing it as you suggested but getting rid of troublesome references to the carry flag that will force the compiler to worry about consumers across the whole spaghetti branch landscape. In other words, if you really need a carry flag to exist even after we provide you with memory-to-memory, then you need to reserve an extra bit or more after the MSB of the bigint, or do something gross like wrap testing. Niche cases don't deserve optimization, unless somehow this isn't a niche case? Tell me what I'm not seeing here.

KloudKoder avatar Jun 23 '21 15:06 KloudKoder

@KloudKoder, regarding the final carry-out, in a normal compiler pipeline this will not really cause any extra work because tracking consumption of that value is just part of what the compiler does already, and then it's a very small amount of work to capture the carry-out or not depending on whether the value has any users or not. So one could go either way on that design point. Whether having the carry is useful or not for programs is something I think I'm less well qualified to determine.

I think we'll want to get some use cases onto the table here, now that we have sketched out some alternative designs, to see how the designs meet those cases. There's also the matter of how these designs can be used effectively by tools (eg compilers), if at all. 30s of googling finds a little bit of related work: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173, presumably there's more out there.

lars-t-hansen avatar Jun 25 '21 13:06 lars-t-hansen

@lars-t-hansen I think it's telling that that 2017 feature request on GCC is still marked "NEW" as of 2021. So yeah, that underscores the justification for caring about this.

I take what you said to mean that WASM's own compiler (and not some generic compiler) is smart enough to do nothing with the final carry if in fact it's not consumed elsewhere. That would be sufficient, in my view, to justify defining the instruction in such a way that a carry output is always required. The trouble is, though, that doing so then forces a higher level language to allocate a variable for that final carry, in order to have it ignored by the WASM compiler. That seems awkward, but maybe it's unavoidable. (I take it that WASM doesn't have a way to say "trash this output", which it probably should but it's likely too late.)

Having written a lot of multiprecision code myself, I can't think of a single case where I made use of a final carry out, mainly because the maximum sum (or product) size is knowable ahead of time. But assuming that someone takes the lazy approach of expand-on-carry, then we have another problem, which is that we actually need expand-on-overflow if adding signed bigints. So then we need: 32/64, add/subtract, and signed/unsigned flavors of your instruction. Perhaps that's still better than forcing a subsequent carry/overflow test involving inspection of the high words of the sum and addends.
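For reference, the inspection of the high words can itself be branch-free; a small C sketch (mine) of the unsigned-carry versus signed-overflow distinction being drawn here, for a top-limb sum r = a + b:

    #include <stdint.h>

    /* Unsigned carry out of the top limb: the sum wrapped past 2^64. */
    static uint64_t add_carry_out(uint64_t a, uint64_t b, uint64_t r) {
        (void)b;          /* r < a already implies the wrap */
        return r < a;
    }

    /* Signed overflow of the top limb: a and b share a sign that r lacks. */
    static uint64_t add_overflow_out(uint64_t a, uint64_t b, uint64_t r) {
        return ((a ^ r) & (b ^ r)) >> 63;
    }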

What do you think?

KloudKoder avatar Jun 28 '21 15:06 KloudKoder

... One can't avoid an expensive multiprecision division at every step of modular exponentation, as there is no way to express Montgomery reduction. One has to use Euclid's algorithm, rather than a half-gcd algorithm. ...

It seems that the signed right shift operator and BigInt.asUintN(n, bigint) allow one to implement Montgomery reduction and half-gcd.

Yaffle avatar Jul 11 '21 10:07 Yaffle