riscv-isa-manual

Extend 48 and 64 bit instruction length unary encoding style to 128 bits

Open brucehoult opened this issue 6 years ago • 22 comments

At present 48 bit and 64 bit instructions are encoded in the first 16 bit parcel as:

xxxxxxxxxx011111 48 bits
xxxxxxxxx0111111 64 bits

This is a good scheme. Each extra 16 bits of opcode gets 15 bits of usable new space.

But then:

x000xxxxx1111111 80 bits
x001xxxxx1111111 96 bits
x010xxxxx1111111 112 bits
x011xxxxx1111111 128 bits
...
x110xxxxx1111111 176 bits (22 bytes)

While this works out well for the longer sizes it leaves the 80 bit encoding definitely too tight to include a 64 bit literal -- if you wanted both int64 and fp64 literal loads to an arbitrary register then that's the entire 80 bit opcode space used (except perhaps rd = 0).
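As a minimal sketch of the current convention (the one quoted from the spec later in this thread), the length can be read off the first 16-bit parcel as below. The function name `current_length_bits` is made up for illustration, and the free-bit arithmetic at the end simply restates the "too tight" argument above.

```python
from typing import Optional


def current_length_bits(parcel: int) -> Optional[int]:
    """Length in bits implied by the first 16-bit parcel, or None if reserved."""
    if (parcel & 0b11) != 0b11:
        return 16                      # compressed encodings end in 00, 01 or 10
    if (parcel & 0b11100) != 0b11100:
        return 32                      # bits [4:2] are not all ones
    if (parcel & 0b111111) == 0b011111:
        return 48
    if (parcel & 0b1111111) == 0b0111111:
        return 64
    # Bits [6:0] are all ones here: bits [14:12] count extra 16-bit words.
    nnn = (parcel >> 12) & 0b111
    if nnn == 0b111:
        return None                    # reserved for longer encodings
    return 80 + 16 * nnn


# 80-bit instructions: 7 low bits plus 3 length bits are fixed.
FREE_BITS_80 = 80 - 7 - 3              # 70 free bits
LEFT_AFTER_LITERAL = FREE_BITS_80 - 64 - 5  # 64-bit literal plus rd leaves 1 bit
```

The single leftover bit is just enough to distinguish an int64 load from an fp64 load, which is why those two instructions would fill the 80-bit space.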

I propose:

xxx0xxxxx1111111 80 bits
xx01xxxxx1111111 96 bits
x011xxxxx1111111 112 bits
0111xxxxx1111111 128 bits
1111xxxxx1111111 reserved for >= 144 bits (18 bytes)
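A rough Python sketch of this proposed table, for comparison with the current scheme. The names `proposed_length_bits` and `free_bits` are hypothetical; the sketch only handles parcels whose bits [6:0] are all ones, and it counts free bits as the total length minus the bits fixed by the length encoding.

```python
from typing import Optional


def proposed_length_bits(parcel: int) -> Optional[int]:
    """Length under the proposed unary scheme; None for the reserved 1111 prefix."""
    if (parcel & 0x7F) != 0x7F:
        raise ValueError("parcel is not in the 80-bit-and-longer space")
    length = 80
    top = (parcel >> 12) & 0xF         # bits [15:12]
    while top & 1:                     # each extra one bit adds 16 bits
        top >>= 1
        length += 16
    return length if length <= 128 else None


def free_bits(length: int, proposed: bool) -> int:
    """Bits not consumed by the length encoding, for lengths 80 to 128."""
    if proposed:
        fixed = 7 + (length - 80) // 16 + 1   # unary field grows with the length
    else:
        fixed = 7 + 3                         # current scheme: bits [14:12]
    return length - fixed
```

Under these assumptions `free_bits(80, True) - free_bits(80, False)` is 2 and the same difference for 128 bits is -1, which matches the pros and cons listed next.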

Pros and Cons

Pro: 80 bits gets two more bits of encoding space, and 96 bits gets one more bit of encoding space.

Con: 128 bits gets one bit less of encoding space

Con: 144, 160, 176 bits and 192 bits and higher lose three bits of encoding space (assuming a counting scheme using two bits in the next 16 bit bundle for 144, 160, and 176). Note that 128 bits (16 bytes) is already a byte longer than the maximum allowed instruction length on AMD64.

Pro: postpone more complex decoding logic for longer?

Pro: loading int64 and fp64 immediates to an arbitrary int or fp (as appropriate) register can be done in an 80 bit instruction, while still leaving 3x more unused 80 bit opcodes than with the current scheme. There would even be room for jal64 and destructive addi64, andi64, ori64 .. just as an example .. while still leaving as much unused space as with the present scheme.

To me, encoding space at 80 bits especially still feels precious. Being able to load a double precision floating point value from the instruction stream seems like it's probably a significant win.

brucehoult, Nov 28 '18 07:11

I don't think we're proposing to ratify the >32-bit instruction-length encoding scheme at this time, since there are no proposed instructions longer than 32 bits. So, we can continue this discussion past the base-ISA ratification period.

I understand the desire to extend the unary instruction-length encoding pattern. OTOH, two more considerations countervail your proposal: (a) no one really wants to build instruction-fetch units that can't determine the instruction length from the first parcel, so "reserved for >= 144 bits" means "max. 144 bits" in practice; (b) bit 15 would ideally not participate in the length encoding scheme, because it already serves as part of the rs1 field.

(We could perhaps split the difference and extend the unary encoding by only one notch, which would accomplish your goal of expanding the 80-bit encoding space, while still supporting up to 144-bit instructions and avoiding eating bit 15.)

aswaterman, Nov 28 '18 10:11

Wild and crazy idea...

Instead of having some binary encoded "number of 16 bit packets" field in the next (2nd) 16 bit packet ... what about if 1111xxxxx1111111 means the instruction is 128 bits long (just like 0111xxxxx1111111) plus what looks like the next instruction, encoded in the normal way.

So, for example, a 144 bit (18 byte) instruction is 1111xxxxx1111111 followed by seven arbitrary 16 bit chunks, followed by xxxxxxxxxxxxxxaa with aa != 11 (standard 16 bit encoding)

Pro: supports arbitrarily long instructions with a finite encoding scheme.

Pro: some tools (and maybe even some parts of instruction fetch/decode) could be slightly simpler.

Con: serial decoding of very long instructions.
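The chaining idea can be sketched as follows, assuming the 16- to 128-bit lengths come from the proposed unary table earlier in this issue. `base_length_bits` and `chained_length_bits` are illustrative names, and `parcels` is assumed to be a list of 16-bit integers starting at the instruction's first parcel.

```python
from typing import List

CHAIN_MASK = 0xF07F  # bits [15:12] and [6:0] all ones: "128 bits, then keep going"


def base_length_bits(parcel: int) -> int:
    """Length of a non-chained encoding (16 to 128 bits)."""
    if (parcel & 0b11) != 0b11:
        return 16
    if (parcel & 0b11100) != 0b11100:
        return 32
    if (parcel & 0x3F) == 0x1F:
        return 48
    if (parcel & 0x7F) == 0x3F:
        return 64
    ones = 0
    top = (parcel >> 12) & 0xF
    while top & 1:                     # at most three ones for a non-chain parcel
        ones += 1
        top >>= 1
    return 80 + 16 * ones


def chained_length_bits(parcels: List[int]) -> int:
    """Total length: 128 bits per chain prefix plus the final ordinary encoding."""
    total = 0
    index = 0
    while (parcels[index] & CHAIN_MASK) == CHAIN_MASK:
        total += 128                   # prefix parcel plus seven arbitrary parcels
        index += 8
    return total + base_length_bits(parcels[index])
```

For the 144-bit example above, `chained_length_bits([0xF07F] + [0xFFFF] * 7 + [0x0001])` gives 144. The loop also makes the serial-decoding cost visible: each chain step has to be resolved before the next one can be located.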

brucehoult, Nov 28 '18 10:11

We should optimize the encoding under the assumption that it will be unpopular to define instructions whose length cannot be encoded within the first parcel.

aswaterman, Nov 28 '18 10:11

My reading of the spec is that the encoding scheme for 48- and 64-bit instructions is fixed. If that is not the intent, the spec should be updated to make that clear.

asb, Nov 28 '18 10:11

@aswaterman I certainly agree with that. We want to leave the door unlocked, but maybe a bit sticky...

What is going to make an instruction longer than 16 bytes -- that is 14 bytes, plus maybe 1 bit, of useful payload? OK, put a 64 bit literal in a 128 bit instruction and you've still got 48 (49, with your compromise) bits of space for register fields -- let's say four of those for 20 bits. So 28 (29) bits left. That's a heck of a lot of opcodes, rounding modes, negations of operands etc.

What are you going to do to exceed a 128 bit instruction, other than including multiple big literals (not very RISCy)? OK, maybe there will be 128 bit literals one day. A 144 bit instruction will let you do basically load immediate and nothing else.
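The bit budget in the last two paragraphs can be checked with a few lines of Python. This is only a restatement of the arithmetic; `bits_left` is a made-up helper, and it assumes (as the paragraph above seems to) that the first 16-bit parcel is treated as overhead apart from an optional spare bit, and that each register field is 5 bits.

```python
def bits_left(total_bits: int, literal_bits: int,
              reg_fields: int = 0, spare_first_parcel_bits: int = 0) -> int:
    """Bits remaining for opcodes, rounding modes, etc. after literal and registers."""
    payload = total_bits - 16 + spare_first_parcel_bits
    return payload - literal_bits - 5 * reg_fields


# 128-bit instruction, 64-bit literal, four register fields
LEFT_128 = bits_left(128, 64, reg_fields=4)
# Same, with the one extra bit from the compromise scheme
LEFT_128_COMPROMISE = bits_left(128, 64, reg_fields=4, spare_first_parcel_bits=1)
# 144-bit instruction carrying a 128-bit literal
LEFT_144 = bits_left(144, 128)
```

These come out to 28, 29 and 0 respectively, so a 144-bit instruction with a 128-bit literal has to fit everything else into the first parcel.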

You could say that basically 144 bits is it, but make rd = 0 the escape hatch to longer instructions if someone really really wants them.

brucehoult, Nov 28 '18 10:11

@asb which text suggested 48- and 64-bit encodings are fixed, but 80+-bit encodings are not? (Not disputing your interpretation; I just don't recall having read that recently.)

aswaterman, Nov 28 '18 11:11

@brucehoult I'm not making a case for longer instructions; I'm just observing that instruction-length reach is a resource being traded here.

aswaterman, Nov 28 '18 11:11

> @asb which text suggested 48- and 64-bit encodings are fixed, but 80+-bit encodings are not? (Not disputing your interpretation; I just don't recall having read that recently.)

My reading is that the 48-bit, 64-bit, and 80+-bit encodings are all fixed. But I care most about 48 and 64-bit, as they're the ones people are most likely to use in non-standard extensions. See section 1.4, figure 1.1 and the associated text. It doesn't explicitly say "this is fixed", but I think the natural assumption when reading a spec is that things are fixed unless they're marked as subject to change :)

""" Figure 1.1 illustrates the standard RISC-V instruction-length encoding convention. All the 32-bit instructions in the base ISA have their lowest two bits set to 11. The optional compressed 16-bit instruction-set extensions have their lowest two bits equal to 00, 01, or 10. Standard instruction-set extensions encoded with more than 32 bits have additional low-order bits set to 1, with the conventions for 48-bit and 64-bit lengths shown in Figure 1.1. Instruction lengths between 80 bits and 176 bits are encoded using a 3-bit field in bits [14:12] giving the number of 16-bit words in addition to the first 5×16-bit words. The encoding with bits [14:12] set to 111 is reserved for future longer instruction encodings. """

If 48/64-bit encodings aren't fixed, the text should IMHO say something like "One potential encoding for 48 and 64-bit instructions is ... However, the standard encoding scheme for such instructions is not yet fixed".

asb, Nov 28 '18 11:11

OK, I was confused by the 48-/64-bit part.

We should make it clear which parts of the Introduction, and which aspects of the instruction-length encoding scheme, are up for ratification.

aswaterman, Nov 28 '18 11:11

@aswaterman yes, but if the maximum instruction length that can be encoded in the first 16 bit packet (leaving room for rd and rs1) were the overriding criterion, then you'd skip the unary encoding for 48 and 64 and go straight to a 5 bit binary count of packets for everything past 32 bits, giving a maximum of 48 + (16 * 31) = 544 bits (68 bytes) entirely specified in the first packet and leaving bits 15 and 7-11 untouched.

That ought to be enough for anyone.
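A quick sketch of what such a pure binary count might look like. The comment above only says which bits stay untouched (15 and 7-11), so the way the 5-bit count is assembled here, from bits [14:12] as the high part and bits [6:5] as the low part, is an assumption for illustration, as is the name `binary_count_length_bits`.

```python
def binary_count_length_bits(parcel: int) -> int:
    """Length when everything past 32 bits uses a 5-bit count of extra parcels."""
    if (parcel & 0b11) != 0b11:
        return 16
    if (parcel & 0b11100) != 0b11100:
        return 32
    # Bits [4:0] are all ones. Assumed layout: count = bits [14:12] : bits [6:5].
    count = (((parcel >> 12) & 0b111) << 2) | ((parcel >> 5) & 0b11)
    return 48 + 16 * count


MAX_BITS = 48 + 16 * 31   # 544 bits
MAX_BYTES = MAX_BITS // 8  # 68 bytes
```

Bits 15 and 11:7 never enter the computation, so rd and the low bit of rs1 stay free in every length.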

brucehoult, Nov 28 '18 12:11

@asb I'd think technically everything is subject to change, as nothing has been ratified yet. If you've built hardware already then you know (or should have known) that at some level you're taking a risk.

More practically, I imagine it would take pretty extraordinary circumstances to cause a change that breaks already-shipped hardware.

But things that no one has implemented yet are surely still up for discussion and re-evaluation?

brucehoult, Nov 28 '18 12:11

> @asb I'd think technically everything is subject to change, as nothing has been ratified yet. If you've built hardware already then you know (or should have known) that at some level you're taking a risk.
>
> More practically, I imagine it would take pretty extraordinary circumstances to cause a change that breaks already-shipped hardware.
>
> But things that no one has implemented yet are surely still up for discussion and re-evaluation?

Yes, I think there's probably flexibility there though I would have a lot of sympathy for anyone who had a WIP design and was stung by an encoding scheme change. Really I meant that if the intent is not to ratify the current proposed encoding schemes then that should be made clear in the manual.

asb, Nov 28 '18 12:11

@aswaterman what about extending by 32 bits at a time after 96?

xxxxxxxxxx011111 48 bits
xxxxxxxxx0111111 64 bits
xxx0xxxxx1111111 80 bits
x001xxxxx1111111 96 bits
x011xxxxx1111111 128 bits
x101xxxxx1111111 160 bits
x111xxxxx1111111 reserved OR 192 bits if rd = 0 is the escape mechanism

160 bits has 150 bits left after taking out the fixed bits. That's enough for a 128 bit literal, plus 22 bits for register fields and opcodes: could be rd, rs1, rs2 and still seven bits of opcode.
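A small Python sketch of the 80-bit-and-up part of this table, plus the 160-bit budget from the paragraph above. The function name `step32_length_bits` is hypothetical, and the x111 row is simply treated as reserved (the rd = 0 escape is not modelled).

```python
from typing import Optional


def step32_length_bits(parcel: int) -> Optional[int]:
    """Length for parcels with bits [6:0] all ones; None for the x111 row."""
    if (parcel & 0x7F) != 0x7F:
        raise ValueError("parcel is not in the 80-bit-and-longer space")
    if ((parcel >> 12) & 1) == 0:
        return 80                              # xxx0: bits 13 and 14 stay free
    step = (parcel >> 13) & 0b11               # bits [14:13]
    return {0b00: 96, 0b01: 128, 0b10: 160}.get(step)


FREE_160 = 160 - 7 - 3                 # 150 bits after the fixed bits
AFTER_LITERAL = FREE_160 - 128         # 22 bits for registers and opcode
OPCODE_BITS = AFTER_LITERAL - 3 * 5    # rd, rs1, rs2 leave 7 opcode bits
```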

brucehoult, Nov 28 '18 12:11

@asb I agree. If there is anything that is not proposed to be ratified in this round then that should be stated. The default assumption is everything in the document will become canon.

brucehoult, Nov 28 '18 12:11

We will add clarification to the introduction that only the 16- and 32-bit encodings are being ratified at this time.

aswaterman, Nov 30 '18 22:11

Closing this for now, as I don't think there's anything actionable until more use cases for 80-bit instructions emerge.

aswaterman, Dec 21 '18 01:12

Bruce Hoult just mentioned this issue to me in a private email and I have an alternative suggestion for the instruction length encoding:

x FFF RRRRR ??bbb11 32-bit (bbb != 111)
x FFF RRRRR 0011111 48-bit
x FFF RRRRR 0111111 64-bit
x FFF RRRRR 1011111 80-bit
x FFF RRRRR 1111111 96-bit and more

Legend:
x     = bit 15, aka rs1[0]
FFF   = funct3
RRRRR = Rd field

For insns with >= 96-bit the length would be 96 + 16*FFF, with FFF=111 reserved.
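In Python, the table above could be decoded roughly like this. It is only a sketch of the proposal as written in this comment; `wolf_length_bits` is an illustrative name, and the 16-bit case is carried over from the existing base convention.

```python
from typing import Optional


def wolf_length_bits(parcel: int) -> Optional[int]:
    """Length from the first 16-bit parcel under the alternative scheme."""
    if (parcel & 0b11) != 0b11:
        return 16
    if (parcel & 0b11100) != 0b11100:
        return 32                      # bbb != 111
    size = (parcel >> 5) & 0b11        # bits [6:5]
    if size != 0b11:
        return 48 + 16 * size          # 00 -> 48, 01 -> 64, 10 -> 80
    fff = (parcel >> 12) & 0b111       # funct3 doubles as the length field
    if fff == 0b111:
        return None                    # reserved
    return 96 + 16 * fff               # 96 up to 192 bits
```

With FFF running from 000 to 110 the longest length encodable in the first parcel is 96 + 16 * 6 = 192 bits.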

For instructions with up to 80 bits length we could call funct3 the "major opcode" and funct7 the "minor opcode" in this format.

This would cost one bit of encoding space in 48-bit instructions, but we would gain a bit in the 80-bit encoding space, and we would streamline the instruction format for the whole range from 32-bit instructions to 80-bit instructions. That would allow us to use the same instruction encodings for load immediate instructions with 32-bit immediate, 48-bit immediate, and 64-bit immediate.

For long load-immediate instructions I would propose something like the following encoding for the LSB 16-bit of the instruction:

E000RRRRRSS11111

E     .... 1 = ones-extend the immediate to XLEN, 0 = zero-extend the immediate to XLEN
RRRRR .... destination register
SS    .... size of the immediate (00=32-bit, 01=48-bit, 10=64-bit)

I'd propose that the immediate for those instructions would just follow in the next bytes, LSB first, without any bit permutations.
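To make the layout concrete, here is a hedged Python sketch that packs and unpacks such a load-immediate. `encode_li` and `decode_li` are hypothetical helpers; the sketch assumes the first parcel is stored little-endian like other RISC-V instruction parcels, defaults to XLEN = 64, and treats "ones-extend" as filling the upper bits with ones.

```python
from typing import Tuple

SIZE_TO_SS = {32: 0b00, 48: 0b01, 64: 0b10}


def encode_li(rd: int, imm: int, size_bits: int, ones_extend: bool = False) -> bytes:
    """Build E000RRRRRSS11111 followed by the immediate, LSB first."""
    ss = SIZE_TO_SS[size_bits]
    parcel = (int(ones_extend) << 15) | ((rd & 0x1F) << 7) | (ss << 5) | 0b11111
    imm_bytes = (imm & ((1 << size_bits) - 1)).to_bytes(size_bits // 8, "little")
    return parcel.to_bytes(2, "little") + imm_bytes


def decode_li(insn: bytes, xlen: int = 64) -> Tuple[int, int]:
    """Return (rd, value written to rd) for an instruction built by encode_li."""
    parcel = int.from_bytes(insn[:2], "little")
    ss = (parcel >> 5) & 0b11
    if (parcel & 0x701F) != 0x001F or ss == 0b11:
        raise ValueError("not a long load-immediate")
    size_bits = 32 + 16 * ss
    rd = (parcel >> 7) & 0x1F
    value = int.from_bytes(insn[2:2 + size_bits // 8], "little")
    if (parcel >> 15) and size_bits < xlen:
        value |= ((1 << xlen) - 1) & ~((1 << size_bits) - 1)  # fill upper bits with ones
    return rd, value & ((1 << xlen) - 1)
```

For example, `encode_li(10, 0x12345678, 32)` yields the six bytes `1f 05 78 56 34 12`, i.e. a 48-bit instruction whose immediate can be read straight out of the instruction stream without any bit shuffling.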

This would mean we'd spend 1/8th (12.5%) of the 48/64/80 bit encoding space on integer load-immediate instructions. I think that's absolutely justified, given the general importance of load-immediates. (I would expect that for many cores large load-immediate instructions will be the only instructions >32 bits they will support. So we might as well make sure they have a straight-forward streamlined encoding.)

cliffordwolf, Apr 22 '19 12:04

I generally like @cliffordwolf's scheme. It has the benefits he mentions, plus it allows up to 192 bit instructions with the length encoded purely in the first 16 bit packet.

One point is there is no explicit mention of what to do with FP literals which I think are at least as important as integer ones. One option is to use a 2nd value in the FFF field for them, 100 say, or 001.

Perhaps it's useful to be able to specify expanding a 32 bit FP value to the 64 bit encoding instead of NAN-boxing it, in which case the E bit can be used for this.

It's a bit annoying that 48 bit FP values aren't a standard thing, so that combination might go unused. Meanwhile we'd like to be able to load FP16 literals, but there's no obvious way to do that with less than a 48 bit instruction.

brucehoult, Apr 22 '19 20:04

This proposal also seems OK, but I don’t think we should commit to anything until we have a clearer idea of how the 48-bit space will be used. Instructions that load FP/int literals seem attractive but I do wonder if their quantitative benefit is being assumed to be greater than it actually is. We should employ a quantitative approach to this problem. (I realize there are considerations beyond total program size, like I$/ITLB pressure being lower for some apps, or load-use delay, but those can be quantified, too. There are also situations where the constant pool is slow to access, but that can usually be worked around with extra effort at load/boot time.)

If FP16 literal loads actually prove important enough, they could be provided with a new 32-bit instruction that needs 21 operand bits, or instead with a LUI plus a new 32-bit R-type instruction that copies the literal from bits 31:16 of an integer register into bits 15:0 of a float register.

aswaterman, Apr 22 '19 22:04

> One point is there is no explicit mention of what to do with FP literals which I think are at least as important as integer ones. One option is to use a 2nd value in the FFF field for them, 100 say, or 001.

Off the top of my head, I think there are only three candidates for instructions with a single large immediate and just a destination register:

  • Load integer immediate
  • Load FP immediate
  • Jump-and-link

I don't think there is a need for a large-immediate variant of AUIPC, but maybe I just don't see the obvious use case right now.

That would leave space for 635 instructions using funct3 as "major" and funct7 as "minor" opcode.

> Perhaps it's useful to be able to specify expanding a 32 bit FP value to the 64 bit encoding instead of NAN-boxing it, in which case the E bit can be used for this.

If you implement "Q" then maybe you actually want three instructions:

  1. NaN box as 32-bit float
  2. Convert to 64-bit float and NaN box as that
  3. Convert all the way to a 128-bit float

But since I wouldn't know what to do with the E bit for jump-and-link we could fit the third load fp-immediate into the other half of the jump-and-link opcode encoding space.

cliffordwolf, Apr 23 '19 17:04

Another obvious use for a "major opcode" would be mirrors of the 32-bit OP and OP-FP opcode spaces, with large immediates. There's a neat encoding for this that I really like:

  • Store the original "rs1" field in the rs? field for the immediate.
  • Then set rs1[4:2] to the funct3 of the original opcode
  • Use rs1[1:0] to indicate which operand is the immediate (with 11=reserved)

For OP-FP with large immediate the immediate would be converted into the precision the instruction expects, as indicated by funct7[1:0].

Furthermore I'd reserve the major opcode "111" for an "escape" mechanism into larger instruction encoding spaces. Here opcode E=1 would always be reserved for custom extensions, and with E=0 the rd field would select one of 32 "pages", and we start anew with an opcode[6:0] field in the next 16 bit word. This guarantees we'll never run out of encoding space for instructions of any length, if they are rare enough that the 16 bit prefix can be afforded.

Including those we'd have the following major opcodes used:

For 48-bit instructions:

  1. Load 32-bit integer immediate
  2. Load 32-bit FP immediate
  3. Jump-and-link
  4. Escape to 32 more "pages"

For 64-bit instructions:

  1. Load 48-bit integer immediate
  2. Jump-and-link
  3. "OP" with 32-bit immediate
  4. "OP-32" with 32-bit immediate
  5. "OP-FP" with 32-bit immediate
  6. Escape to 32 more "pages"

For 80-bit instructions:

  1. Load 64-bit integer immediate
  2. Load 64-bit FP immediate
  3. Jump-and-link
  4. "OP" with 48-bit immediate
  5. Escape to 32 more "pages"

Starting with the 96-bit instruction length, where funct3 is used to indicate the length, the first 16-bit word would just select a "page" in rd. Thus "OP" and "OP-FP" with a 64-bit immediate would be 112-bit instructions. But they would live in a vast encoding space.

> plus allows up to 192 bit instructions with the length encoded purely in the first 16 bit packet.

After that I'd just use the rd field to indicate the remaining length in 16-bit units. And the whole first 16 bit word would just be a prefix into a single "page". E=1 would always be reserved for custom extensions.

I think that would enable instructions of up to 864 bits. (Modulo off-by-one errors in my math. :)

Finally 7FFF (E=0) would be reserved as a prefix for longer instructions and FFFF (E=1) would be reserved as a guaranteed illegal instruction.

> This proposal also seems OK, but I don’t think we should commit to anything until we have a clearer idea of how the 48-bit space will be used.

Yes, but: I think it makes sense to set a tentative roadmap for this now, as this roadmap can inform how instructions are allocated in the 32-bit encoding space. (Because of things like the mirrors of OP and OP-FP instructions.)

Also, I would assume that there is a bit of a chicken-egg problem going on here. In order to get a better idea of how the 48-bit space will be used, it seems necessary, or at least useful, to put out a concrete suggestion that people can then criticize.

> Instructions that load FP/int literals seem attractive but I do wonder if their quantitative benefit is being assumed to be greater than it actually is. We should employ a quantitative approach to this problem.

Absolutely. But I think the best way to do that is to throw out a suggestion that can then be discussed, implemented in compilers, and benchmarked accordingly.

> I realize there are considerations beyond total program size, like I$/ITLB pressure being lower for some apps, or load-use delay, but those can be quantified, too.

Yes, but this depends not only on the app but also on microarchitectural choices. And maybe CPU projects with these kinds of requirements simply don't like RISC-V that much, or won't take charge of leading the standardization effort for larger RISC-V instructions.

Consider for example large processors with many harts and many, many hyperthreads for GPU-like applications. There you really want to avoid random memory accesses for loading constants, because the memory bus is a highly contested resource on those machines.

Or, on the other end of the spectrum, processors that use very slow ROM (such as a QSPI flash) where random memory access for loading constants can be a huge performance killer.

I think to a large degree there's an "If you build it, they will come" aspect here, where nothing will happen until we first throw out a proposal.

And it's not like the current spec simply says that the prefix "11111" is reserved for longer instructions. No, it contains a concrete proposal for the length encoding. If we can improve on that, then I think we should do that. And if we create some structure where opcodes can be allocated, then chances are higher that people will start using larger instructions, even if maybe only for custom extensions.

cliffordwolf, Apr 24 '19 07:04

I've expanded a bit on the above ideas and posted the resulting proposal here: http://svn.clifford.at/handicraft/2019/rvlonginsn/README

cliffordwolf, Apr 29 '19 09:04