riscv-cheri icon indicating copy to clipboard operation
riscv-cheri copied to clipboard

SCBNDSI redundancy?

Open nwf opened this issue 1 year ago • 7 comments
trafficstars

SCBNDSI has a 5-bit immediate, uimm, and a gated (by s) shift-by-4 operation, in https://github.com/riscv/riscv-cheri/blob/1c43ce2fe5688057d6108a5f901574f2dac0acd0/src/insns/scbnds_32bit.adoc?plain=1#L43 which means that there are two ways to request bound lengths of 0 (unshifed or shifted 0) and 16 (unshifed 16, shifted 1). Perhaps "shifted 0" and "shifted 1" could be put to better use (as requests for 512 and 528, say)?

nwf avatar Oct 02 '24 18:10 nwf

Avoiding redundant encodings here would be nice and I agree that having shifted zero be 512 would make a lot of sense here. Not sure how much that increases decoder complexity, but if we are already special casing, maybe we could also have shifted 1 be something more useful like 1024 or 4096?

arichardson avatar Oct 10 '24 16:10 arichardson

I'd like to see some distribution of object sizes before thinking about adding that. And it will probably be different on 32- and 64-bit systems. It might be worth marking those as explicitly invalid encodings that are reserved for future expansion.

davidchisnall avatar Oct 10 '24 16:10 davidchisnall

I'd like to see some distribution of object sizes before thinking about adding that. And it will probably be different on 32- and 64-bit systems. It might be worth marking those as explicitly invalid encodings that are reserved for future expansion.

That is a good argument - I believe @jonwoodruff did some analysis here. For now I've opened a PR to reserve these encodings.

arichardson avatar Oct 10 '24 21:10 arichardson

These are data samples from Spec2006, averaging distributions from each of 10 benchmarks.

This first set is for flat "bits of precision". It starts at 1 bit because we assume signed values. 12 reaches 100% because we're just measuring the used sizes in RISC-V code generation from CHERI ISA-v9, which only has a 12-bit immediate.

  1 2 3 4 5 6 7 8 9 10 11 12
incoffseti 0.14% 41.46% 47.54% 53.55% 62.63% 69.90% 72.67% 76.13% 87.26% 97.36% 98.02% 100.00%

And here are results with a "scale bit" that shifts by 4, which was the point that hit the highest number of cases. Presumably this is because it facilitates pointer-aligned arithmetic. Unsurprisingly, this matches choices in the Morello ISA.

  1 2 3 4 5 6 7 8 9 10 11 12
incoffseti 0.14% 40.06% 49.83% 66.03% 72.97% 85.50% 96.57% 97.22% 98.09% 98.29% 98.53% 100.00%

I don't know if this informs how effective adding an additional value, e.g. 512, to the spectrum. I can dig up the old scripts and add that condition in. There is at most 10% on the table (the jump from 9 bits to 10 bits; we would get a single value from the 10-bit space, but probably the most common one).

jonwoodruff avatar Oct 11 '24 01:10 jonwoodruff

Just to clarify, it looks like this is the data for incoffset? I would imagine the values used for setbounds are somewhat different.

arichardson avatar Oct 11 '24 20:10 arichardson

https://github.com/riscv/riscv-cheri/pull/418 is merged, so can we close this now?

tariqkurd-repo avatar Oct 14 '24 12:10 tariqkurd-repo

I agree the immediate issue is fixed, but maybe we should keep this open as a future enhancement to allow for larger values?

arichardson avatar Oct 14 '24 13:10 arichardson

After some out-of-band discussion today, I'd like to add "YBNDSW with rs2=0" as further redundant encoding space. (And "YBNDSW with rs1=0", too, but for different reasons than the encoding of static known lengths as discussed herein.)

nwf avatar Sep 19 '25 17:09 nwf

After some out-of-band discussion today, I'd like to add "YBNDSW with rs2=0" as further redundant encoding space. (And "YBNDSW with rs1=0", too, but for different reasons than the encoding of static known lengths as discussed herein.)

That is an interesting point - we could reuse that immediate and reuqire use of YBNDSW with x0 to create zero size capabilities. I wonder if we should reserve it for a future common value (e.g. 512/1k/2k/4k) based on some further benchmarking?

arichardson avatar Sep 19 '25 19:09 arichardson

@rmn30 (with a bit of editing on my part; brilliance his, mistakes mine) suggests

s ? ((uimm + 3) << 4) : (uimm + 1)

which gives us 1 through 32 (inclusive) and then multiples of 16 from 48 (inclusive) through 544 (inclusive). 0 is available as YBNDSW with rs2=0.

nwf avatar Sep 19 '25 23:09 nwf

@rmn30 (with a bit of editing on my part; brilliance his, mistakes mine) suggests

s ? ((uimm + 3) << 4) : (uimm + 1)

which gives us 1 through 32 (inclusive) and then multiples of 16 from 48 (inclusive) through 544 (inclusive). 0 is available as YBNDSW with rs2=0.

I was concerned about introducing an immediate that isn't literally encoded (or a shift of that literal), but it looks like zibi is now introducing one that is imm==0b11111?-1:imm+1, so having addition might be a good way to improve immediate utilization.

arichardson avatar Sep 19 '25 23:09 arichardson

Interesting - it' s a 5-bit immediate so the adders will be very small so I think that's probably ok. Can we show good utilisation on the bounds from SPEC which @jonwoodruff has above? Tha table above shows the number of bits - but I don't know if the actual values match the requested ones....... I would say that if we're going to change this then now is the time...... I've added the blocker tag to show that we need to resolve this before freezing the spec

tariqkurd-repo avatar Sep 22 '25 11:09 tariqkurd-repo

We've started doing some work here. We use csetbounds to materialise globals on CHERIoT so make heavy use of this instruction and have been trying to determine if we need the 11-bit-immediate csetboudns in Xcheriot2 or if we can use this, either as is or with some small tweaks.

davidchisnall avatar Sep 22 '25 11:09 davidchisnall

From a very small, quick study I got the best coverage just by using all integers from 0-64 (two instructions with 5-bit immediate + ybdsw with rs2=0). We need to do a bigger study to confirm and it's obviously specific to the CHERIoT ABI but suggests a quantitative approach is worth doing.

rmn30 avatar Sep 22 '25 13:09 rmn30

0-63? that certainly would be simpler. I think this is probably a sensitve area, as bounds setting is one of those things CHERI does which legacy RISC-V doesn't, so minimising the overhaed will help with performance results.

But it could be that this is all noise anyway and we just go for the simplest......?

tariqkurd-repo avatar Sep 22 '25 13:09 tariqkurd-repo

  1. I am pretty sure it's worth doing something to avoid 0 as something YBNDSWI (nee SCBNDSI) can express, because that's already available as YBNDSW with rs2=0.

  2. There's a large space of design options here, including encoding tricks within the R-type space, multiple R-type instructions, an additional I-type (like CHERIoTv1), ...

  3. The desirable set of values might correlate with the application domain (that is, CHERIoT and 64-bit Big-CHERI might have quite different distributions).

  4. Given all that, it would be worth someone over in the CL / CheriBSD part of the world doing a survey of a larger set of programs.

nwf avatar Sep 22 '25 14:09 nwf

The other thing to note: almost 100% of cases we've seen (we'll prod the compiler to try to make this more quantitative) of the cases are ones where the source and destination registers are (or can be) the same. This means that we could make it an 10-bit immediate with source and destination registers the same, in the same encoding space. If ARC would approve that, an 8-bit mantissa (biased by 1) and 2-bit exponent might cover all cases. That would probably eliminate the need for Xycheriot to do anything special.

davidchisnall avatar Sep 22 '25 14:09 davidchisnall

This is a great suggestion and we should ask ARC immediately. In general the register will either be modified in-place or be the result of a addiy instruction, so I imagine 90+% of cases use the same input and output register. It should be quite simple to write a script that looks at some large binaries and see how many csetbounds use the same input and output registers.

So to clarify, you suggestion would be imm = (imm[7:0] + 1) << imm[9:8] or would the shift value be biases in some way for non-zero? imm = (imm[7:0] + 1) << (imm[9:8] == 0 ? 0 : imm[9:8] + 3)

arichardson avatar Sep 22 '25 15:09 arichardson

So to clarify, you suggestion would be imm = (imm[7:0] + 1) << imm[9:8] or would the shift value be biases in some way for non-zero? imm = (imm[7:0] + 1) << (imm[9:8] == 0 ? 0 : imm[9:8] + 3)

I think the suggestion is to offer the values {1, 2, ..., 255, 256, 258, 230, ... 766, 768, 772, 776, ..., 1788, 1792, 1800, 1808, ..., 3832, 3840}. That's

((imm[7:0] + 257) << m[9:8]) - 256

(edit: longer equivalent formulae removed)

nwf avatar Sep 22 '25 16:09 nwf

It should be quite simple to write a script that looks at some large binaries and see how many csetbounds use the same input and output registers.

It's not quite that, because sometimes the destination is an ABI-mandated thing and the source is whatever the register allocator picked. @resistor is going to prod it to report when the source register is a kill, which is the case when we could have made them the same without introducing a move.

So to clarify, you suggestion would be imm = (imm[7:0] + 1) << imm[9:8]

That's my initial thought. Happy for someone to propose something better though. We're also going to do a bit more work on the distributions to make sure that this is vaguely sensible. @nwf points out that our encoding has a 9-bit mantissa and so our current 11-bit immediate or a simple 10-bit immediate can express things that the encoding cannot represent, which makes them inefficient ways of encoding a clear-tag instruction. A 1-bit exponent and a 9-bit mantissa would be aligned with the encoding, but I think (based on intuition that I want to replace with data) that the above formulation would cover more useful things (I have a strong suspicion that most know-at-compile-time things that are greater than 256 are at least multiples of two, but I'm not 100% sure).

Our ABI currently uses this instruction heavily in two places:

  • On-stack allocations, where we can (at the expense of a temporary register and 1-2 extra instructions) fall back to the register-based version.
  • Globals, where we do cincoffset on cgp (or the result of an AUICGP, which is almost always relaxed away by the linker) then a csetbounds.

As I recall, the CheriBSD ABI always uses the captable for address-taken globals. We probably should do that for cases where globals are referenced enough times that the auipcc + clc sequence, multiplied by the number of uses, plus the size of a capability, is less than the sequence to materialise them in place. We could also potentially move them to a second section that isn't reachable from gp for that, so that very large globals don't cause us to lose range in gp. But I don't think that impacts this instruction.

davidchisnall avatar Sep 22 '25 16:09 davidchisnall

I think the issue is whether it fits with an existing 32-bit instruction format, none have src/dsts because they all have rs1, rs2 and rd in fixed positions so I doubt this will fly 😢 the closest is I-type. We'll get resistance if we try to introduce a src/dst I'm sure.

Image

We can get a longer immediate by introducing a second encoding , that could be a compromise.

tariqkurd-repo avatar Sep 22 '25 16:09 tariqkurd-repo

I think the issue is whether it fits with an existing 32-bit instruction format, none have src/dsts because they all have rs1, rs2 and rd in fixed positions so I doubt this will fly 😢 the closest is I-type. We'll get resistance if we try to introduce a src/dst I'm sure.

Image We can get a longer immediate by introducing a second encoding , that could be a compromise.

Weren't we already "misusing" an R-type encoding and replacing rs2 with an immediate? Is replacing both rs2 and rs1 worse?

arichardson avatar Sep 22 '25 16:09 arichardson

Weren't we already "misusing" an R-type encoding and replacing rs2 with an immediate? Is replacing both rs2 and rs1 worse?

There's precedent for stealing one elsewhere in RISC-V, with the shift instructions (slli, srli, srai) and (at a lightly educated glance) in the Vector goo. There's also precedent for requiring rs1 = rd in the compression scheme. But I am unaware of precedent for doing both at once or otherwise stealing both rs1 and rs2 in nominally R-type instructions.

ETA: That said, I really like the idea of a (8+2) biased floating point scheme, if we can get it...

nwf avatar Sep 22 '25 16:09 nwf

xref to my proposed solution: https://github.com/riscv/riscv-cheri/issues/800#issuecomment-3322994024

tariqkurd-repo avatar Sep 23 '25 09:09 tariqkurd-repo

It's not quite that, because sometimes the destination is an ABI-mandated thing and the source is whatever the register allocator picked. @resistor is going to prod it to report when the source register is a kill, which is the case when we could have made them the same without introducing a move.

My initial analysis on the CHERIoT RTOS code base is that the source register is dead after csetbounds (imm) in 99.5% of the instances where it is emitted by the compiler. All instances where we emit it in the linker almost meet that criterion, so the final (static) percentage will be higher than that.

I would also note that toolchains are generally pretty capable at optimizing around dst = src constraints (due to needing to support x86), so I would not expect the compiler to have trouble on the remaining 0.5%.

resistor avatar Sep 23 '25 10:09 resistor

It's not quite that, because sometimes the destination is an ABI-mandated thing and the source is whatever the register allocator picked. @resistor is going to prod it to report when the source register is a kill, which is the case when we could have made them the same without introducing a move.

My initial analysis on the CHERIoT RTOS code base is that the source register is dead after csetbounds (imm) in 99.5% of the instances where it is emitted by the compiler. All instances where we emit it in the linker almost meet that criterion, so the final (static) percentage will be higher than that.

I would also note that toolchains are generally pretty capable at optimizing around dst = src constraints (due to needing to support x86), so I would not expect the compiler to have trouble on the remaining 0.5%.

Thanks for this data, that is great! I'll send an email to ARC and ask for their opinion on this.

arichardson avatar Sep 23 '25 16:09 arichardson

https://github.com/riscv/riscv-cheri/pull/809

tariqkurd-repo avatar Sep 25 '25 10:09 tariqkurd-repo