riscv-isa-manual

Temporal locality HINTs

Open aswaterman opened this issue 4 years ago • 36 comments

It's sometimes desirable to express that a load or store has poor temporal locality, so that the microarchitecture can optimize its use of the memory hierarchy accordingly. Temporal locality hints are primarily of interest to the vector extension, but this mini-proposal defines them in such a way that they can apply to scalar memory accesses, too. Remote AMOs for contended synchronization variables are one obvious application of the scalar form.

Limited encoding space makes it impractical to add a locality hint field to the various loads and stores. If we went that route, we'd either need to use 48-bit instructions or greatly restrict the addressing modes (e.g., register-indirect only, and, for vectors, unit-stride only). So the proposal is to immediately precede the load or store with a HINT instruction, which essentially acts as a fusable prefix.

For now, I'll call it ntlh, for "non-temporal locality hint". It takes no arguments.

We should provide a 16-bit variant to reduce impact on code size: perhaps c.nop 1. I'm not sure whether a 32-bit form is really necessary. If it is, we probably don't want to use the natural expansion of c.nop 1, which is addi x0, x0, 1, because we'd like to preserve those I-type HINTs for things that use the whole 12-bit immediate. An alternative would be to choose c.mv x0, x3 and its natural expansion add x0, x0, x3. (x3/gp is almost never written so spurious interlocks won't be a problem.)
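
To make the candidate encodings concrete, here is a minimal sketch that builds the four bit patterns mentioned above from the standard CI, CR, I-type, and R-type field layouts. The helper names are made up for illustration; only the field positions come from the base ISA and the C extension.

```python
# Sketch: compute the candidate HINT encodings discussed above.
# Helper names are hypothetical; the field layouts are the standard ones.

def encode_ci(funct3, imm6, rd, op):
    """CI format (16-bit): funct3[15:13] imm[5][12] rd[11:7] imm[4:0][6:2] op[1:0]."""
    return (funct3 << 13) | (((imm6 >> 5) & 1) << 12) | (rd << 7) | ((imm6 & 0x1F) << 2) | op

def encode_cr(funct4, rd, rs2, op):
    """CR format (16-bit): funct4[15:12] rd[11:7] rs2[6:2] op[1:0]."""
    return (funct4 << 12) | (rd << 7) | (rs2 << 2) | op

def encode_i(imm12, rs1, funct3, rd, opcode):
    """I-type (32-bit): imm[31:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-type (32-bit): funct7 rs2 rs1 funct3 rd opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

C_NOP_1 = encode_ci(0b000, 1, 0, 0b01)                  # c.nop 1 (c.addi x0, 1)
ADDI_X0_X0_1 = encode_i(1, 0, 0b000, 0, 0b0010011)      # addi x0, x0, 1
C_MV_X0_X3 = encode_cr(0b1000, 0, 3, 0b10)              # c.mv x0, x3
ADD_X0_X0_X3 = encode_r(0, 3, 0, 0b000, 0, 0b0110011)   # add x0, x0, x3

if __name__ == "__main__":
    for name, value, width in [
        ("c.nop 1", C_NOP_1, 4),
        ("addi x0, x0, 1", ADDI_X0_X0_1, 8),
        ("c.mv x0, x3", C_MV_X0_X3, 4),
        ("add x0, x0, x3", ADD_X0_X0_X3, 8),
    ]:
        print(f"{name:16s} 0x{value:0{width}x}")
```

All four write x0, which is what makes them HINTs rather than operations with architectural effect.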

aswaterman avatar Aug 07 '20 02:08 aswaterman

This is a great idea, especially given its ready applicability to all load/store instructions present and future. And this is a well established idea, i.e. of having non-temporal load/store instructions in an ISA.

To me a 16-bit hint variant would be sufficient. Low-end embedded designs will often be using the C extension. Equally, Linux-class processors will be using the C extension, since RV64GC is the architectural baseline.

gfavor avatar Aug 07 '20 02:08 gfavor

does this mean every non-temporal load/store will need a prefix? if so it really should be 16-bit, and maybe in the future we can fuse the two into a single 48-bit instruction.

tariqkurd-repo avatar Aug 07 '20 09:08 tariqkurd-repo

@tariqkurd-repo Yeah, there's no debate as to whether we should supply a 16-bit variant; the only question is whether a 32-bit prefix should be provided in addition so that the feature can be used without the C extension. I tend to agree with @gfavor that we can skip the 32-bit prefix.

aswaterman avatar Aug 07 '20 09:08 aswaterman

I too would lean toward miserly allocation of hints. An implementation that requires a 32 bit variant has numerous options.

  • It can use 32 bit custom space. 32bit only machines are envisioned for custom experimentation after all.
  • It can use any 32 bit hint custom or reserved. This would be specific to the implementation after all.
  • It could make an exception for a pair of ntlh to be considered the 32 bit variant. This is somewhat at odds with a design using 32bit only code to simplify the decoder, however the alignment matters are still eliminated.

The big advantage to 16bit only is simplifying the software stack support. Only a single construct is needed.

The big disadvantage to 16bit only is the complication of the current software stack for C extension support which supports this as a separate compression pass. I suggest the first stage provides a pair of ntlh hints, which the compression pass reduces to a single one if it wants to. (the pair of hints would still work). I would expect the inclusion of the hint would be a compile option that the ILEN32 only target would avoid. There could be a clean up pass to convert the pair to a standard 32bit nop for ILEN32.
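
The pair-then-collapse idea can be sketched roughly as follows. This is only an illustration of the suggestion, not how any real toolchain pass is written: instructions are modelled as `(size_in_bytes, encoding)` tuples, and `NTLH16` is an assumed encoding (the `c.mv x0, x3` candidate from the proposal), since no encoding has been settled here.

```python
# Sketch of a "compression pass" that reduces a pair of identical 16-bit
# ntlh hints to a single one. NTLH16 is an assumed, not an assigned, encoding.
NTLH16 = 0x800E       # assumption: c.mv x0, x3
LW_A0_A1 = 0x0005A503  # lw a0, 0(a1), used only as an example hinted load

def collapse_ntlh_pairs(instrs):
    """Drop an ntlh that immediately follows another ntlh.

    instrs is a list of (size_in_bytes, encoding) tuples.
    """
    out = []
    for size, enc in instrs:
        is_hint = size == 2 and enc == NTLH16
        if is_hint and out and out[-1] == (2, NTLH16):
            continue  # second hint of a pair: redundant, remove it
        out.append((size, enc))
    return out

def total_bytes(instrs):
    return sum(size for size, _ in instrs)

# Front end emits the 32-bit-wide "pair" form, followed by the load.
uncompressed = [(2, NTLH16), (2, NTLH16), (4, LW_A0_A1)]
compressed = collapse_ntlh_pairs(uncompressed)

if __name__ == "__main__":
    print(total_bytes(uncompressed), "->", total_bytes(compressed))
```

If the pass is skipped, the uncompressed form still has a hint directly in front of the load, which is the property the suggestion relies on.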

David-Horner avatar Aug 07 '20 11:08 David-Horner

"32bit only machines are envisioned for custom experimentation after all." This is news to me.

allenjbaum avatar Aug 07 '20 18:08 allenjbaum

On Fri, Aug 7, 2020, 14:27 Allen Baum, [email protected] wrote:

"32bit only machines are envisioned for custom experimentation after all." This is news to me.

Somewhat tongue-in-cheek, but... The early idea was that research, especially in universities and undergrad courses, would use the 32-bit base without the C extension to facilitate chip design unencumbered by instruction halfword alignment and parcels. Of course other applications are possible; however, the early design envisioned C and reserved the opcode space for it, intentionally restricting immediate size among other aspects to allow for it. The evolution of lui and auipc in particular is indicative of the prior commitment to the C extension and the accommodations made to support it in practice.

David-Horner avatar Aug 07 '20 19:08 David-Horner

On Fri, Aug 7, 2020 at 4:48 AM David-Horner [email protected] wrote:

The big disadvantage to 16bit only is the complication of the current software stack for C extension support which supports this as a separate compression pass.

Is the current software stack not able to understand instructions that happen to be 16-bit in length (before the compression pass)? This instruction in any case is not going to be generated by a compiler, but presumably is introduced into the code via assembly code or use of an intrinsic. The compiler itself doesn't care about the instruction length and, with or without use of the compression pass, the assembler and linker should happily handle this 16-bit instruction mixed in with all the other 32-bit instructions. (Or am I missing something?)

Although maybe that is the wrong way to view the matter. It seems like the toolchain needs to view the "hint; ld/st" instruction pair as a single pseudo-instruction (for lack of a better term) that later (at assembly/link time) is expanded into a 48-bit "instruction" or into a pair of 16b+32b instructions (the same end result in any case), so that the hint and ld/st parts of this thing are kept together. Or alternatively the "hint; ld/st" instruction pair is understood by the compiler as a potentially fusable pair of instructions that are kept together through all the compiler's optimization passes.

Put differently, imagine there is no C extension and we're simply adding this one 16-bit instruction to the ISA. What in the software stack would have a problem with supporting that?

Greg

gfavor avatar Aug 07 '20 19:08 gfavor

Re: [riscv/riscv-isa-manual] Temporal locality HINTs (#556)

On 2020-08-07 3:26 p.m., gfavor wrote:

On Fri, Aug 7, 2020 at 4:48 AM David-Horner [email protected] wrote:

The big disadvantage to 16bit only is the complication of the current software stack for C extension support which supports this as a separate compression pass.

Is the current software stack not able to understand instructions that happen to be 16-bit in length (before the compression pass)?

correct.

This instruction in any case is not going to be generated by a compiler, but presumably is introduced into the code via assembly code or use of an intrinsic.

These too currently only know ILEN=32.

The compiler itself doesn't care about the instruction length

The simplification of 4 bytes per instruction, aligned on 32-bit boundaries, is significant.

and, with or without use of the compression pass, the assembler and linker should happily handle this 16-bit instruction mixed in with all the other 32-bit instructions.

Perhaps they would be overjoyed, but currently the blissful state of C awareness is not (significantly) coded.

(Or am I missing something?)

Because the C extension has been designed so well that each of its instructions corresponds to a single base RVI instruction, there has been no need for the front end to be C aware. This allows rapid implementation of the toolchain, with substantial effectiveness of the compressed code achieved via replacement in a near-final pass. Up until now the pressure to do better has not been sufficient to make the early components C aware, although there has been much desire to do so. Alex can speak to this much better than I.

Although is that maybe the wrong way to view the matter. It seems like the toolchain needs to view the "hint; ld/st" instruction pair as a single pseudo-instruction (for lack of a better term) that later (at assembly/link time) is expanded into a 48-bit "instruction" or into a pair of 16b+32b instructions (the same end result in any case) - so that the hint and ld/st parts of this thing are kept together.

Yes. There is a need to keep the prefix and its partner in order and together. Multiple approaches are possible to do this.

Or alternatively the "hint; ld/st" instruction pair are understood by the compiler as a potentially fuseable pair of instructions that are kept together through all the compiler's optimization passes.

It is entirely possible for the compiler/assembler to refer to a pseudo-instruction that is this instruction pair. From my cursory reading of LLVM components this is not uncommon.

Put differently, imagine there is no C extension and we're simply adding this one 16-bit instruction to the ISA. What in the software stack would have a problem with supporting that?

The support is not currently there, and as I understand it, it would be a big internal change for both LLVM and gcc (and thus for other home growers).

Greg

David-Horner avatar Aug 07 '20 19:08 David-Horner

On Fri, Aug 7, 2020 at 12:54 PM David-Horner [email protected] wrote:

It is entirely possible for the compiler/assembler to refer to a pseudo-instruction that is this instruction-pair. From my cursory reading of LLVM components this is not uncommon.

Thanks for the explanation. Is the preceding suggesting that a pseudo-instruction approach would avoid all the complications that you mentioned wrt the current software stack, i.e. that it happens late enough in the overall flow?

If so, then that would enable us to only define a 16-bit 'ntlh' instruction. Otherwise it sounds like it will be very desirable (to say the least) to have a 32-bit version of the instruction as well.

Greg

gfavor avatar Aug 07 '20 20:08 gfavor

The support is not currently there, and as I understand it, it would be a big internal change for both LLVM and gcc (and thus for other home growers).

I am extremely skeptical of this argument since both llvm-mc and gas are multitarget assemblers which support many variable-width instruction sets, e.g. x86, Thumb, s390x, microMIPS, POWER (VLE but also prefixed instructions are handled as single 8-byte instructions by LLVM), etc. Can someone actually familiar with the code confirm or deny?

sorear avatar Aug 07 '20 20:08 sorear

On Fri, Aug 7, 2020, 16:04 gfavor, [email protected] wrote:

On Fri, Aug 7, 2020 at 12:54 PM David-Horner [email protected] wrote:

It is entirely possible for the compiler/assembler to refer to a pseudo-instruction that is this instruction-pair. From my cursory reading of LLVM components this is not uncommon.

Thanks for the explanation. Is the preceding suggesting that a pseudo-instruction approach would avoid all the complications that you mentioned wrt the current software stack, i.e. that it happens late enough in the overall flow?

No. Even then, the current assumption is that all opcodes are 32 bits.

If so, then that would enable us to only define a 16-bit 'ntlh' instruction. Otherwise it sounds like it will be very desirable (to say the least) to have a 32-bit version of the instruction as well.

Except we can simulate the 32-bit opcode by providing the encoding of two successive identical 16-bit codes.

Greg

David-Horner avatar Aug 07 '20 20:08 David-Horner

I am confident Alex Bradbury can. I have been following his LLVM Weekly updates. I am particularly interested in the C aware efforts and I would be surprised if I missed any significant advances. But it is certainly possible.

On Fri, Aug 7, 2020, 16:12 sorear, [email protected] wrote:

The support is not currently there, and as I understand it, it would be a big internal change for both LLVM and gcc (and thus for other home growers).

I am extremely skeptical of this argument since both llvm-mc and gas are multitarget assemblers which support many variable-width instruction sets, e.g. x86, Thumb, s390x, microMIPS, POWER (VLE but also prefixed instructions are handled as single 8-byte instructions by LLVM), etc. Can someone actually familiar with the code confirm or deny?

David-Horner avatar Aug 07 '20 20:08 David-Horner

On Fri, Aug 7, 2020 at 1:14 PM David-Horner [email protected] wrote:

If so, then that would enable us to only define a 16-bit 'ntlh' instruction. Otherwise it sounds like it will be very desirable (to say the least) to have a 32-bit version of the instruction as well.

Except we can simulate the 32-bit opcode by providing the encoding of two successive identical 16-bit codes.

If I understand correctly, then there would be a 32-bit ntlh "instruction" (encoded as you described) that the toolchain understands all the way down to the compression pass - which then eliminates one of the 16-bit ntlh's and leaves the final 48-bit instruction pair in place. Along the way the compiler would need to understand to keep these two 32-bit "instructions" together. And if the compression pass was bypassed for whatever reason, the 64-bit end result would still execute fine (i.e. the first ntlh would effectively be ignored).

Yes?

Greg

gfavor avatar Aug 07 '20 20:08 gfavor

On 2020-08-07 4:29 p.m., gfavor wrote:

On Fri, Aug 7, 2020 at 1:14 PM David-Horner [email protected] wrote:

If so, then that would enable us to only define a 16-bit 'ntlh' instruction. Otherwise it sounds like it will be very desirable (to say the least) to have a 32-bit version of the instruction as well.

Except we can simulate the 32-bit opcode by providing the encoding of two successive identical 16-bit codes.

If I understand correctly, then there would be a 32-bit ntlh "instruction" (encoded as you described) that the toolchain understands all the way down to the compression pass - which then eliminates one of the 16-bit ntlh's and leaves the final 48-bit instruction pair in place. Along the way the compiler would need to understand to keep these two 32-bit "instructions" together. And if the compression pass was bypassed for whatever reason, the 64-bit end result would still execute fine (i.e. the first ntlh would effectively be ignored).

Yes? correct.

Greg

While we wait on confirmation this strongly implies what I believe to be true in LLVM at least:

[RISCV] Compress instructions based on function features

Jan 24, 2020.

When running under LTO, it is common to not specify the architecture spec, which is used for setting up the target machine, and instead rely on features specified in each function to generate the correct instructions.

This works for the code generator, but the RISC-V backend uses the AsmPrinter to do instruction compression, which does not see these features but instead uses a MCSubtargetInfo object to see whether compression is enabled. Since this is configured based on the TargetMachine at startup, it will result in compressed instructions not being emitted when it has not been given the 'c' TargetFeature, but the function has it.

This changes the RISCVAsmPrinter to re-initialize the STI feature set based on the current MachineFunction, such that compressed instructions are now correctly emitted regardless of the method used to enable them.

David-Horner avatar Aug 07 '20 20:08 David-Horner

On Fri, Aug 7, 2020 at 1:12 PM sorear [email protected] wrote:

The support is not currently there, and as I understand it, it would be a big internal change for both LLVM and gcc (and thus for other home growers).

I am extremely skeptical of this argument since both llvm-mc and gas are multitarget assemblers which support many variable-width instruction sets, e.g. x86, Thumb, s390x, microMIPS, POWER (VLE but also prefixed instructions are handled as single 8-byte instructions by LLVM), etc. Can someone actually familiar with the code confirm or deny?

I don't know about llvm-as, but for GNU as it just maps a string to an encoding. There is no need for a 16-bit instruction to have a 32-bit equivalent. The only issue would be if the operands don't fit, then we can't back off and use a 32-bit instruction instead, but have to emit an error. But we already do exactly the same thing if you use a compressed opcode like c.add. If you use the opcode add, then we emit a compressed instruction if possible, and a 32-bit instruction otherwise. If you use c.add, then you get an error if the compressed instruction is not possible. It is the user's responsibility to get it right if they use c.add.
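
The behaviour described above can be modelled with a toy sketch. This is not GNU as code and deliberately ignores details such as operand commutativity; the function names and the `AsmError` type are invented for illustration. It only shows the difference between a mnemonic that may fall back to 32 bits and one that may not.

```python
# Toy model (not GNU as source) of the add / c.add behaviour described above.

class AsmError(Exception):
    """Raised when an explicitly compressed mnemonic cannot be encoded."""

def fits_c_add(rd, rs1, rs2):
    # c.add rd, rs2 computes rd = rd + rs2 and needs rd != x0 and rs2 != x0.
    return rd == rs1 and rd != 0 and rs2 != 0

def assemble_add(rd, rs1, rs2, rvc=True):
    """'add': use the 16-bit form when possible, else fall back to 32 bits."""
    if rvc and fits_c_add(rd, rs1, rs2):
        return ("c.add", 2)
    return ("add", 4)

def assemble_c_add(rd, rs2, rvc=True):
    """'c.add': no fallback, so an unencodable request is an error."""
    if not rvc or not fits_c_add(rd, rd, rs2):
        raise AsmError(f"c.add x{rd}, x{rs2} cannot be encoded")
    return ("c.add", 2)

if __name__ == "__main__":
    print(assemble_add(10, 10, 11))  # compressible
    print(assemble_add(10, 11, 12))  # needs the 32-bit form
    try:
        assemble_c_add(10, 0)
    except AsmError as err:
        print("error:", err)
```

A 16-bit-only hint would sit on the `assemble_c_add` side of this split: there is nothing to fall back to, and with no operands there is nothing that can fail to fit either.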

Jim

jim-wilson avatar Aug 07 '20 20:08 jim-wilson

I just added support for a 16-bit instruction with no 32-bit counterpart to LLVM, with full support in the assembler (llvm-mc), disassembler (llvm-objdump), and inline asm in clang. It took me 11 minutes, including build time and the time I spent figuring out the arguments to llvm-mc which are apparently very different from gas. I have never modified the LLVM assembler before.

diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoC.td b/llvm/lib/Target/RISCV/RISCVInstrInfoC.td
index f687678..5c382b6 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoC.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoC.td
@@ -576,6 +576,12 @@ def C_UNIMP : RVInst16<(outs), (ins), "c.unimp", "", [], InstFormatOther>,
   let Inst{15-0} = 0;
 }
 
+let hasSideEffects = 1, mayLoad = 0, mayStore = 0 in
+def C_ASDFTEST : RVInst16<(outs), (ins), "c.asdftest", "", [], InstFormatOther>,
+              Sched<[]> {
+  let Inst{15-0} = 4;
+}
+
 } // Predicates = [HasStdExtC]
 
 //===----------------------------------------------------------------------===//

sorear avatar Aug 07 '20 20:08 sorear

Once upon a time, it was true that GAS didn’t have explicitly compressed instructions, but as Jim points out, it’s not at all a limitation anymore. LLVM can also definitely be made OK with this.

The decision to include a 32-bit version of the hint really only needs to be made on the basis of supporting the hint in systems without RVC.

aswaterman avatar Aug 07 '20 21:08 aswaterman

I think we should have a general rule that when supporting extensions that nominally have 16-bit or 48-bit instructions, if C is not also enabled they must be padded to a 4-byte boundary with a c.nop. This came up years ago in the context of the L extension, although I forget which forum.
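
A rough sketch of that padding rule, under the assumption that instructions are modelled as `(size_in_bytes, encoding)` tuples and that padding is placed after the odd-sized instruction. The example hint encoding is an assumption; `0x0001` is the standard `c.nop` encoding. Where the padding should go relative to a prefix hint is a separate question that this sketch does not settle.

```python
# Sketch of "pad to a 4-byte boundary with c.nop" for targets without C.
C_NOP = 0x0001           # standard c.nop encoding
EXAMPLE_HINT16 = 0x800E  # assumption: some 16-bit instruction with no 32-bit form
LW_A0_A1 = 0x0005A503    # lw a0, 0(a1)

def pad_to_word(instrs):
    """Insert a c.nop after any instruction that leaves the stream misaligned."""
    out = []
    offset = 0
    for size, enc in instrs:
        out.append((size, enc))
        offset += size
        if offset % 4 != 0:
            out.append((2, C_NOP))
            offset += 2
    return out

def start_offsets(instrs):
    """Byte offset at which each instruction begins."""
    offsets, offset = [], 0
    for size, _ in instrs:
        offsets.append(offset)
        offset += size
    return offsets

padded = pad_to_word([(2, EXAMPLE_HINT16), (4, LW_A0_A1)])

if __name__ == "__main__":
    for off, (size, enc) in zip(start_offsets(padded), padded):
        print(f"+{off}: {size} bytes 0x{enc:x}")
```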

sorear avatar Aug 07 '20 21:08 sorear

@sorear Excellent.

Do you know if LLVM pads a nop to ensure 32bit alignment for the other RISCV instructions? Is it before or after c.asdftest? For c.ntlh after would be problematic, unless the nop were removed during the C compression pass.

The ability to place such a 16-bit instruction is of considerable benefit to the proposed C enhancements for multiple-register stores/pushes and loads/pops; see the list discussion about custom instructions to reduce code size.

David-Horner avatar Aug 07 '20 21:08 David-Horner

No, it just leaves the instruction stream unaligned. That would have to be added, but it's a logically separate feature and only needed to support a narrow class of cores.

sorear avatar Aug 07 '20 21:08 sorear

All 16-bit instructions should have a 32-bit equivalent to support IALIGN=32. If it can fit/be justified as a 16-bit encoding, it is easy to fit/justify the 32-bit encoding. This avoids having to later design an IALIGN=32 equivalent (and having to repeat the above conversation on every 16b instruction addition). Toolchain support is not the concern; rather, there are multiple reasons not to abandon/fragment IALIGN=32 designs.

kasanovic avatar Aug 08 '20 20:08 kasanovic

If a hinted load or store traps, where does xepc point?

Does it point at the hint, forcing modification to all code which emulates loads and stores for alignment, virtual MMIO, or other reasons? If the H extension is also implemented, what value is stored in xtinst?

Or does it point at the load/store proper, causing the load/store to be treated as temporal when restarting after a page fault? If data-modifying cache maintenance operations are also implemented, can this have a different effect observable beyond timing?

sorear avatar Aug 09 '20 07:08 sorear

On 2020-08-09 3:25 a.m., sorear wrote:

If a hinted load or store traps, where does xepc point?

Does it point at the hint,

no.

forcing modification to all code which emulates loads and stores for alignment, virtual MMIO, or other reasons?

it is desirable to avoid this, and fortunately pointing at the "hinted" instruction can avoid all manner of complication.

If the H extension is also implemented, what value is stored in xtinst?

It should be the same as without the hint.

Or does it point at the load/store proper,

yes.

causing

(it need not cause, the Uarch is the determinant of what happens)

the load/store to be treated as temporal when restarting after a page fault?

maybe.

It is allowed for the implementation to check the hint-prefix on return from the interrupt.

It is also allowed for an implementation to ignore the hint at any time, or to provide equivalent hint behaviour even when the hint instruction is not present: for example, on return from an exception, or when the Uarch has determined a pattern of behaviour that conflicts with the explicit hint or suggests that the hint behaviour is warranted even in the absence of an explicit hint.

If data-modifying cache maintenance operations are also implemented,

executed during the exception handling? (I talk to two readings of "are ... implemented" below)

can this have a different effect observable beyond timing?

The hint could change timing and cache behaviour sufficiently for observable effects on concurrent processes' access to the same data/cacheline/page. This is precisely the intent of the hint.

Depending upon the "data-modifying cache maintenance operations" executed in an exception handler, anything is possible. It is incumbent upon the exception handler to avoid any activity that disrupts the architecturally visible "normal" execution of the interrupted process.

Some disruption on an exception will occur, but the behaviour that a hint invokes cannot be beyond what is allowed for an implementation in the absence of an explicit hint.

Stated another way, an implementation may wrongly allow certain behaviour but a hint is not justification for such.

David-Horner avatar Aug 09 '20 08:08 David-Horner

Chill. The HINT is just a hint. It’s architecturally equivalent to executing a nop. Timing channels could’ve been effected through other means.

aswaterman avatar Aug 09 '20 10:08 aswaterman

@aswaterman re chill. I see @sorear's questions as valid and useful for clarification, as the intent of this hint is to act as an instruction prefix. It thus conjures the x86 prefix-byte menagerie with its unintended consequences and complications. It is good to fully understand the implications. Without raising plausible and challenging alternatives, clarity and deep understanding rarely occur. Scrutiny of the intended effect is necessary to ensure a hint is just a hint.

David-Horner avatar Aug 09 '20 12:08 David-Horner

I'll take a stab at arguing for the acceptability of the "compromise" approach, i.e. trapping just the hinted instruction.

If hinted instructions are trapped commonly, then performance is going to really, really suck due to the trap cost. At which point ignoring the hint on return is not going to materially/qualitatively change the matter.

Conversely, if such traps are rare enough to be acceptable performance-wise, then the impact of the compromise should also be acceptable.

Greg

gfavor avatar Aug 09 '20 15:08 gfavor

None of my questions were directed at David-Horner and I still think they are important questions for the privileged architecture and CMO people. We specified caches to be 100% software transparent in the originally frozen specs for a reason…

sorear avatar Aug 09 '20 18:08 sorear

A question regarding the hint: Does the hint apply to the next Load op, or to the next sequential instruction?

If it is really important, then the handler could return to xEPC-2 or -4 instead of xEPC

-Allen

allenjbaum avatar Aug 10 '20 02:08 allenjbaum

There is no way for a handler to distinguish between a load/store that happens to be prefixed and a load/store that is the first instruction in a function and follows rodata that has the same bit pattern as a prefix, or perhaps a U-type instruction the last 16 bits of which happen to match the prefix. We hit this already with trying to use prefixes to distinguish semihosting ebreaks from hosted debugging ebreaks.
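
A small sketch of the U-type case, assuming for illustration that the prefix is the 16-bit pattern 0x800E (the `c.mv x0, x3` candidate; nothing has been assigned). It builds two little-endian byte streams, one with a real prefix before the load and one where a `lui` merely ends in the same halfword, and shows that peeking at the halfword before the trapping pc gives the same answer in both.

```python
# Sketch: looking backwards from xepc cannot tell a prefix from the tail of
# an unrelated 32-bit instruction. PREFIX is an assumed encoding.
import struct

PREFIX = 0x800E          # assumption: candidate 16-bit hint encoding
LW_A0_A1 = 0x0005A503    # lw a0, 0(a1)
LUI_A0 = 0x800E0537      # lui a0, 0x800e0 (upper halfword equals PREFIX)

def halfword_before(mem, pc):
    """What a handler would read at pc - 2 (little-endian)."""
    return struct.unpack_from("<H", mem, pc - 2)[0]

# Case 1: hinted load. The load starts at offset 2.
hinted = struct.pack("<H", PREFIX) + struct.pack("<I", LW_A0_A1)
hinted_pc = 2

# Case 2: unhinted load preceded by a lui. The load starts at offset 4.
unhinted = struct.pack("<I", LUI_A0) + struct.pack("<I", LW_A0_A1)
unhinted_pc = 4

if __name__ == "__main__":
    print(hex(halfword_before(hinted, hinted_pc)))
    print(hex(halfword_before(unhinted, unhinted_pc)))
```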

sorear avatar Aug 10 '20 02:08 sorear

Even if it were possible, it wouldn't be worth making the ISR go out of the way to return to the HINT instead of the load/store. One additional cache miss following an exception is not a big deal. The HINT is useful even if it only works in the common case.

The code-generation recommendation will be to place the HINT immediately before the load or store, so that it can be fused. Under those constraints, it wouldn't matter much if it were defined to affect the next instruction only or the eventual next memory-access instruction. In fact, different machines could legitimately employ either strategy, since there's no architecturally visible effect.
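
To illustrate why the two definitions coincide under the recommended code generation, here is a toy model of both policies over a list of mnemonics. The mnemonic strings and function names are invented for the sketch; the only point is which memory accesses each policy would treat as non-temporal.

```python
# Toy model of two possible hint scopes. Names are illustrative only.
MEM_OPS = {"lw", "sw", "ld", "sd"}

def hinted_by_next_instruction(stream):
    """Hint affects only the instruction immediately after it."""
    return [i for i, op in enumerate(stream)
            if op in MEM_OPS and i > 0 and stream[i - 1] == "ntlh"]

def hinted_by_next_memory_access(stream):
    """Hint stays pending until the next memory-access instruction."""
    marked, pending = [], False
    for i, op in enumerate(stream):
        if op == "ntlh":
            pending = True
        elif op in MEM_OPS:
            if pending:
                marked.append(i)
            pending = False
    return marked

# Recommended code generation: hint immediately before the access.
recommended = ["ntlh", "lw", "add", "sw"]
# Not recommended: something in between the hint and the access.
separated = ["ntlh", "add", "lw"]

if __name__ == "__main__":
    print(hinted_by_next_instruction(recommended), hinted_by_next_memory_access(recommended))
    print(hinted_by_next_instruction(separated), hinted_by_next_memory_access(separated))
```

The policies only diverge on the separated stream, and since the hint has no architecturally visible effect, either outcome is legal there.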

aswaterman avatar Aug 10 '20 02:08 aswaterman