riscv-cfi PLT stubs with landing pad

Is there a suggested PLT stub sequence? I assume we need a landing pad and need to set up the label in the PLT stub, perhaps something like this:

    /* Landing pad.  */
    lpcll
    lpcml
    lpcul
1:  auipc   t3, %pcrel_hi(function@.got.plt)
    l[w|d]  t3, %pcrel_lo(1b)(t3)
    /* Set up landing pad register.  */
    lpsll
    lpsml
    lpsul
    jalr    t1, t3
    /* Padding to a multiple of 16 bytes.  */
    nop
    nop
    nop

Also, because the jump in the PLT stub (jalr t1, t3) is treated as an indirect call, must every function with external visibility in a DSO/executable have a landing pad?

Jun 05 '23 02:06 kito-cheng

The PLT function may be invoked directly or indirectly. If invoked directly then the linker should link to label "1". Rest of sequence looks right. The dynamic loader should disable forward CFI for the process if the DSO/executable to be linked does not have the same CFI attributes.

Compilers should flag each object file (for example, using flags in the elf attributes) to indicate if the object file has been compiled with the Zicfisslp instructions. The linker should flag (for example, using flags in the elf attributes) the binary/executable generated by linking objects as being compiled with the Zicfisslp only if all the object files that are linked have the same Zicfisslp attributes.
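
The linker rule above can be sketched as a small merge function. This is only an illustration in Python: `ObjectFile` and `has_zicfisslp` are hypothetical names, not real ELF attribute definitions, and "the same attributes" is read here as "every input was compiled with Zicfisslp".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectFile:
    """Hypothetical stand-in for an input object and its CFI attribute."""
    name: str
    has_zicfisslp: bool


def output_has_zicfisslp(inputs: list[ObjectFile]) -> bool:
    """Flag the linked output only if every input object carries the attribute."""
    # An empty link contains nothing compiled with CFI, so it is not flagged.
    return bool(inputs) and all(obj.has_zicfisslp for obj in inputs)
```

A single object built without the extension is enough to leave the output unflagged, which is what lets the loader fall back safely later.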

The dynamic loader should enable the use of Zicfisslp extension for a process only if all executables (the application and the dependent dynamically linked libraries) used by that process have the same Zicfisslp attributes. When the use of the extension is not enabled for a process then the Zicfisslp instructions in that application or in the dynamically linked libraries invoked by that process revert to their Zimop/Zcmop defined behavior. This allows the program to functionally execute but without control-flow integrity.

A process that has the Zicfisslp extension enabled may request the dynamic loader at runtime to load a new dynamic shared object (using dlopen() for example). If the requested object does not have the Zicfisslp attribute then the dynamic loader, based on its policy configuration (e.g., established by the operating system or the administrator), either fails the request or disables the extension for the process. If the extension is disabled then the Zicfisslp instructions revert to their Zimop/Zcmop defined behavior and the program continues to functionally execute but without control-flow integrity.
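
The startup and dlopen() behavior described above could be modeled roughly as follows. This is a toy Python sketch, not loader code: `Policy`, `Process`, and the method names are made up for illustration.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical policy knob set by the OS or the administrator."""
    FAIL_REQUEST = "fail"
    DISABLE_CFI = "disable"


class Process:
    """Toy model of a process whose CFI state the dynamic loader manages."""

    def __init__(self, initial_objects_have_cfi: list[bool]) -> None:
        # Enable CFI only if every initially loaded object is CFI-compiled.
        self.cfi_enabled = all(initial_objects_have_cfi)

    def dlopen(self, object_has_cfi: bool, policy: Policy) -> bool:
        """Return True if the load succeeds; may disable CFI as a side effect."""
        if not self.cfi_enabled or object_has_cfi:
            return True
        if policy is Policy.FAIL_REQUEST:
            return False
        # Policy.DISABLE_CFI: keep running, but without control-flow integrity.
        self.cfi_enabled = False
        return True
```

With `FAIL_REQUEST` the process keeps its protection and the load is refused; with `DISABLE_CFI` the load succeeds and the process continues unprotected.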

Jun 05 '23 16:06 ved-rivos

Please also see issue #104 and PR #108. I will merge PR #108 shortly and regenerate the PDF.

Jun 05 '23 16:06 ved-rivos

One question from me: the dynamic loader (e.g. ld-linux-riscv64-lp64d.so.1) already lives in user space/U-mode, so shouldn't it lack the permission to enable/disable CFI?

Jun 06 '23 02:06 kito-cheng

Or maybe you mean the ELF loader in kernel space? That doesn't have the capability to resolve shared library dependencies, IIRC.

Jun 06 '23 02:06 kito-cheng

I think Ved meant that ld-linux-riscv64-lp64d.so.1 should be the one deciding whether to enable cfi for the entire process or not. It'll have to make that decision based on whether all libraries/binaries in the address space are compiled with cfi. If any binary/library is found without cfi, then based on policy ld can either terminate or disable cfi for the entire process.

It (ld) can enable/disable cfi by issuing a syscall (let's say a prctl; there is already one for x86 CET). Again, based on policy, such a decision can be made sticky, i.e. once enabled, it can't be disabled for that process.
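
To make the "sticky" idea concrete, here is a toy Python model of a per-process switch. It deliberately does not use any real prctl constants; `CfiControl` and its methods are hypothetical.

```python
class CfiControl:
    """Toy model of a per-process CFI switch with an optional sticky lock."""

    def __init__(self) -> None:
        self.enabled = False
        self.locked = False

    def enable(self, sticky: bool = False) -> None:
        self.enabled = True
        # Once locked, later disable requests are refused.
        self.locked = self.locked or sticky

    def disable(self) -> bool:
        """Return True if CFI was disabled, False if the lock refused it."""
        if self.locked:
            return False
        self.enabled = False
        return True
```

A loader following a strict policy would call `enable(sticky=True)` once all objects are verified, so that a later compromised component cannot turn the protection off.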

It's a complicated task for the kernel to manage and track whether all libraries in the address space are compiled with cfi. That's why deferring the decision to ld is probably the best way to go.

Jun 06 '23 17:06 deepak0414

Yes, I meant ld.so. Also, for added security, programs using CFI should be linked with read-only relocations (-z relro -z now).

Jun 06 '23 17:06 ved-rivos

@deepak0414 Thanks for the x86 info, it's really useful!

@ved-rivos I guess we can add a note to the psABI once zicfi is ready; I will keep this open until the PLT stubs are finalized.

Jun 12 '23 02:06 kito-cheng

With the updated landing pad encoding, the PLT stub should look as follows:

    /* Landing pad.  */
    lpad $label
1:  auipc   t3, %pcrel_hi(function@.got.plt)
    l[w|d]  t3, %pcrel_lo(1b)(t3)
    /* Set up landing pad register.  */
    lui     x7, $label
    jalr    t1, t3
    /* Padding to a multiple of 16 bytes.  */
    nop
    nop
    nop

Jun 19 '23 13:06 ved-rivos

lpad doesn't clobber x7 (which we should probably write as t2 for the sake of people wondering if the other instructions write to it), so there is no need to reload x7 before the jump and we can continue to use 16 byte PLT entries. We can also use lpad 0 in PLT entries with no loss of security, since x7 is always going to be checked after the jalr before any potentially attacker-controlled memory accesses, although if the linker knows the correct label for the external symbol and can generate a more precise lpad that's also fine and will result in slightly earlier error detection.
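
The reason `lpad 0` is always acceptable in a PLT entry can be seen from the label comparison itself. A minimal Python sketch, assuming the label is a 20-bit value compared against bits 31:12 of t2 (which is where `lui x7, $label` puts it); the function names are mine:

```python
def lui(imm20: int) -> int:
    """Value left in a register by `lui rd, imm20` (low 32 bits only)."""
    return (imm20 & 0xFFFFF) << 12


def lpad_passes(lpad_label: int, t2: int) -> bool:
    """Model the label comparison done by `lpad` after an indirect call.

    The check passes when the landing pad's label is 0, or when it equals
    bits 31:12 of t2 (x7).
    """
    if not 0 <= lpad_label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    return lpad_label == 0 or lpad_label == ((t2 >> 12) & 0xFFFFF)
```

Because the PLT entry leaves t2 untouched, the precise check is simply deferred to the landing pad of the function the entry jumps to.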

The lazy resolve header is more problematic (remember, x7 === t2):

 auipc  t2, %hi(.got.plt)
 sub    t1, t1, t3               # shifted .got.plt offset + hdr size + 12
 l[w|d] t3, %lo(.got.plt)(t2)    # _dl_runtime_resolve
 addi   t1, t1, -(hdr size + 12) # shifted .got.plt offset
 addi   t0, t2, %lo(.got.plt)    # &.got.plt
 srli   t1, t1, log2(16/PTRSIZE) # .got.plt offset
 l[w|d] t0, PTRSIZE(t0)          # link map
 jr     t3

t2 does not contain a useful value, and so the header can be rewritten as:

 lpad    0                        # need to fill in for any symbol
 auipc   t4, %hi(.got.plt)
 sub     t1, t1, t3               # shifted .got.plt offset + hdr size + 16
 l[w|d]  t3, %lo(.got.plt)(t4)    # _dl_runtime_resolve
 addi    t1, t1, -(hdr size + 16) # shifted .got.plt offset
 addi    t0, t4, %lo(.got.plt)    # &.got.plt
 srli    t1, t1, log2(16/PTRSIZE) # .got.plt offset
 l[w|d]  t0, PTRSIZE(t0)          # link map
 jr      t3 
 .balign 64
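
The offset arithmetic in the header comments can be checked numerically. The Python sketch below replays the sub/addi/srli sequence; it assumes 16-byte PLT entries, a 32-byte header for the original layout, and unresolved .got.plt slots that point at the start of the PLT header (so t3 equals the header address). The addresses and the function name are made up.

```python
def got_plt_offset(plt_base: int, index: int, ptrsize: int,
                   hdr_size: int = 32, ret_off: int = 12) -> int:
    """Replay the header's sub/addi/srli sequence for PLT entry `index`.

    `ret_off` is the distance from the start of a PLT entry to the
    instruction after its jalr (12 originally, 16 with a leading lpad).
    """
    entry = plt_base + hdr_size + 16 * index
    t1 = entry + ret_off                      # link value from `jalr t1, t3`
    t3 = plt_base                             # unresolved slot -> PLT header
    t1 = t1 - t3                              # sub  t1, t1, t3
    t1 = t1 - (hdr_size + ret_off)            # addi t1, t1, -(hdr size + off)
    shift = (16 // ptrsize).bit_length() - 1  # log2(16/PTRSIZE)
    return t1 >> shift                        # srli t1, t1, log2(16/PTRSIZE)
```

In each case the result is `index * ptrsize`, the byte offset of the symbol's slot relative to the first non-reserved .got.plt entry.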

However, rewriting the header this way is a problem because there are no normative constraints on _dl_runtime_resolve. glibc's _dl_runtime_resolve contains the following (incorrect) comment:

/* Assembler veneer called from the PLT header code for lazy loading.
   The PLT header passes its own args in t0-t2.  */

but glibc's _dl_runtime_resolve does not actually use t2, nor does FreeBSD's _rtld_bind_start, OpenBSD's _dl_bind_start, or NetBSD's _rtld_bind_start. musl and (as far as I can tell) bionic do not support lazy binding, and the other listed operating systems do not do dynamic linking at all, so I think this covers all dynamic linkers on the RISC-V Software Ecosystem spreadsheet.

The runtime resolver needs to be internally modified to save t2 on the stack and restore it before jumping to the resolved symbol; t2 will be checked at latest by the symbol's own landing pad after it is resolved. _dl_runtime_resolve itself is invoked with an unconstrained value in t2, which means that it must use lpad 0, which means that the t0 and t1 arguments should be treated as potentially attacker-controlled and validated before use.

Jun 25 '23 17:06 sorear

So was your thinking that calls to PLT functions would be treated specially and that the compiler would emit a lui x7, $label before even a direct call to PLT function? If so then yes the reload of the x7 would not be required since whether the PLT function was invoked directly or indirectly, when it makes the tail call to the linked function x7 would always have the right label.

Jun 25 '23 18:06 ved-rivos

Tricky. Ignore what I just said about reloading t2 in PLT entries (the analysis of the PLT header stands), it wasn't properly thought through.

There are, I think, two big ABI questions which need to be answered first:

1. Do functions have two entry points? If a function has a landing pad (essential for address-taken functions or functions with non-hidden visibility which might have their address taken in another shared object or by dlsym), then a direct call to the beginning of the function needs t2 set up, wasting space and time. We can save this by pointing direct calls after the landing pad, but that requires the static linker to know which functions have landing pads.

If functions have two entry points: CALL(_PLT), JAL, BRANCH, RVC_JUMP and RVC_BRANCH relocations get the address after the landing pad (this requires an ISA change: #125), all other relocations get the address of the landing pad. The linker inspects the text segment to determine the "address after the landing pad", skipping 4 bytes if the 4 bytes at the symbol are xxxx_x017. This is the preferred option due to the large code size reduction.

An alternative would allocate a bit in st_other and use that instead of inspecting the text; POWER has a similarly shaped feature where some functions have a "global entry point" which initializes r2 to the GOT and a "local entry point" which assumes r2 is already initialized, and allocates three bits from st_other to encode the distance between the two entry points. Due to the disadvantages of duplicating information and requiring one of the five unused st_other bits, I prefer the text inspection option.

If functions have one entry point: No linker changes are needed. All local calls require t2 set up. This adds an estimated 564 KiB to libQt5WebKit.so.5.212.0.

2. (ADDED) Do all non-STV_HIDDEN function symbols have a landing pad? I am very strongly leaning towards "yes"; any C function, C++ function, static method, template function, or operator can have its address taken from another module, which only leaves constructors and non-virtual destructors, and there aren't enough of those to matter. The only impact of a "yes" is code size for a small fraction of functions in non-C languages; a "no" answer would force a "yes" to the next question and additionally requires STO_RISCV_HAS_LP to be present at runtime.

3. Do control transfers from a PLT entry or -fno-plt pseudo-direct call pass through a subsequent landing pad? This is almost, but not quite, the same question as "do we enforce the use of -z now -z relro with forward-edge CFI": if .got.plt is potentially writable by an attacker, then pointers loaded from it must be validated, but if .got.plt is read-only at runtime then it can be trusted at least to the same level as a switch table, which we are already endorsing the use of software-guarded branches with.
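
For the text-inspection option under the first question, the linker's check could look roughly like this. A Python sketch, assuming little-endian 32-bit instruction words and that "xxxx_x017" means the low 12 bits of the word equal 0x017; the function name is made up.

```python
import struct


def direct_call_target(text: bytes, sym_offset: int) -> int:
    """Return the offset a direct call should target for a function symbol.

    If the 32-bit word at the symbol matches xxxx_x017 (low 12 bits are
    0x017), it is treated as a landing pad and skipped; otherwise the
    symbol offset is returned unchanged.
    """
    if sym_offset + 4 > len(text):
        return sym_offset  # too short to hold a landing pad
    (insn,) = struct.unpack_from("<I", text, sym_offset)
    return sym_offset + 4 if (insn & 0xFFF) == 0x017 else sym_offset
```

Relocations that take the function's address would keep using `sym_offset` itself, so only direct jumps and calls are affected.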

If PLT entries and -fno-plt calls pass through a subsequent landing pad, then question 2 is forced to "yes" (but that was the preferred option already), and the dynamic linker can remain ignorant of the fact that functions have two entry points. The label must be communicated to the static linker, perhaps as the addend of a new relocation type, which makes static libraries compiled with landing pads incompatible with old linkers. The runtime cost is two integer instructions, the t2 load is likely to be free due to placement in a delay slot while the landing pad itself can be optimized by microarchitectures in a variety of ways. I currently prefer this option but quite weakly; suggestions welcome.

If PLT entries and -fno-plt calls bypass the landing pad, then the dynamic linker must be aware of the mechanism in use for determining the length of landing pads, returning the symbol address (address of the landing pad) for dlsym() and R_RISCV_64 but the address after the landing pad for bypassing calls. We would need at least two new relocation types for this to work (64_NOLP and GOT_NOLP_HI20; we may not need JUMP_SLOT_NOLP because lazy binding that bypasses landing pads is too dangerous to support without additional mechanisms to make the GOT writable only by the dynamic linker).

Subject to the answers: if functions have two entry points (thus, the PLT entry can be entered with an invalid t2) and PLT entries pass through a subsequent landing pad (requiring t2), then t2 must be reloaded. If the PLT entry is subject to indirect calls (which can only happen in the main program), it must start with a landing pad; it may be useful to distinguish indirect-only PLT entries (start with lpad 0, no reload) from direct-only PLT entries (loads t2, no landing pad).
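
The dependency between the answers and the shape of a PLT entry can be written down as a tiny decision function. This Python sketch just restates the paragraph above; the parameter and key names are mine.

```python
def plt_entry_requirements(two_entry_points: bool,
                           passes_through_landing_pad: bool,
                           indirectly_callable: bool) -> dict[str, bool]:
    """Derive what a PLT entry needs from the ABI answers."""
    return {
        # Entered with a possibly invalid t2, yet the target checks t2.
        "reload_t2": two_entry_points and passes_through_landing_pad,
        # Indirect calls must land on a landing pad.
        "leading_lpad": indirectly_callable,
    }
```

The "indirect-only" and "direct-only" entry flavors mentioned above correspond to `leading_lpad` without `reload_t2`, and `reload_t2` without `leading_lpad`, respectively.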

Jun 25 '23 23:06 sorear

#126 obsoletes most of my last message. Since direct calls suppress t2 checking, there is no need, semantically, for double entry points, no need for STO_RISCV_HAS_LP, no need for attributes to define the interpretation of that flag. It is possible to handle all linking requirements using our existing semantics for relocations.
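
To illustrate why double entry points become unnecessary once direct calls suppress t2 checking, here is a toy Python model of the "landing pad expected" state. `Hart` and its methods are hypothetical, and the model ignores everything except this one state bit and t2.

```python
class Hart:
    """Toy model of the forward-edge 'landing pad expected' state."""

    def __init__(self) -> None:
        self.lp_expected = False
        self.t2 = 0

    def direct_call(self) -> None:
        # Direct jumps/calls do not arm the landing pad check.
        self.lp_expected = False

    def indirect_call(self) -> None:
        # Indirect calls arm the check for the next instruction.
        self.lp_expected = True

    def execute_lpad(self, label: int) -> bool:
        """Return True if execution continues, False if the check fails."""
        if not self.lp_expected:
            return True  # behaves as a no-op; t2 is not examined
        self.lp_expected = False
        return label == 0 or label == ((self.t2 >> 12) & 0xFFFFF)
```

A direct call can therefore target the landing pad itself with garbage in t2 and still proceed, so there is no need for a second entry point past the lpad.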

The needed "general" PLT stub, supporting lazy binding (with runtime CFI checks on the writable .got.plt) and both direct and indirect calls, is:

lpad   LABEL
auipc  t3, PCREL_HI20(JUMP_SLOT(symbol))
l[w|d] t3, t3, LO12(JUMP_SLOT(symbol))
lui    t2, LABEL
jalr   t4, t3

This assumes that the runtime resolver preserves the value in t2 and restores it before calling the resolved function. It also assumes that the label is available at static link time, either passed by the compiler or from the linker input files; if the label is not available, a split PLT entry must be created:

# indirect entry point - t2 is valid
lpad   0
auipc  t3, PCREL_HI20(JUMP_SLOT(symbol))
l[w|d] t3, t3, LO12(JUMP_SLOT(symbol))
jalr   t4, t3
# direct entry point
auipc  t3, PCREL_HI20(JUMP_SLOT(symbol))
l[w|d] t1, t3, LO12(JUMP_SLOT(symbol))
addi   t4, t3, LO12(JUMP_SLOT(symbol)) # needed for lazy only
jr     t1 # sw-guarded jump
# t4 in the PLT header will point into .plt if t2 is valid, .got.plt if not

Label checking for writable .got.plt is not possible in this case; -z relro -z now is strongly recommended. (If lazy resolution happens, the PLT header must communicate to the runtime resolver the need to bypass the label check, perhaps using a special value of t2.) The value of the undefined dynamic symbol points to the indirect entry point, which is also used for PCREL_HI20, 64, and 32 relocations; direct relocations all refer to the direct entry point.

If no relocations refer to the PLT indirect entry point, it may be omitted from the binary and the dynamic symbol value left as zero.

While not needed for correctness, I would like to leave the option open in the ABI for static and dynamic linkers to resolve relocations to symbol_value + 4 if the relocation type is a direct jump/call or JUMP_SLOT and the symbol is a function which begins with a landing pad. (I expect non-toy static linkers to use this option, musl not to, other dynamic linkers could go either way (likely not if execute-only memory is in use), and debuggers to need to cope with both.)

Jul 08 '23 07:07 sorear

A brain dump: we use t2 as the label register and lpad 0 skips the label value check, so I think the PLT entry can look like the one below, with an lpad inserted only at the beginning of the PLT entry. The challenge is that we need to fill in the right label value at static link time; that should be possible if we put more information into the ELF file.

    lpad <value> # t2/x7 is valid
1:  auipc   t3, %pcrel_hi(function@.got.plt)
    l[w|d]  t3, %pcrel_lo(1b)(t3)
    lui     t2, <value>
    jalr    t1, t3

The first PLT entry (the PLT header) then needs only a few modifications: replace t2 with t4 to avoid corrupting t2/x7, and put a landing pad with label 0 at the beginning:

    lpad   0  # disable label checking 
1:  auipc  t4, %pcrel_hi(.got.plt)  # Rewrite this to using 
    sub    t1, t1, t3               # shifted .got.plt offset + hdr size + 12
    l[w|d] t3, %pcrel_lo(1b)(t4)    # _dl_runtime_resolve
    addi   t1, t1, -(hdr size + 12) # shifted .got.plt offset
    addi   t0, t4, %pcrel_lo(1b)    # &.got.plt
    srli   t1, t1, log2(16/PTRSIZE) # .got.plt offset
    l[w|d] t0, PTRSIZE(t0)          # link map
    jr     t3

The last place is _dl_runtime_resolve: we also need to insert lpad 0 at the beginning of _dl_runtime_resolve to suppress the landing pad check, and save/restore t2 in the prologue and epilogue.

Update: I was thinking that adding one lpad would be enough, but the PLT stub actually still needs one more x7 setup instruction if we don't want to split/duplicate the PLT entry for one symbol.

Aug 18 '23 13:08 kito-cheng

@sorear

(ADDED) Do all non-STV_HIDDEN function symbols have a landing pad? I am very strongly leaning towards "yes"; any C function, C++ function, static method, template function, or operator can have its address taken from another module, which only leaves constructors and non-virtual destructors, and there aren't enough of those to matter. The only impact of a "yes" is code size for a small fraction of functions in non-C languages; a "no" answer would force a "yes" to the next question and additionally requires STO_RISCV_HAS_LP to be present at runtime.

I suspect we need to put a landing pad in STV_HIDDEN functions too, unless we can prove the symbol's address does not escape, i.e. the function address isn't stored anywhere.

Aug 18 '23 13:08 kito-cheng

It seems to me that the PLT entry that uses a single indirect jump instruction to jump to either the beginning of the .plt section or the desired function is fundamentally not very friendly to the concept of landing pad labelling, because the two destinations are completely different.

Have you considered a PLT entry like this?

    lpad    <the-same-label-as-the-function>
1:  auipc   t3, %pcrel_hi(<function@.got.plt>)
    l[w|d]  t3, %pcrel_lo(1b)(t3)
    beq     t3, zero, 2f # .got.plt entry is initialized to 0
    jr      t3
2:  mv      t3, ra    # save the original return register
3:  auipc   ra, %pcrel_hi(plt0) # call plt0 as if it were a direct function call
    jalr    %pcrel_lo(3b)(ra)

The above PLT entry jumps to the beginning of the .plt as a direct call, so we don't need a landing pad there. I believe we don't want to have it there because the PLT resolver has been abused as an interesting attack vector. With the above PLT, an attacker cannot jump to plt0 without going through a PLT entry, and they cannot jump to a PLT entry unless they set the correct label for the PLT entry.
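
The control flow of this proposed entry can be modeled in a few lines of Python. This is only a sketch: `resolve` stands in for plt0 plus the runtime resolver, and the addresses are invented.

```python
from typing import Callable


def plt_entry_dispatch(got_plt: list[int], slot: int,
                       resolve: Callable[[int], int]) -> int:
    """Return the address the PLT entry transfers control to.

    A zero slot means 'not yet resolved': the entry calls the resolver
    (standing in for plt0), which fills the slot. A nonzero slot is
    jumped to directly.
    """
    target = got_plt[slot]
    if target == 0:              # beq t3, zero, 2f
        target = resolve(slot)   # direct call to plt0
        got_plt[slot] = target
    return target                # jr t3
```

Only the first call through a given entry takes the slow path; later calls see a nonzero slot and jump straight to the function.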

Oct 03 '23 05:10 rui314

    lpad    <the-same-label-as-the-function>
1:  auipc   t3, %pcrel_hi(<function@.got.plt>)
    l[w|d]  t3, %pcrel_lo(1b)(t3)
    beq     t3, zero, 2f # .got.plt entry is initialized to 0
    jr      t3
2:  mv      t3, ra    # save the original return register
3:  auipc   ra, %pcrel_hi(plt0) # call plt0 as if it were a direct function call
    jalr    %pcrel_lo(3b)(ra)

I would like to avoid a branch in the PLT stub if possible; it may disturb the branch predictor and occupy more branch predictor entries.

Also, we need to reset t2 to <the-same-label-as-the-function>, because the caller might use a direct call, which means t2 is not set up before the call.

Oct 03 '23 08:10 kito-cheng

Actually transferring control to a PLT entry as an indirect function call is relatively rare; if a source file is compiled as PIC, a function pointer would be read from the GOT, and the value there is the address of the function itself (i.e. not its PLT entry), so the PLT is skipped unless it is called directly.

The only case in which a PLT entry is called as an indirect function call is when you compile a source file as PDE. In that case, the code assumes that all functions and data objects are within the main executable. If a function turned out to be an imported one, the linker uses its PLT entry in the main executable as an address of the function. Only in this case the PLT entry could be called as an indirect function.

That means PLT entries in a DSO or a PIE don't need lpad at all.
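
Put as a linker-side predicate, the argument above boils down to the output type. A minimal Python sketch with hypothetical names:

```python
from enum import Enum


class OutputKind(Enum):
    PDE = "position-dependent executable"
    PIE = "position-independent executable"
    DSO = "dynamic shared object"


def plt_entry_needs_lpad(kind: OutputKind) -> bool:
    """A PLT entry can be an indirect-call target only in a PDE, where the
    linker may use the entry's address as the address of an imported function."""
    return kind is OutputKind.PDE
```

For PIC code the function pointer comes from the GOT and points at the function itself, so the PLT entry is only ever reached by direct calls.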

Oct 03 '23 09:10 rui314

One idea is to require the compiler to always read a function address from GOT for an indirect function call even for PDE. We already require it for possibly undefined weak symbols as explained in https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-elf.adoc#medium-any-code-model, so it shouldn't be a foreign idea. With that, we can eliminate the leading lpad from each PLT entry.

Oct 03 '23 10:10 rui314

Further discussion is happening in https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/434. Closing this ticket to continue discussion in the psABI PR.

Jul 03 '24 09:07 ved-rivos
