ChezScheme
Use GOTPCRELX-like optimization for x86-64 code objects
Currently, the x86-64 call sequence is rather involved. From c/fasl.c:
static void x86_64_set_jump(void *address, uptr item, IBOOL callp) {
  I64 disp = (I64)item - ((I64)address + 5); /* 5 = size of call instruction */
  if ((I32)disp == disp) {
    *(octet *)address = callp ? 0xE8 : 0xE9;  /* call or jmp disp32 opcode */
    *(I32 *)((uptr)address + 1) = (I32)disp;
    *((octet *)address + 5) = 0x90;  /* nop */
    *((octet *)address + 6) = 0x90;  /* nop */
    *((octet *)address + 7) = 0x90;  /* nop */
    *((octet *)address + 8) = 0x90;  /* nop */
    *((octet *)address + 9) = 0x90;  /* nop */
    *((octet *)address + 10) = 0x90; /* nop */
    *((octet *)address + 11) = 0x90; /* nop */
  } else {
    *(octet *)address = 0x48;        /* REX w/REX.w set */
    *((octet *)address + 1) = 0xB8;  /* MOV imm64 to RAX */
    *(uptr *)((uptr)address + 2) = item;
    *((octet *)address + 10) = 0xFF; /* call/jmp reg/mem opcode */
    *((octet *)address + 11) = callp ? 0xD0 : 0xE0; /* mod=11, ttt=010 (call) or 100 (jmp), r/m = 0 (RAX) */
  }
}
It is possible to use a six-byte CALL or JMP instruction sequence for both 32-bit and 64-bit addressing. If a 32-bit displacement is sufficient, 0x67 0xe8 (CALL rel32 with an addr32 prefix) or 0xe9 (JMP rel32, with a 1-byte NOP after the displacement) works. Otherwise, use 0xff 0x15 (CALL m64) or 0xff 0x25 (JMP m64) with a RIP-relative memory operand. This needs a relocation table at the end of the code object, but I think it already exists today. Code objects will be restricted to 2 GiB in size, but that's a limitation that exists today, I assume.
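To make the proposal concrete, here is a rough sketch of what a six-byte patching routine could look like. It is not taken from c/fasl.c: the name `set_jump6`, its parameters, and the idea of passing in a per-site 8-byte table `slot` are all assumptions made for illustration, and the slot is assumed to have been placed within reach of a 32-bit RIP-relative displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order (the x86-64 encoding). */
static void put_le32(uint8_t *p, int32_t v) {
    uint32_t u = (uint32_t)v;
    p[0] = (uint8_t)u;
    p[1] = (uint8_t)(u >> 8);
    p[2] = (uint8_t)(u >> 16);
    p[3] = (uint8_t)(u >> 24);
}

static int fits_i32(int64_t v) {
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Hypothetical helper: patch the 6-byte call/jump site at `site`.
   `slot` is an assumed 8-byte relocation-table entry reserved for this site.
   Returns 1 if the direct rel32 form was used, 0 if the indirect form was. */
static int set_jump6(uint8_t *site, uint64_t *slot, uint64_t target, int callp) {
    int64_t here = (int64_t)(intptr_t)site;
    /* rel32 is relative to the end of the branch instruction itself:
       site+6 for "0x67 0xE8 rel32", site+5 for "0xE9 rel32" followed by a NOP. */
    int64_t disp = (int64_t)target - (here + (callp ? 6 : 5));
    if (fits_i32(disp)) {
        if (callp) {
            site[0] = 0x67;               /* addr32 prefix, used as padding */
            site[1] = 0xE8;               /* CALL rel32 */
            put_le32(site + 2, (int32_t)disp);
        } else {
            site[0] = 0xE9;               /* JMP rel32 */
            put_le32(site + 1, (int32_t)disp);
            site[5] = 0x90;               /* 1-byte NOP after the displacement */
        }
        return 1;
    }
    /* Indirect form: the operand is a RIP-relative disp32, measured from the
       end of the 6-byte instruction to the table slot holding the target. */
    int64_t sdisp = (int64_t)(intptr_t)slot - (here + 6);
    assert(fits_i32(sdisp));              /* assumed: table is within +/-2 GiB */
    *slot = target;
    site[0] = 0xFF;
    site[1] = callp ? 0x15 : 0x25;        /* CALL m64 or JMP m64, [rip+disp32] */
    put_le32(site + 2, (int32_t)sdisp);
    return 0;
}
```

Both forms occupy exactly six bytes, so the choice between them can be made at load time without moving any surrounding code.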
This technique is described under Linker Optimization in the x86-64 psABI supplement.
In the interim, it would be possible to replace the seven one-byte NOPs in the short form with a single 7-byte NOP instruction. binutils 2.31 recommends 0x0f 0x1f 0x80 0x00 0x00 0x00 0x00. All x86-64 processors (going back to the original K8) support these long NOP instructions.
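As a minimal sketch of that interim change, the short-displacement branch could write the padding in one go. The helper name `set_short_jump_padded` and the `NOP7` table are hypothetical, not existing code, and the caller is assumed to have already checked that the displacement fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The 7-byte long NOP quoted above: 0f 1f 80 00 00 00 00. */
static const uint8_t NOP7[7] = {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00};

/* Hypothetical helper: fill a 12-byte site with a 5-byte call/jmp rel32
   followed by one 7-byte NOP instead of seven 0x90 bytes. */
static void set_short_jump_padded(uint8_t *site, int32_t disp, int callp) {
    uint32_t u = (uint32_t)disp;
    site[0] = callp ? 0xE8 : 0xE9;   /* call or jmp rel32 opcode */
    site[1] = (uint8_t)u;            /* displacement, little-endian */
    site[2] = (uint8_t)(u >> 8);
    site[3] = (uint8_t)(u >> 16);
    site[4] = (uint8_t)(u >> 24);
    memcpy(site + 5, NOP7, sizeof NOP7);
}
```

The site stays 12 bytes long, so this would not change the layout of code objects; it only reduces the number of instructions decoded after a call returns.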
There is not presently a relocation table at the end of the code object, but I'm open to adding one, and a 2GB code-object size limit doesn't bother me, as long as it's explicitly enforced. We'd have to verify that performance doesn't suffer. I worry that adding a memory reference (to the relocation table) might reduce performance, but it could also get faster due to the shorter code sequence taking up less room in the cache.
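For the explicit size check, something along these lines might be enough. This is only a sketch: `code_object_size_ok` and `MAX_CODE_OBJECT_BYTES` are made-up names, and it assumes the table sits directly after the code, so that no RIP-relative displacement from an instruction to a table entry can exceed the combined size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limit: the largest positive value a signed disp32 can hold. */
#define MAX_CODE_OBJECT_BYTES ((uint64_t)INT32_MAX)

/* Hypothetical check: returns 1 if the code plus its trailing table stays
   within disp32 reach, 0 otherwise. Written to avoid overflow in the sum. */
static int code_object_size_ok(uint64_t code_bytes, uint64_t table_bytes) {
    return code_bytes <= MAX_CODE_OBJECT_BYTES
        && table_bytes <= MAX_CODE_OBJECT_BYTES - code_bytes;
}
```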
The call-with-memory-reference approach is used for inter-module calls on the x86-64 psABI (in the PLT stubs). I expect that CPUs handle it very well because it is rather important for good performance. (It's very noticeable that Silvermont can't predict these calls properly if the other module is too far away in the address space.) If I recall correctly, one caveat is that older AMD CPUs have exclusive I and D caches, so optimal placement of the relocation table is tricky. I would have to ask around to figure out if this still affects current CPUs. This issue does not arise in the ELF context because there, the table is not colocated with the instructions.