
Specialize JALR Instruction

jserv opened this issue 1 year ago • 5 comments

The JALR (Jump and Link Register) instruction updates the program counter to a target address by adding a sign-extended 12-bit immediate to the value of the rs1 register and saves the address of the next instruction (PC+4) in the destination register. This instruction is versatile, supporting call and return operations, as well as register-indirect branching (which may require additional decoding for branch prediction).

Function calls are typically made using a JAL instruction to a label or a JALR instruction through a register rs1. Specifically, these instructions take the form JAL ra, label or JALR ra, rs1, imm, though we may sometimes use the shorthand JAL label or JALR rs1 when imm is 0. The JAL instruction stores PC + 4 in ra, setting up the return address for after the function call, and sets the PC to the label's address. JALR operates similarly but sets the PC to rs1 + imm.
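As a rough sketch of these semantics, here is how an interpreter might implement JALR in C (the cpu_t layout and exec_jalr name are illustrative, not rv32emu's actual code); note that the specification also requires the lowest bit of the computed target to be cleared:

#include <stdint.h>

/* Illustrative register file: x[0] is hard-wired to zero. */
typedef struct {
    uint32_t x[32];
    uint32_t pc;
} cpu_t;

/* JALR: rd <- PC + 4; PC <- (x[rs1] + sign-extended imm) with bit 0 cleared */
static void exec_jalr(cpu_t *cpu, uint8_t rd, uint8_t rs1, int32_t imm)
{
    uint32_t link = cpu->pc + 4;
    uint32_t target = (cpu->x[rs1] + (uint32_t) imm) & ~1u;
    if (rd != 0)            /* writes to x0 are discarded */
        cpu->x[rd] = link;
    cpu->pc = target;
}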

To enhance JALR instruction simulation, we can consider specializing its usage scenarios. This could involve converting it into a push/pop pair or using it in specialized prologue/epilogue sequences for more efficient simulation.
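One possible direction, sketched below purely as an illustration (classify_jalr and the enum are hypothetical names, not rv32emu APIs), is to recognize the standard call and return idioms at decode time so they can be dispatched to dedicated handlers:

#include <stdint.h>

/* Hypothetical classification of JALR idioms for specialized dispatch. */
typedef enum { JALR_GENERIC, JALR_INDIRECT_CALL, JALR_RETURN } jalr_kind_t;

/* Classify a decoded JALR by its operands; x1 is the ABI link register ra. */
static jalr_kind_t classify_jalr(uint8_t rd, uint8_t rs1, int32_t imm)
{
    if (rd == 0 && rs1 == 1 && imm == 0)
        return JALR_RETURN;         /* jalr x0, ra, 0  -- the ret idiom   */
    if (rd == 1 && imm == 0)
        return JALR_INDIRECT_CALL;  /* jalr ra, rs1, 0 -- indirect call   */
    return JALR_GENERIC;
}

A recognized return could then, for instance, be paired with a pop from a small return-address structure instead of taking the generic indirect-jump path.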


jserv · Feb 25 '24 09:02

In most operations, a 12-bit immediate is standard for offsets. This applies to immediate arithmetic and logical operations (e.g., addi, andi, slti), memory-access instructions (e.g., lb, lwu), and branch offsets (e.g., beq, bge), as well as the jump-and-link-register mechanism. However, there are exceptions: the lui instruction loads the upper 20 bits of an immediate, forming 32-bit values in conjunction with addi. Similarly, the auipc instruction facilitates PC-relative addressing by adding an upper 20-bit immediate to the PC. The jal instruction differs by accepting a 21-bit offset, while jalr uses a 12-bit offset; together they cover the common call and jump patterns. It is important to note that these instructions sign-extend the immediate, which makes loading small negative values possible with a single instruction, but it must be accounted for when building values larger than the immediate width whose bit 11 is set.
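For example, forming an arbitrary 32-bit constant with lui + addi must compensate for that sign extension. A minimal C sketch of the split (split_const32 is an illustrative helper, not part of any toolchain or of rv32emu):

#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit constant into lui/addi immediates, compensating for the
 * sign extension of the 12-bit addi immediate. */
static void split_const32(uint32_t value, uint32_t *hi20, int32_t *lo12)
{
    /* Adding 0x800 rounds the upper part up when bit 11 is set, so that
     * (hi20 << 12) plus the sign-extended lo12 reproduces the value. */
    *hi20 = (value + 0x800) >> 12;
    *lo12 = (int32_t) (value & 0xfff);
    if (*lo12 >= 0x800)
        *lo12 -= 0x1000;    /* the 12-bit immediate is sign-extended */
}

int main(void)
{
    uint32_t hi;
    int32_t lo;
    split_const32(0xdeadbeef, &hi, &lo);
    /* On RV32, lui/addi arithmetic wraps modulo 2^32, so masking hi to
     * 20 bits keeps the reconstruction exact for every input value. */
    printf("lui t0, 0x%x ; addi t0, t0, %d  -> 0x%08x\n",
           hi & 0xfffff, lo, (hi << 12) + (uint32_t) lo);
    return 0;
}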

jserv · Mar 01 '24 14:03

If a jump to a constant target is compiled into an indirect jump by the toolchain, the jalr instruction can be specialized into auipc + jal. Based on my observation, there are some indirect jumps in the Dhrystone benchmark with highly frequent jump targets. However, after examining the object file, these turn out to be ret instructions, which would break when other callers invoke the function if we translated them into auipc + jal.

Moreover, I used a second approach to verify this: we can specialize a jalr instruction as auipc + jal if the value in its rs1 register is constant. Because we already perform constant folding and propagation, we can check for this situation after constant optimization. However, after running several benchmarks, I could not find a jalr instruction with a constant value in its rs1 register.

In conclusion, I think the scenarios where the jalr instruction can be translated into auipc + jal are very rare in our benchmarks.

qwe661234 · Mar 02 '24 14:03

> In conclusion, I think the scenarios where the jalr instruction can be translated into auipc + jal are very rare in our benchmarks.

The ELF files located in the 'build' directory have been compiled with aggressive optimizations, which may result in linker relaxation for more efficient RISC-V instructions. To observe indirect jump patterns more clearly, we could consider recompiling some benchmarks without these optimizations.

jserv · Mar 02 '24 15:03

Let's look at a basic example of C code that accesses an array element:

#include <stddef.h>  /* for size_t */

int readidx(int *p, size_t idx)
{
    return p[idx];
}

This code snippet demonstrates a routine task of fetching an array element. When we compile this for the x86-64 architecture, the generated assembly might look quite straightforward:

mov eax, [rdi+rsi*4]
ret

On the other hand, for a RISC-V architecture, the assembly code appears a bit more involved due to its simpler instruction set:

slli a1, a1, 2
add a0, a0, a1
lw a0, 0(a0)
jalr x0, x1, 0    # return (the ret pseudo-instruction)

RISC-V's simpler instruction set can make the initial decoding by the CPU's front end easier, though it may mean more instructions need to be executed. If the jalr instruction above could be specialized as a ret instruction, it would likewise make execution in a virtual machine more direct.
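One way an emulator could exploit a recognized ret, sketched here only as an illustration (this is not rv32emu's actual mechanism), is to pair recognized calls and returns with a small return-address stack so the likely return target is available without a general lookup:

#include <stdint.h>
#include <stddef.h>

/* Illustrative return-address stack: calls push the return PC, and a
 * specialized ret pops it and checks it against the actual target. */
#define RAS_DEPTH 16

typedef struct {
    uint32_t entry[RAS_DEPTH];
    size_t top;
} ras_t;

static void ras_push(ras_t *ras, uint32_t return_pc)
{
    ras->entry[ras->top % RAS_DEPTH] = return_pc;
    ras->top++;
}

/* Returns nonzero when the predicted return address matches the real target. */
static int ras_pop_matches(ras_t *ras, uint32_t actual_target)
{
    if (ras->top == 0)
        return 0;
    ras->top--;
    return ras->entry[ras->top % RAS_DEPTH] == actual_target;
}

On a recognized call the emulator would push the return PC (and possibly a pointer to its translated block); on a recognized ret it pops the entry and, if the prediction matches, jumps directly instead of taking the generic indirect-jump path.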

jserv · Mar 24 '24 15:03

In Arm architecture, the bl instruction (branch and link) records the return address in the link register (lr), a concept believed to be inherited from early RISC designs that emphasized orthogonality in instructions. According to this principle, only load/store operations interact with memory, while all other operations modify the internal CPU state.

While some architectures designate a specific link register, most RISC Instruction Set Architectures (ISAs) treat the link register as a general-purpose register (GPR), allowing branch-and-link instructions to target any register. This flexibility means branch-and-link instructions require additional bits (typically 5 more for architectures with 32 registers) to specify the target register, consuming valuable encoding space.

The necessity of the branch-and-link mechanism was debated in the early stages of RISC design. It was argued that compilers could replicate its functionality with a short sequence of instructions, for example, materializing the return address from the program counter (PC) into a register such as lr and then branching to the destination. With PC-relative addressing, a general-purpose branch-and-link could in principle be emulated this way, freeing encoding space for other purposes. This was particularly plausible on Arm, where the PC is an accessible register whose value reads a fixed number of bytes ahead of the current instruction.

The decision to integrate branch-and-link into a single instruction was influenced by several factors:

  1. Subroutine calls are frequent operations, which makes a dedicated instruction for them worthwhile despite general RISC skepticism toward specialized instructions.
  2. A dedicated branch-and-link instruction alters no register other than the link register.
  3. Accessing the PC doesn't require an additional register read, as any instruction can reference the PC.

Many ISAs, including Arm, adopted a branch-and-link instruction that utilizes a predefined link register, streamlining the process for the majority of use cases. The architecture assumes that scenarios requiring an alternative destination for the subroutine call can be handled with a more explicit sequence of instructions, utilizing both an add and a branch command. This consensus reflects a broader trend in ISA design to balance the need for specialized instructions with the goal of efficient encoding and execution of common operations.

Breaking with this tradition, RISC-V made a different choice by allocating 5 bits within its JAL / JALR encodings to select the destination register, which in practice signifies a binary condition: whether the return address is saved to the link register or discarded. This decision results in JAL and JALR (with J and JR being their respective aliases that discard the result by writing it to x0) using up 16 times more encoding space than necessary. The inefficiency is particularly problematic because these instructions already require substantial encoding space for their large immediates. JAL, for example, occupies an entire major opcode, even though a fraction of that space would have delivered essentially the same code density. Consequently, this excessive allocation limits the encoding space available for more diverse addressing modes.

jserv · Mar 24 '24 22:03

To improve indirect jumps, we experimented with three different implementations:

  1. Make no improvement: jump directly back to the interpreter.
  2. The current implementation: the interpreter records the targets of the indirect jump and identifies the most frequent one for T1C. The T1C-generated code for the indirect jump compares against that selected target and takes the jump if they are equal (see the sketch after this list).
  3. At the end of the indirect jump in the T1C-generated code, invoke a helper function that searches a hash map. This map records the machine-code location corresponding to each program counter value, and if an entry exists, the jump is directed there.
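As a rough C sketch of what the check in approach 2 amounts to (the type and function names are illustrative, not the actual T1C emitter):

#include <stdint.h>
#include <stddef.h>

typedef struct block block_t;   /* translated-code block (opaque here) */

typedef struct {
    uint32_t hot_pc;     /* most frequent indirect-jump target recorded */
    block_t *hot_block;  /* its T1C-generated code                      */
} indirect_site_t;

static block_t *dispatch_indirect(const indirect_site_t *site,
                                  uint32_t target_pc)
{
    if (target_pc == site->hot_pc)
        return site->hot_block;  /* predicted hit: stay in generated code */
    return NULL;                 /* miss: fall back to the interpreter    */
}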

Based on the performance analysis, the current implementation suits us best. Unlike dynamic binary translation in QEMU, our T1C only generates code for potential hotspots, so a hash-map lookup is more likely to miss, as not all blocks are translated by T1C.


qwe661234 · Apr 16 '24 06:04