yjit
yjit copied to clipboard
Avoid moving REG_SP (make it a base pointer)
This is an idea to improve how we handle stack addresses to make codegen simpler to write and possibly improve performance.
Currently, REG_SP
always has the same value of reg_cfp->sp
, but to avoid unnecessary register/memory reads/writes we store an sp_offset
and build our memory operands (ctx_sp_opnd
) using that offset. When we call into C code, call another method, or return to the interpreter, we call jit_save_sp
, which updates both REG_SP
and REG_CFP->sp
.
This causes some awkwardness in our codegen, because any memory operands we build using ctx_sp_opnd
become invalid when REG_SP
is updated (since they are relative to REG_SP
). This requires us to be careful about when these operands are created vs used.
For example in gen_opt_aref
// About to change REG_SP which these operands depend on. Yikes.
mov(cb, R8, recv_opnd);
mov(cb, R9, idx_opnd);
// Write sp to cfp->sp since rb_hash_aref might need to call #hash on the key
jit_save_sp(jit, ctx);
yjit_save_regs(cb);
mov(cb, C_ARG_REGS[0], R8);
mov(cb, C_ARG_REGS[1], R9);
I think it would be possible to avoid this issue by treating REG_SP
as a base pointer of the stack rather than a moving stack pointer (should we also rename it?). Basically this means that REG_SP
would not move throughout the iseq, it would keep the same value it had at the YJIT entry point. Updating REG_CFP->sp
would be done in the same way in the same places, however we'd now use a temporary register instead of updating REG_SP
(same as we do for jit_save_pc
).
In addition to making codegen simpler to write, I believe this will reduce the number of block versions.
Currently there is this comment at the end of gen_send_cfunc
:
// Note: gen_send_iseq() jumps to the next instruction with ctx->sp_offset == 0
// after the call, while this does not. This difference prevents
// the two call types from sharing the same successor.
Calling an iseq ends up with an sp_offset
of 0, but calling a cfunc ends up with an sp_offset
of 1 (we must push the return value from the cfunc ourselves). If we instead use the base SP rather than a moving SP, the sp_offset
s will be the same leaving both basic blocks, therefore have compatible ctx, and can jump into the same target.
Another potential advantage (less sure of this) is that by not moving the register we will avoid a data dependency in following stack operations, which may improve CPU pipelining by avoiding stalls.
What you wrote makes sense. The main question I have is: can we count on the __bp__
field in rb_control_frame_struct
always being present? Does MJIT currently make use of this?
Another potential detail is: I remember Alan talking about using the call-threaded handlers so we can fall back to the interpreter for a single instruction, and in that case, we might end up in a situation where we don't know what the SP is.
I think that, if we only use a BP, we don't care what the value of the SP is as much in YJIT. However, we probably would still need to always make sure that the SP is up to date whenever we exit JITted code. That's a bit of a bummer, because it means we probably would have to keep the sp_offset
and we may have to write both SP and BP values to stack frames at any call or side-exit boundaries.
@XrXr any comments?
Yeah I think this make sense. I feel confident relying on __bp__
since the interpreter checks it in leave
.
https://github.com/ruby/ruby/blob/5534698b84c1ef1567ebb1e2d79fbe1a2a573a77/insns.def#L902
We shouldn't need to change __bp__
cause it's supposed to stay constant for the frame. For one-shot fallback to the interpreter, I needed to know the SP delta for the right sp_offset
anyways, so that won't be problem.
A note for implementation, since sp_offset
is only 16 bits, we need to take care to test for large stacks.
can we count on the
__bp__
field inrb_control_frame_struct
always being present? Does MJIT currently make use of this
Currently it's always present, but I don't think we should rely on it. https://github.com/Shopify/yjit/blob/main/vm_insnhelper.c#L2172-L2200 suggests it's likely to be removed, but I believe we'll be able to calculate BP as SP - stack_size
.
I think that, if we only use a BP, we don't care what the value of the SP is as much in YJIT. However, we probably would still need to always make sure that the SP is up to date whenever we exit JITted code. That's a bit of a bummer, because it means we probably would have to keep the sp_offset and we may have to write both SP and BP values to stack frames at any call or side-exit boundaries.
We'll always have to write cfp->sp on side exits (as we already do today). I don't think we'll have to update __bp__
as it shouldn't move, but we may need to recalculate (or re-fetch it from cfp) if we re-enter the jitted code.
So we'd keep sp_offset
, compute the BP when entering JITted code, update the SP when leaving or doing a fallback, update BP on method calls.
I'm not sure what the impact on code size would be. I think it might remove some instructions in a few places, but also add a few more in other places since we'd be updating the SP still 🤔 May remove some data dependencies as you said.
BP == SP
on entry to a frame, so no calculation needed there. The interpreter is checking that when leaving, SP is the same as it were on entry to the frame. But we do need to make calculations or load from cfp
when we leave from output code to output code. Actually, I think this extra calculation/load is why we moved away from keeping the stack size to keeping an offset originally.
The dependency thing is probably not going to make much of a perf difference since the limiting factor on stalls is probably memory access around the relatively quick sp calculation. The main win I see for making this change is making it easier to write codegen / develop the new backend.