Trace recording JIT
This issue tracks progress of changing the JIT from one that projects traces to one that records them.
Follow-ups:
- Embed ENTER_EXECUTOR's executor in the bytecode and on trace graphs?
- Do not increase chain depth when side-exiting a branch, as that is just normal control flow.
- Optimize through CALL_ALLOC_AND_ENTER_INIT
- Specialize: CALL_FUNCTION_EX, `__init__` of slots, SEND for `yield from`
- Loop peeling
Linked PRs
- gh-139111
- gh-140310
- gh-141573
- gh-141703
Just to flesh this out a bit:
Motivation
Currently we project traces from the initial hotspot, but there are a number of places in the projection where we would benefit from having live values as well as the statistical information gathered by the interpreter.
We can combine the information from live values with the statistical information to give us a trace that should, in theory, be better than from either pure trace recording or from projection.
Design
Confidence
As a trace gets longer, it is less likely that a future execution will remain on trace. Our estimate of that future likelihood is our "confidence" in the trace at that point. All traces start with confidence at 100%, but that confidence drops with each guard in the trace.
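A minimal sketch of that bookkeeping (not CPython's actual code; the 0.9 per-guard hit rate and 1/3 cutoff are invented constants): each guard multiplies the running confidence by an estimated hit rate, and we stop extending the trace once it drops below the cutoff.

```c
#include <stdio.h>

/* Sketch only, not CPython's code. Every guard multiplies the
 * confidence by an estimated hit rate; trace growth stops once it
 * falls below a cutoff. The constants are invented values. */
#define GUARD_HIT_RATE    0.9
#define CONFIDENCE_CUTOFF (1.0 / 3.0)

int main(void)
{
    double confidence = 1.0;  /* every trace starts at 100% */
    int guards = 0;
    while (confidence * GUARD_HIT_RATE >= CONFIDENCE_CUTOFF) {
        confidence *= GUARD_HIT_RATE;  /* one more guard on the trace */
        guards++;
    }
    printf("stop after %d guards (confidence %.3f)\n", guards, confidence);
    return 0;
}
```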
The trace recording interpreter
The trace recording interpreter operates exactly like the normal tier 1 interpreter, except that it also records each instruction it executes.
While recording a trace we should maintain a confidence level, but this estimate will not be as accurate as the one the optimizer can compute. Since the optimizer might be more confident, and we cannot optimize beyond the end of the recorded trace, the confidence limit for recording should be lower than for the final trace.
Possible implementations
We can:
1. Record the trace at the bytecode level, then lower it to micro-ops as a separate pass.
2. Lower as we go. This is more complicated, but may be faster as it saves recording the bytecode.

I think we should start with option 1; we can move to option 2 later without impacting the rest of the JIT. A sketch of the two-pass shape follows.
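All names below are hypothetical (`recorded_inst`, `lower_trace`, etc.), and the trivial one-uop-per-bytecode expansion stands in for the translation tables the projecting JIT already has; only the record-then-lower structure is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of option 1 with made-up names: pass 1 appends each
 * executed bytecode to a buffer; pass 2 lowers the whole buffer
 * to micro-ops in one sweep. */
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
} recorded_inst;

typedef struct {
    uint16_t uop;
    uint16_t oparg;
} uop_inst;

/* pass 1: called from each recording opcode as it executes */
static void
record_inst(recorded_inst *buf, size_t *n, uint16_t opcode, uint16_t oparg)
{
    buf[*n].opcode = opcode;
    buf[*n].oparg = oparg;
    (*n)++;
}

/* pass 2: lower the recorded bytecode to micro-ops */
static size_t
lower_trace(const recorded_inst *buf, size_t n, uop_inst *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        /* stand-in expansion: real code would emit the uop
         * sequence for buf[i].opcode from a table */
        out[m].uop = buf[i].opcode;
        out[m].oparg = buf[i].oparg;
        m++;
    }
    return m;
}
```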
To avoid doubling the size of the interpreter, each recording opcode implementation should call a common recording function then jump to the normal implementation of that opcode.
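As a toy, self-contained illustration of that shared-tail pattern (none of these names are CPython's): the recording case performs the common recording step, then `goto`s into the normal opcode body, so no opcode implementation is duplicated.

```c
#include <stdio.h>

/* Toy interpreter, illustration only. Recording variants share the
 * real opcode bodies via goto, so the interpreter doesn't double
 * in size. */
enum { OP_INC, OP_PRINT, OP_HALT, OP_RECORDING = 0x80 };

static void record_instruction(int op) {
    printf("recorded opcode %d\n", op & ~OP_RECORDING);
}

static void run(const int *code, int tracing)
{
    int acc = 0;
    for (const int *pc = code; ; pc++) {
        int op = *pc | (tracing ? OP_RECORDING : 0);
        switch (op) {
        case OP_INC | OP_RECORDING:
            record_instruction(op);  /* common recording step */
            goto target_inc;         /* then the normal body */
        case OP_INC:
        target_inc:
            acc++;
            break;
        case OP_PRINT | OP_RECORDING:
            record_instruction(op);
            goto target_print;
        case OP_PRINT:
        target_print:
            printf("acc = %d\n", acc);
            break;
        default:
            return;                  /* OP_HALT, recorded or not */
        }
    }
}

int main(void)
{
    const int prog[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
    run(prog, 1);  /* run with recording on */
    return 0;
}
```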
However, some specialized instructions can "deopt", executing the non-specialized instruction instead. This would result in a misleading trace, so we might want to somehow force re-specialization in that case. It is not obvious how to do that without duplicating all specialized instructions. We could force re-specialization of all instructions, but that might be slow.
@Fidget-Spinner
We can keep the switch interpreter working with the tracing JIT using the same technique as we did for sys.settrace tracing in 3.11:
- Change Tools/cases_generator/analyzer.py to renumber ENTER_EXECUTOR to 254, and lower the numbers of all the instrumented instructions by one, leaving 255 reserved. Call opcode 255 `INTERNAL_TRACING`.
- In `_PyEval_EvalFrameDefault` add a local variable `tracing`, which is set to 0 if tracing is off and 255 if tracing is on.
- The `INTERNAL_TRACING` instruction should jump to `record_previous_inst`, which should dispatch using the plain `opcode` without `tracing`. It will be unused for computed gotos and tailcalling.
- Dispatch is a bit more complicated. Instead of dispatching on `opcode`, dispatch on `dispatch_opcode`. Before the `goto dispatch`, the various `DISPATCH` macros need to set `dispatch_opcode` (see the sketch after this list):
  - Non-tracing dispatch: `dispatch_opcode = opcode`
  - Normal dispatch: `dispatch_opcode = opcode | tracing`
  - Tracing dispatch: `dispatch_opcode = 255`
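A minimal sketch of what those macros could look like, assuming locals named `opcode`, `tracing`, and `dispatch_opcode`; the real macro names and structure may differ. The trick is that `tracing` is either 0 or 255 and opcodes fit in a byte, so `opcode | tracing` is the plain opcode when tracing is off and 255 (`INTERNAL_TRACING`) when it is on.

```c
/* Sketch only; real DISPATCH macro names/structure may differ.
 * Since opcode <= 255, opcode | 255 == 255, so a single OR routes
 * every instruction to INTERNAL_TRACING while tracing is on. */
#define DISPATCH_NON_TRACING() \
    do { dispatch_opcode = opcode; goto dispatch; } while (0)

#define DISPATCH() \
    do { dispatch_opcode = opcode | tracing; goto dispatch; } while (0)

#define DISPATCH_TRACING() \
    do { dispatch_opcode = 255; goto dispatch; } while (0)
```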
Removing the 3.11 scheme was worth about a 1% speedup, so adding it back will slow the interpreter by at least 1%, as this is more complex than the 3.11 scheme. However, if we make the JIT faster, we should get that back and more.
Once https://github.com/python/cpython/pull/140310 is merged, there are a few more tasks to do:
- [x] Restore switch based interpreter
- [ ] Record values, so we can get rid of the ad-hoc caches for functions, classes and code objects.
- [ ] Document how it works
- [ ] Improve efficiency and thread safety by storing the buffer on the interpreter and swapping it to the thread when needed.
- [ ] Restore some sort of confidence estimate to tracing or optimization or both, to avoid overly long traces
- [ ] Restore the optimization for stack size checks and its tests
- [ ] Remove some code duplication by setting the previous instruction to `NOP` when entering tracing.
The only tests we need to restore are the stack size optimization checks; those are all the tests that are disabled for now.
Another action item: a prime-number table for the backoff counters. A prime warmup count is unlikely to sync up with a loop's internal period, so tracing should capture a "better", more representative iteration (a toy table is sketched below). Note that PyPy also uses a "strange" number for loop warmup.
https://github.com/python/cpython/issues/141498
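For illustration only (the values are made up, not a proposal), such a table might look like:

```c
#include <stdint.h>

/* Illustrative prime warmup thresholds for the backoff counters.
 * A prime period is unlikely to coincide with a loop's internal
 * cycle (e.g. a branch taken every 4th or 16th iteration), so the
 * iteration that triggers tracing is more likely to be typical. */
static const uint16_t warmup_thresholds[] = {
    2, 7, 17, 61, 251, 1021, 4093
};
```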
I found a serious perf regression on nbody. The problem is that `stop_tracing` adds a `_DEOPT` when it sees `ENTER_EXECUTOR`, which stops the traces from growing. I used `_EXIT_TRACE` previously, but that was lost when the JIT code was refactored.
The fix is just to pass in the opcode of the exit op. Will do that later; a sketch of the shape is below.
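Roughly, with simplified, hypothetical types (`trace_buffer`, `trace_append_uop`): the caller chooses the terminating micro-op, so ending at `ENTER_EXECUTOR` can use `_EXIT_TRACE`, which lets traces keep growing, instead of `_DEOPT`, which freezes them.

```c
/* Sketch only; types and signatures are simplified stand-ins. */
typedef struct trace_buffer trace_buffer;
void trace_append_uop(trace_buffer *trace, int uop);

static void
stop_tracing(trace_buffer *trace, int exit_uop)
{
    trace_append_uop(trace, exit_uop);  /* _EXIT_TRACE or _DEOPT */
}

/* At the ENTER_EXECUTOR site, end with a stitchable exit:
 *     stop_tracing(trace, _EXIT_TRACE);
 * At a genuine guard failure, keep the deopt:
 *     stop_tracing(trace, _DEOPT);
 */
```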
@Fidget-Spinner Is this closable now? Can we track bugs etc. in new issues? For example: https://github.com/python/cpython/issues/139109#issuecomment-3516990790?
Yeah, let's do follow-ups separately.
I'd rather keep this open, otherwise it gives a false impression of being finished when there's some clean up to do. Or, make new issues for the list above and then close this one.
> Restore the optimization for stack size checks and its tests
I think we should abandon this and just clean up the code implementing this optimisation. It's generally very finicky and I don't think it nets us much of a perf win.