Trace recording JIT
This issue tracks progress of changing the JIT from one that projects traces to one that records them.
Follow-ups:
- Embed ENTER_EXECUTOR's executor in the bytecode and on trace graphs?
- Do not increase chain depth when side-exiting a branch, as that is just normal control flow.
- Optimize through CALL_ALLOC_AND_ENTER_INIT
- Specialize: CALL_FUNCTION_EX, `__init__` of slots, SEND for `yield from`
- Loop peeling
Linked PRs
- gh-139111
- gh-140310
- gh-141573
- gh-141703
Just to flesh this out a bit:
Motivation
Currently we project traces from the initial hotspot, but there are a number of places in the projection where we would benefit from having live values as well as the statistical information gathered by the interpreter.
We can combine the information from live values with the statistical information to give us a trace that should, in theory, be better than from either pure trace recording or from projection.
Design
Confidence
As a trace gets longer, it is less likely that a future execution will remain on trace. Our estimate of that future likelihood is our "confidence" in the trace at that point. All traces start with confidence at 100%, but that confidence drops with each guard in the trace.
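A minimal sketch of that bookkeeping (not CPython's actual code; the 0.9 per-guard hit rate and 1/3 cutoff are invented constants): each guard multiplies the running confidence by an estimated hit rate, and we stop extending the trace once it drops below the cutoff.

```c
#include <stdio.h>

/* Sketch only, not CPython's code. Every guard multiplies the
 * confidence by an estimated hit rate; trace growth stops once it
 * falls below a cutoff. The constants are invented values. */
#define GUARD_HIT_RATE    0.9
#define CONFIDENCE_CUTOFF (1.0 / 3.0)

int main(void)
{
    double confidence = 1.0;  /* every trace starts at 100% */
    int guards = 0;
    while (confidence * GUARD_HIT_RATE >= CONFIDENCE_CUTOFF) {
        confidence *= GUARD_HIT_RATE;  /* one more guard on the trace */
        guards++;
    }
    printf("stop after %d guards (confidence %.3f)\n", guards, confidence);
    return 0;
}
```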
The trace recording interpreter
The trace recording interpreter operates exactly like the normal tier 1 interpreter, except that it also records each instruction it executes.
While recording a trace we should maintain a confidence level, but this estimate will not be as accurate as the one the optimizer can compute. Since the optimizer might be more confident, and we cannot optimize beyond the end of the recorded trace, the confidence limit for recording should be lower than for the final trace.
Possible implementations
We can:
1. Record the trace at the bytecode level, then lower it to micro-ops as a separate pass.
2. Lower as we go. This is more complicated, but may be faster as it saves recording the bytecode.

I think we should start with option 1; we can move to option 2 later without impacting the rest of the JIT. A sketch of the two-pass shape follows.
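All names below are hypothetical (`recorded_inst`, `lower_trace`, etc.), and the trivial one-uop-per-bytecode expansion stands in for the translation tables the projecting JIT already has; only the record-then-lower structure is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of option 1 with made-up names: pass 1 appends each
 * executed bytecode to a buffer; pass 2 lowers the whole buffer
 * to micro-ops in one sweep. */
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
} recorded_inst;

typedef struct {
    uint16_t uop;
    uint16_t oparg;
} uop_inst;

/* pass 1: called from each recording opcode as it executes */
static void
record_inst(recorded_inst *buf, size_t *n, uint16_t opcode, uint16_t oparg)
{
    buf[*n].opcode = opcode;
    buf[*n].oparg = oparg;
    (*n)++;
}

/* pass 2: lower the recorded bytecode to micro-ops */
static size_t
lower_trace(const recorded_inst *buf, size_t n, uop_inst *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        /* stand-in expansion: real code would emit the uop
         * sequence for buf[i].opcode from a table */
        out[m].uop = buf[i].opcode;
        out[m].oparg = buf[i].oparg;
        m++;
    }
    return m;
}
```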
To avoid doubling the size of the interpreter, each recording opcode implementation should call a common recording function then jump to the normal implementation of that opcode.
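As a toy, self-contained illustration of that shared-tail pattern (none of these names are CPython's): the recording case performs the common recording step, then `goto`s into the normal opcode body, so no opcode implementation is duplicated.

```c
#include <stdio.h>

/* Toy interpreter, illustration only. Recording variants share the
 * real opcode bodies via goto, so the interpreter doesn't double
 * in size. */
enum { OP_INC, OP_PRINT, OP_HALT, OP_RECORDING = 0x80 };

static void record_instruction(int op) {
    printf("recorded opcode %d\n", op & ~OP_RECORDING);
}

static void run(const int *code, int tracing)
{
    int acc = 0;
    for (const int *pc = code; ; pc++) {
        int op = *pc | (tracing ? OP_RECORDING : 0);
        switch (op) {
        case OP_INC | OP_RECORDING:
            record_instruction(op);  /* common recording step */
            goto target_inc;         /* then the normal body */
        case OP_INC:
        target_inc:
            acc++;
            break;
        case OP_PRINT | OP_RECORDING:
            record_instruction(op);
            goto target_print;
        case OP_PRINT:
        target_print:
            printf("acc = %d\n", acc);
            break;
        default:
            return;                  /* OP_HALT, recorded or not */
        }
    }
}

int main(void)
{
    const int prog[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
    run(prog, 1);  /* run with recording on */
    return 0;
}
```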
However, some specialized instructions can "deopt", executing the non-specialized instruction instead. This would result in a misleading trace, so we might want to somehow force re-specialization in that case. It is not obvious how to do that without duplicating all specialized instructions. We could force re-specialization of all instructions, but that might be slow.
@Fidget-Spinner
We can keep the switch interpreter working with the tracing JIT using the same technique as we did for sys.settrace tracing in 3.11:
- Change Tools/cases_generator/analyzer.py to renumber ENTER_EXECUTOR to 254, and lower the numbers of all the instrumented instructions by one, leaving 255 reserved. Call opcode 255 `INTERNAL_TRACING`.
- In `_PyEval_EvalFrameDefault` add a local variable `tracing`, which is set to 0 if tracing is off and 255 if tracing is on.
- The `INTERNAL_TRACING` instruction should jump to `record_previous_inst`, which should dispatch using the plain `opcode` without `tracing`. It will be unused for computed gotos and tailcalling.
- Dispatch is a bit more complicated. Instead of dispatching on `opcode`, dispatch on `dispatch_opcode`. Before the `goto dispatch`, the various `DISPATCH` macros need to set `dispatch_opcode` (see the sketch after this list):
  - Non-tracing dispatch: `dispatch_opcode = opcode`
  - Normal dispatch: `dispatch_opcode = opcode | tracing`
  - Tracing dispatch: `dispatch_opcode = 255`
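A minimal sketch of what those macros could look like, assuming locals named `opcode`, `tracing`, and `dispatch_opcode`; the real macro names and structure may differ. The trick is that `tracing` is either 0 or 255 and opcodes fit in a byte, so `opcode | tracing` is the plain opcode when tracing is off and 255 (`INTERNAL_TRACING`) when it is on.

```c
/* Sketch only; real DISPATCH macro names/structure may differ.
 * Since opcode <= 255, opcode | 255 == 255, so a single OR routes
 * every instruction to INTERNAL_TRACING while tracing is on. */
#define DISPATCH_NON_TRACING() \
    do { dispatch_opcode = opcode; goto dispatch; } while (0)

#define DISPATCH() \
    do { dispatch_opcode = opcode | tracing; goto dispatch; } while (0)

#define DISPATCH_TRACING() \
    do { dispatch_opcode = 255; goto dispatch; } while (0)
```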
Removing the 3.11 scheme was worth about a 1% speedup, so adding it back will slow the interpreter by at least 1%, as this is more complex than the 3.11 scheme. However, if we make the JIT faster, we should get that back and more.
Once https://github.com/python/cpython/pull/140310 is merged, there are a few more tasks to do:
- [x] Restore switch based interpreter
- [ ] Record values, so we can get rid of the ad-hoc caches for functions, classes and code objects.
- [ ] Document how it works
- [ ] Improve efficiency and thread safety by storing the buffer on the interpreter and swapping it to the thread when needed.
- [ ] Restore some sort of confidence estimate to tracing or optimization or both, to avoid overly long traces
- [ ] Restore the optimization for stack size checks and its tests
- [ ] Remove some code duplication by setting the previous instruction to `NOP` when entering tracing.
The only tests we need to restore are the stack size optimization checks; those are all the tests that are disabled for now.
Another action item: a prime-number table for the backoff counters. A prime warmup count is unlikely to sync up with a loop's internal period, so tracing should capture a "better", more representative iteration (a toy table is sketched below). Note that PyPy also uses a "strange" number for loop warmup.
https://github.com/python/cpython/issues/141498
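For illustration only (the values are made up, not a proposal), such a table might look like:

```c
#include <stdint.h>

/* Illustrative prime warmup thresholds for the backoff counters.
 * A prime period is unlikely to coincide with a loop's internal
 * cycle (e.g. a branch taken every 4th or 16th iteration), so the
 * iteration that triggers tracing is more likely to be typical. */
static const uint16_t warmup_thresholds[] = {
    2, 7, 17, 61, 251, 1021, 4093
};
```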
I found a serious perf regression on nbody. The problem is that `stop_tracing` adds a `_DEOPT` when it sees `ENTER_EXECUTOR`, which stops the traces from growing. I used `_EXIT_TRACE` previously, but that was lost when the JIT code was refactored.
The fix is just to pass in the opcode of the exit op. Will do that later; a sketch of the shape is below.
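Roughly, with simplified, hypothetical types (`trace_buffer`, `trace_append_uop`): the caller chooses the terminating micro-op, so ending at `ENTER_EXECUTOR` can use `_EXIT_TRACE`, which lets traces keep growing, instead of `_DEOPT`, which freezes them.

```c
/* Sketch only; types and signatures are simplified stand-ins. */
typedef struct trace_buffer trace_buffer;
void trace_append_uop(trace_buffer *trace, int uop);

static void
stop_tracing(trace_buffer *trace, int exit_uop)
{
    trace_append_uop(trace, exit_uop);  /* _EXIT_TRACE or _DEOPT */
}

/* At the ENTER_EXECUTOR site, end with a stitchable exit:
 *     stop_tracing(trace, _EXIT_TRACE);
 * At a genuine guard failure, keep the deopt:
 *     stop_tracing(trace, _DEOPT);
 */
```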
@Fidget-Spinner Is this closable now? Can we track bugs etc. in new issues? For example: https://github.com/python/cpython/issues/139109#issuecomment-3516990790?
Yeah, let's do follow-ups separately.
I'd rather keep this open, otherwise it gives a false impression of being finished when there's some clean up to do. Or, make new issues for the list above and then close this one.
> Restore the optimization for stack size checks and its tests
I think we should abandon this and just clean up the code implementing this optimisation. It's generally very finicky and I don't think it nets us much of a perf win.