Side exit temperature requires exponential backoff
The `temperature` field used in `_PyExitData` (aka `struct _exit_data`) is currently a simple counter. When it reaches 15 we try to optimize the side exit's target; when this optimization fails we reset the counter to -16, so we retry only after the side exit has been taken another 32 times.
This is fine if something has changed that eventually makes the side exit's target optimizable; however, in cases where optimization deterministically fails (e.g., when it encounters an unsupported opcode early in the trace, or when the abstract interpreter gets a miss from the function version cache), we can waste a lot of time attempting to optimize the same piece of code over and over.
We should implement exponential backoff here just like we do for the backedge threshold.
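For context, here is the current behavior paraphrased as code (a sketch; `optimize_target()` is a hypothetical stand-in for the actual call):

```c
/* Current scheme: a flat retry interval. The temperature counts up;
 * at 15 we attempt to optimize the exit's target, and on failure the
 * counter is reset to -16, so every subsequent attempt happens exactly
 * 32 visits later, no matter how often it has already failed. */
exit->temperature++;
if (exit->temperature >= 15) {
    if (!optimize_target(exit)) {   /* hypothetical stand-in */
        exit->temperature = -16;
    }
}
```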
PS: There's another threshold, `resume_threshold`, that doesn't appear to be used at all (though there is a fair bit of infrastructure around it that obscures this fact). My guess is that we intended to use this one for the side exit threshold but we changed the direction of the counter because of the idea to name it "temperature", and then introduced a new threshold variable for that.
Linked PRs
- gh-117144
> PS: There's another threshold, `resume_threshold`, that doesn't appear to be used at all (though there is a fair bit of infrastructure around it that obscures this fact). My guess is that we intended to use this one for the side exit threshold but we changed the direction of the counter because of the idea to name it "temperature", and then introduced a new threshold variable for that.
I believe the original plan was to begin tracing both on `RESUME` and `JUMP_BACKWARD` (hence having separate thresholds for each). In practice, it's probably sufficient to have `JUMP_BACKWARD` plus side exits, like we do now. But we may wish to project traces for `RESUME` in the future.
IMO we should not be maintaining infrastructure like this (which can easily be copied) until we actually need it.
@markshannon Mind if I take a look at this? I'd like to understand exactly how we do exponential backoff in other cases and maybe generalize that in the form of macros or inline functions.
Lab notes:
This description of the specialization counters probably belongs in a comment somewhere:
- The exponential backoff in specializing instructions uses inline functions defined in `pycore_code.h`, notably `adaptive_counter_bits()` and following, and in `ceval_macros.h`, notably `ADAPTIVE_COUNTER_IS_ZERO()`.
- They use the top 12 bits for a `value` and the bottom 4 for a `backoff` quantity.
- They treat the top 12 bits as unsigned, and count down using `DECREMENT_ADAPTIVE_COUNTER()` until they've reached zero.
- When they need to back off, they increment the `backoff` part (unless it's 12 already) and then set the value to a mask containing that many low bits set. Example: if `backoff` after incrementing is 7, the value is set to `0x7f` (binary `0b_0111_1111`, or 127 decimal).
- This gives a nice backoff sequence (powers of 2 until `backoff` is saturated): 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 4096, 4096...
- There are two special values: `cooldown`, to which the counter is initialized after a successful specialization (it sets the `value` part to 52 and the `backoff` part to 0), and `warmup`, to which it is initialized by `_PyCode_Quicken()` when a code object is initialized (value=1, backoff=1).
- The unusual `cooldown` value is designed so that after a successful specialization, if the specialized instruction has to deoptimize repeatedly, after 53 times we try to re-specialize. If that fails we start the above backoff sequence.
- Everything is completely parametrized; the algorithm can partition the 16-bit counter differently by changing `ADAPTIVE_BACKOFF_BITS`.
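To make that concrete for myself, here is a minimal standalone sketch of the scheme just described. The helper names are made up; the real versions are `adaptive_counter_bits()` and friends in `pycore_code.h` plus the macros in `ceval_macros.h`:

```c
#include <stdint.h>

#define BACKOFF_BITS 4
#define MAX_BACKOFF 12                    /* backoff saturates here */

/* Pack a 12-bit value (top bits) and a 4-bit backoff (bottom bits). */
static inline uint16_t
counter_bits(uint16_t value, uint16_t backoff)
{
    return (uint16_t)((value << BACKOFF_BITS) | backoff);
}

/* The counter triggers once the value part has counted down to zero. */
static inline int
counter_is_zero(uint16_t counter)
{
    return (counter >> BACKOFF_BITS) == 0;
}

/* Subtract 1 from the value part; the backoff part is untouched. */
static inline uint16_t
counter_decrement(uint16_t counter)
{
    return (uint16_t)(counter - (1 << BACKOFF_BITS));
}

/* On failure: bump backoff (capped at 12), then reload the value with
 * that many low bits set, e.g. backoff 7 -> value 0x7f (127). */
static inline uint16_t
counter_backoff(uint16_t counter)
{
    uint16_t backoff = counter & ((1 << BACKOFF_BITS) - 1);
    if (backoff < MAX_BACKOFF) {
        backoff++;
    }
    return counter_bits((uint16_t)((1 << backoff) - 1), backoff);
}
```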
Somehow, for `JUMP_BACKWARD` a totally different backoff algorithm is used to decide when to attempt creating a Tier 2 executor.
- It's still splitting the 16-bit counter into a 12-bit `value` and a 4-bit `backoff` part.
- It counts up instead of down.
- The `value` part is treated as a signed number.
- After emulating this on paper and in Python code, I determined that the backoff sequence is 16, 32, 48, 80, 144, 272, 528, 1040, 2064, 2064, 2064, ... (a standalone emulation sketch follows this list).
- There's a reason to start the backoff sequence at 16 here: 16 is the UOps optimizer's threshold for attempting to create an executor (more on optimizer thresholds below).
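For the record, a tiny standalone emulation (my reconstruction, not CPython code) that reproduces that sequence; the initial backoff of 4 and the saturation at 11 (a 12-bit signed value can't go below -2048) are inferred from the numbers:

```c
#include <stdio.h>

int main(void)
{
    const int threshold = 16;  /* tier 2 backedge threshold */
    int backoff = 4;           /* inferred initial backoff */
    int value = 0;             /* 12-bit signed value part, counts up */

    for (int attempt = 0; attempt < 12; attempt++) {
        /* executions until the counter reaches the threshold again */
        printf("%d ", threshold - value);
        /* after a failed attempt, reset the signed value downward */
        value = -(1 << backoff);
        if (backoff < 11) {    /* -2048 is the 12-bit signed minimum */
            backoff++;
        }
    }
    printf("\n");              /* 16 32 48 80 144 272 528 1040 2064 ... */
    return 0;
}
```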
My proposal is to unify the two algorithms.
- We can initialize the `JUMP_BACKWARD` counter to `(value=16, backoff=0)` using `adaptive_counter_bits(16, 0)` in `_PyCode_Quicken()`.
- This leads to the slightly modified backoff sequence for executor creation attempts of 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 4096, 4096...
- Unless there's a magical reason why the original backoff sequence (16, 32, 48, 80, ...) is mysteriously better, this looks close enough to me.
- This loses the "tuning" ability that is provided by `optimizer->backoff_threshold` (and `interp->optimizer_backoff_threshold`). We currently don't use this ability though.
- We could keep some amount of tuning by having `_PyCode_Quicken()` get the initial value from the interpreter, though this makes it harder to adjust the threshold dynamically.
- Another possibility would be to initialize the counter to 0 in `_PyCode_Quicken()` and special-case that value in the `JUMP_BACKWARD` instruction.
- But honestly I'd call YAGNI on this tuning capability -- if and when we decide that we need to tune the threshold for tier 2, we can implement the mechanism we actually need. Having the initial value be a build-time constant is all the tuning we need for now.
For the temperature (or hotness) of side exits we can then use the same exponential backoff code, possibly with a different initial value. The current sequence is 32, 32, 32, ...; we could change this to 32, 64, 128, 256, 512, 1024, 2048, 4096, 4096, 4096...
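Sketching what that would look like at a side exit, reusing the hypothetical helpers from the earlier sketch (`optimize_target()` is again a stand-in; only `exit->temperature` and the target sequence come from this discussion):

```c
/* exit->temperature would be initialized so the first attempt happens
 * after 32 visits; each failure then backs off exponentially, giving
 * 32, 64, 128, ..., 4096, 4096, ... rather than a flat 32, 32, 32, ... */
if (counter_is_zero(exit->temperature)) {
    if (!optimize_target(exit)) {   /* hypothetical stand-in */
        exit->temperature = counter_backoff(exit->temperature);
    }
}
else {
    exit->temperature = counter_decrement(exit->temperature);
}
```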
I think I've talked myself into creating a draft PR.
Oh, I missed something important. Optimizers may come and go, and the counter in `JUMP_BACKWARD` should be compared to the current optimizer's threshold. This is important because the default (dummy) optimizer has an infinite threshold, while the UOps optimizer has a threshold of 16. So when we execute a loop 1000 times with the dummy optimizer, the counter is expected to reach 1000, and then when we switch in the UOps optimizer it will attempt to optimize on the next iteration (because 1000 is already larger than 16, the new threshold -- and presumably at this point the specializer has already worked its magic).
I'll have to figure out a compromise.
> Mind if I take a look at this?
Not at all
Once, a long time ago, the initial counter value was part of the on-disk format, so we couldn't use bitfields. Now it is set at runtime, so it might make sense to use bitfields and let the C compiler handle all the shifting and masking. Something like:
```c
struct counter {
    int16_t value : 12;     /* counts down toward zero */
    uint16_t backoff : 4;   /* exponent controlling the next reload */
};
```
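For illustration (my sketch, not part of the suggestion above), the hot-path operations then become plain member updates. Note that a signed 12-bit `value` tops out at 2047, so the backoff cap would be 11 rather than tier 1's 12:

```c
/* Sketch: decrement and back-off with the bitfield struct; the
 * compiler generates all the shifting and masking. */
static void
counter_tick(struct counter *c)
{
    if (c->value == 0) {
        /* trigger an optimization attempt; on failure, back off: */
        if (c->backoff < 11) {              /* 2047 = signed 12-bit max */
            c->backoff++;
        }
        c->value = (1 << c->backoff) - 1;   /* mask of low bits */
    }
    else {
        c->value--;
    }
}
```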
@markshannon Before I try to implement the bitfield suggestion, can I get your opinion on my proposal above?
Unifying the counters seems reasonable. They all do much the same thing.
I'd like to keep the thresholds (at least for now) for two reasons:
- It provides a simple way to turn off tier2, by setting the threshold higher than the counter can go.
- We don't know yet what values we want for the various thresholds, so we might want to change them dynamically.
We can get rid of the thresholds, if it makes things significantly simpler.
We can support turning off tier 2 by changing `if ((++counter) >= interp->threshold)` to `if ((counter -= interp->tier2) == 0)` (presumably with `interp->tier2` set to 0 when tier 2 is off, so an initially nonzero counter never reaches zero).
We can handle dynamic thresholds in the optimizer if necessary.
Actually, there is no need for `interp->tier2`.
`if ((--counter) == 0)` is faster, and exponential backoff means we won't be calling the optimizer function much.
Although objectively the PR makes things worse (it is larger and perhaps slower), I still think factoring out the counter is a good idea.
I think that the various optimizer thresholds are a blocker. So what happens if we always trigger the counter on zero, removing the thresholds?
1. We need to initialize the counters to some value, and that has to be a constant for practical purposes.
2. We will call the optimizer even if it is turned off.
(1) limits our options to vary the thresholds at runtime, but I'm not seeing any value in being able to do that anyway. We use constants for tier 1 and that seems to work just fine. (2) may actually result in a speedup. It is slower when we do call the no-op optimizer, but faster the thousands of times when we do not.
I've removed the thresholds as an experiment. I'm running the benchmarks now.
Thanks! Looking forward to the results.
I have a theory BTW about why my PR makes things slower -- it could be that the bitfields end up generating worse code for either the zero check or the decrement operation, and that would affect Tier 1 (especially in cases where specialization fails).
Strangely slow: https://github.com/faster-cpython/benchmarking/tree/main/results/bm-20240328-3.13.0a5%2B-174fc24-JIT I may have messed something up. Maybe worth running the stats on it to see what's happening.
More discussion in the PR. I've removed the thresholds there as well and done some limited local benchmarking; my theory is that the tweaks cause more Tier 2 uops to be executed, and that's still slower (even with the JIT, though less so in that case).