Side exit temperature requires exponential backoff
The `temperature` field used in `_PyExitData` (aka `struct _exit_data`) is currently a simple counter. When it reaches 15 we try to optimize the side exit's target; when this optimization fails we reset the counter to -16, so we retry only after the side exit has been taken another 32 times.
This is fine if something has changed that eventually makes the side exit's target optimizable; however, in cases where optimization deterministically fails (e.g., when it encounters an unsupported opcode early in the trace, or when the abstract interpreter gets a miss from the function version cache), we can waste a lot of time attempting to optimize the same piece of code over and over.
We should implement exponential backoff here just like we do for the backedge threshold.
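For context, here is the current behavior paraphrased as code (a sketch; `optimize_target()` is a hypothetical stand-in for the actual call):

```c
/* Current scheme: a flat retry interval. The temperature counts up;
 * at 15 we attempt to optimize the exit's target, and on failure the
 * counter is reset to -16, so every subsequent attempt happens exactly
 * 32 visits later, no matter how often it has already failed. */
exit->temperature++;
if (exit->temperature >= 15) {
    if (!optimize_target(exit)) {   /* hypothetical stand-in */
        exit->temperature = -16;
    }
}
```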
PS: There's another threshold, `resume_threshold`, that doesn't appear to be used at all (though there is a fair bit of infrastructure around it that obscures this fact). My guess is that we intended to use this one for the side exit threshold but we changed the direction of the counter because of the idea to name it "temperature", and then introduced a new threshold variable for that.
Linked PRs
- gh-117144
> PS: There's another threshold, `resume_threshold`, that doesn't appear to be used at all (though there is a fair bit of infrastructure around it that obscures this fact). My guess is that we intended to use this one for the side exit threshold but we changed the direction of the counter because of the idea to name it "temperature", and then introduced a new threshold variable for that.
I believe the original plan was to begin tracing both on `RESUME` and `JUMP_BACKWARD` (hence having separate thresholds for each). In practice, it's probably sufficient to have `JUMP_BACKWARD` plus side exits, like we do now. But we may wish to project traces for `RESUME` in the future.
IMO we should not be maintaining infrastructure like this (which can easily be copied) until we actually need it.
@markshannon Mind if I take a look at this? I'd like to understand exactly how we do exponential backoff in other cases and maybe generalize that in the form of macros or inline functions.
Lab notes:
This description of the specialization counters probably belongs in a comment somewhere:
- The exponential backoff in specializing instructions uses inline functions defined in `pycore_code.h`, notably `adaptive_counter_bits()` and following, and in `ceval_macros.h`, notably `ADAPTIVE_COUNTER_IS_ZERO()`.
- They use the top 12 bits for a `value` and the bottom 4 for a `backoff` quantity.
- They treat the top 12 bits as unsigned, and count down using `DECREMENT_ADAPTIVE_COUNTER()` until they've reached zero.
- When they need to back off, they increment the `backoff` part (unless it's 12 already) and then set the value to a mask containing that many low bits set. Example: if `backoff` after incrementing is 7, the value is set to `0x7f` (binary `0b_0111_1111`, or 127 decimal).
- This gives a nice backoff sequence (powers of 2 until `backoff` is saturated): 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 4096, 4096...
- There are two special values: `cooldown`, to which the counter is initialized after a successful specialization (it sets the `value` part to 52 and the `backoff` part to 0), and `warmup`, to which it is initialized by `_PyCode_Quicken()` when a code object is initialized (value=1, backoff=1).
- The unusual `cooldown` value is designed so that after a successful specialization, if the specialized instruction has to deoptimize repeatedly, after 53 times we try to re-specialize. If that fails we start the above backoff sequence.
- Everything is completely parametrized; the algorithm can partition the 16-bit counter differently by changing `ADAPTIVE_BACKOFF_BITS`.
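To make that concrete for myself, here is a minimal standalone sketch of the scheme just described. The helper names are made up; the real versions are `adaptive_counter_bits()` and friends in `pycore_code.h` plus the macros in `ceval_macros.h`:

```c
#include <stdint.h>

#define BACKOFF_BITS 4
#define MAX_BACKOFF 12                    /* backoff saturates here */

/* Pack a 12-bit value (top bits) and a 4-bit backoff (bottom bits). */
static inline uint16_t
counter_bits(uint16_t value, uint16_t backoff)
{
    return (uint16_t)((value << BACKOFF_BITS) | backoff);
}

/* The counter triggers once the value part has counted down to zero. */
static inline int
counter_is_zero(uint16_t counter)
{
    return (counter >> BACKOFF_BITS) == 0;
}

/* Subtract 1 from the value part; the backoff part is untouched. */
static inline uint16_t
counter_decrement(uint16_t counter)
{
    return (uint16_t)(counter - (1 << BACKOFF_BITS));
}

/* On failure: bump backoff (capped at 12), then reload the value with
 * that many low bits set, e.g. backoff 7 -> value 0x7f (127). */
static inline uint16_t
counter_backoff(uint16_t counter)
{
    uint16_t backoff = counter & ((1 << BACKOFF_BITS) - 1);
    if (backoff < MAX_BACKOFF) {
        backoff++;
    }
    return counter_bits((uint16_t)((1 << backoff) - 1), backoff);
}
```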
Somehow, for `JUMP_BACKWARD` a totally different backoff algorithm is used to decide when to attempt creating a Tier 2 executor.
- It's still splitting the 16-bit counter into a 12-bit `value` and a 4-bit `backoff` part.
- It counts up instead of down.
- The `value` part is treated as a signed number.
- After emulating this on paper and in Python code, I determined that the backoff sequence is 16, 32, 48, 80, 144, 272, 528, 1040, 2064, 2064, 2064, ... (a standalone emulation sketch follows this list).
- There's a reason to start the backoff sequence at 16 here: 16 is the UOps optimizer's threshold for attempting to create an executor (more on optimizer thresholds below).
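For the record, a tiny standalone emulation (my reconstruction, not CPython code) that reproduces that sequence; the initial backoff of 4 and the saturation at 11 (a 12-bit signed value can't go below -2048) are inferred from the numbers:

```c
#include <stdio.h>

int main(void)
{
    const int threshold = 16;  /* tier 2 backedge threshold */
    int backoff = 4;           /* inferred initial backoff */
    int value = 0;             /* 12-bit signed value part, counts up */

    for (int attempt = 0; attempt < 12; attempt++) {
        /* executions until the counter reaches the threshold again */
        printf("%d ", threshold - value);
        /* after a failed attempt, reset the signed value downward */
        value = -(1 << backoff);
        if (backoff < 11) {    /* -2048 is the 12-bit signed minimum */
            backoff++;
        }
    }
    printf("\n");              /* 16 32 48 80 144 272 528 1040 2064 ... */
    return 0;
}
```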
My proposal is to unify the two algorithms.
- We can initialize the `JUMP_BACKWARD` counter to `(value=16, backoff=0)` using `adaptive_counter_bits(16, 0)` in `_PyCode_Quicken()`.
- This leads to the slightly modified backoff sequence for executor creation attempts of 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 4096, 4096...
- Unless there's a magical reason why the original backoff sequence (16, 32, 48, 80, ...) is mysteriously better, this looks close enough to me.
- This loses the "tuning" ability that is provided by `optimizer->backoff_threshold` (and `interp->optimizer_backoff_threshold`). We currently don't use this ability though.
- We could keep some amount of tuning by having `_PyCode_Quicken()` get the initial value from the interpreter, though this makes it harder to adjust the threshold dynamically.
- Another possibility would be to initialize the counter to 0 in `_PyCode_Quicken()` and special-case that value in the `JUMP_BACKWARD` instruction.
- But honestly I'd call YAGNI on this tuning capability -- if and when we decide that we need to tune the threshold for tier 2, we can implement the mechanism we actually need. Having the initial value be a build-time constant is all the tuning we need for now.
For the temperature (or hotness) of side exits we can then use the same exponential backoff code, possibly with a different initial value. The current sequence is 32, 32, 32, ...; we could change this to 32, 64, 128, 256, 512, 1024, 2048, 4096, 4096, 4096...
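Sketching what that would look like at a side exit, reusing the hypothetical helpers from the earlier sketch (`optimize_target()` is again a stand-in; only `exit->temperature` and the target sequence come from this discussion):

```c
/* exit->temperature would be initialized so the first attempt happens
 * after 32 visits; each failure then backs off exponentially, giving
 * 32, 64, 128, ..., 4096, 4096, ... rather than a flat 32, 32, 32, ... */
if (counter_is_zero(exit->temperature)) {
    if (!optimize_target(exit)) {   /* hypothetical stand-in */
        exit->temperature = counter_backoff(exit->temperature);
    }
}
else {
    exit->temperature = counter_decrement(exit->temperature);
}
```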
I think I've talked myself into creating a draft PR.
Oh, I missed something important. Optimizers may come and go, and the counter in `JUMP_BACKWARD` should be compared to the current optimizer's threshold. This is important because the default (dummy) optimizer has an infinite threshold, while the UOps optimizer has a threshold of 16. So when we execute a loop 1000 times with the dummy optimizer, the counter is expected to reach 1000, and then when we switch in the UOps optimizer it will attempt to optimize on the next iteration (because 1000 is already larger than 16, the new threshold -- and presumably at this point the specializer has already worked its magic).
I'll have to figure out a compromise.
> Mind if I take a look at this?
Not at all
Once, a long time ago, the initial counter value was part of the on-disk format, so we couldn't use bitfields. Now it is set at runtime, so it might make sense to use bitfields and let the C compiler handle all the shifting and masking. Something like:
```c
struct counter {
    int16_t value : 12;     /* counts down toward zero */
    uint16_t backoff : 4;   /* exponent controlling the next reload */
};
```
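For illustration (my sketch, not part of the suggestion above), the hot-path operations then become plain member updates. Note that a signed 12-bit `value` tops out at 2047, so the backoff cap would be 11 rather than tier 1's 12:

```c
/* Sketch: decrement and back-off with the bitfield struct; the
 * compiler generates all the shifting and masking. */
static void
counter_tick(struct counter *c)
{
    if (c->value == 0) {
        /* trigger an optimization attempt; on failure, back off: */
        if (c->backoff < 11) {              /* 2047 = signed 12-bit max */
            c->backoff++;
        }
        c->value = (1 << c->backoff) - 1;   /* mask of low bits */
    }
    else {
        c->value--;
    }
}
```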
@markshannon Before I try to implement the bitfield suggestion, can I get your opinion on my proposal above?
Unifying the counters seems reasonable. They all do much the same thing.
I'd like to keep the thresholds (at least for now) for two reasons:
- It provides a simple way to turn off tier2, by setting the threshold higher than the counter can go.
- We don't know yet what values we want for the various thresholds, so we might want to change them dynamically.
We can get rid of the thresholds, if it makes things significantly simpler.
We can support turning off tier 2 by changing `if ((++counter) >= interp->threshold)` to `if ((counter -= interp->tier2) == 0)` (presumably with `interp->tier2` set to 0 when tier 2 is off, so an initially nonzero counter never reaches zero).
We can handle dynamic thresholds in the optimizer if necessary.
Actually, there is no need for `interp->tier2`.
`if ((--counter) == 0)` is faster, and exponential backoff means we won't be calling the optimizer function much.
Although objectively the PR makes things worse (it is larger and perhaps slower), I still think factoring out the counter is a good idea.
I think that the various optimizer thresholds are a blocker. So what happens if we always trigger the counter on zero, removing the thresholds?
1. We need to initialize the counters to some value, and that has to be a constant for practical purposes.
2. We will call the optimizer even if it is turned off.
(1) limits our options to vary the thresholds at runtime, but I'm not seeing any value in being able to do that anyway. We use constants for tier 1 and that seems to work just fine. (2) may actually result in a speedup. It is slower when we do call the no-op optimizer, but faster the thousands of times when we do not.
I've removed the thresholds as an experiment. I'm running the benchmarks now.
Thanks! Looking forward to the results.
I have a theory BTW about why my PR makes things slower -- it could be that the bitfields end up generating worse code for either the zero check or the decrement operation, and that would affect Tier 1 (especially in cases where specialization fails).
Strangely slow: https://github.com/faster-cpython/benchmarking/tree/main/results/bm-20240328-3.13.0a5%2B-174fc24-JIT I may have messed something up. Maybe worth running the stats on it to see what's happening.
More discussion in the PR. I've removed the thresholds there as well and done some limited local benchmarking; my theory is that the tweaks cause more Tier 2 uops to be executed, and that's still slower (even with the JIT, though less so in that case).