cpython gh-119258: Eliminate Type Guards in Tier 2 Optimizer

This PR eliminates some unnecessary calls to _GUARD_TYPE_VERSION in the tier 2 interpreter, closing #119258.

It does it by symbolically tracking the version of the type of each object and eliminating the type guard if the checked version is the same as the tracked one.

It also has to verify that there were no operations that might have escaped since we recorded the type version. So we store the last instruction offset that could lead to an escape (and possibly change the type version), and also store at which instruction offset we stored the version. We can compare these when optimizing, to make sure to only remove the bytecode if we haven't escaped after recording the version.

Work on this PR was done at the PyCon sprints with @dpdani and with help from @Fidget-Spinner and @markshannon

Development

In order to try out this commit, you have to configure it with the experimental jit:

./configure --with-pydebug --enable-experimental-jit=interpreter

Then whenever you change the cases you have to run make regen-cases and then make -j.

Finally, to just run one test case, you can use the --match:

./python.exe -m test -v test_capi.test_opt --match test_guard_type_version_removed
./python.exe -m test -v test_capi.test_opt --match test_guard_type_version_not_removed

Issue: gh-119258

May 20 '24 20:05 saulshanabrook

All commit authors signed the Contributor License Agreement.

May 20 '24 20:05 ghost

@brandtbucher can you run the benchmarks on this PR?

May 20 '24 21:05 saulshanabrook

Just started the job (including stats). Should be 2-3 hours!

May 20 '24 21:05 brandtbucher

Results are in!

Performance impact is negligible with the JIT enabled. However, the stats are more interesting: if you go to the "Optimization (Tier 2) stats" section, you'll see that we're executing 1.7B fewer _GUARD_TYPE_VERSION instructions (a reduction of 25%).

I haven't dug much deeper than that, but the stats also seem to indicate that we're executing around 10% fewer traces (and micro-ops) in general. It isn't clear why this PR would cause that to happen... it could just be noise in the run, or possibly some second-order interaction due to the change.

Either way, I think the takeaway is: this seems to work correctly, but type version checks aren't really a significant source of overhead. I think it may still be worth doing, especially if we can use a similar approach to remove other checks too.

@markshannon, any thoughts here?

May 21 '24 12:05 brandtbucher

we're executing around 10% fewer traces (and micro-ops) in general. It isn't clear why this PR would cause that to happen...

I did notice that this PR breaks the hexiom and mdp benchmarks (they run on the base, but not with the changes in this PR). This kind of thing is totally expected for a big experimental change like this, but it might provide some clues about how it's not performing as expected. The logs just say "benchmark died", which usually means a segfault or something else that wouldn't produce a Python traceback.

May 22 '24 11:05 mdboom

Interesting! We’ll take a look today.

May 22 '24 14:05 brandtbucher

@mdboom thank you for flagging this! I was able to reproduce it locally with these two test files.

I notice it also segfaults on the new one https://github.com/python/cpython/pull/119365 so I will look into why and try to add a test case for it as well to the test suite in the new PR.

May 22 '24 14:05 saulshanabrook

We seemed to fix this bug: https://github.com/python/cpython/pull/119365/commits/88e2e59d3225a59b7c52eb36d1230d03c696b326

It was very weird and likely a bug in main that was only highlighted with these changes.

May 22 '24 19:05 saulshanabrook

I am closing this PR since it seems like we want to use the approach in https://github.com/python/cpython/pull/119365 to use watchers instead, which allows us to remove more guards and also opens up some future improvements that rely on type watching.

May 22 '24 20:05 saulshanabrook