cpython Tier2 peephole optimization: remove extraneous _CHECK_STACK

Implement a new peephole optimization for the tier2 optimizer that removes _CHECK_STACK_SPACE if we see that this check is already present

Feb 29 '24 23:02 lazorchakp

cc @Fidget-Spinner

This might need some labels. I also looked around for a parent ticket, but didn't find anything particularly obvious

Feb 29 '24 23:02 lazorchakp

I will link to https://github.com/python/cpython/issues/115419

Mar 01 '24 08:03 Fidget-Spinner

I don't know anything about the plan here, so this may not be news. :-) Couldn't you check for the needed stack space once at the beginning of the executor? And if there's a loop, lift it out of there. (And if it fits within the stack space of the code object already, drop it altogether.)

Mar 03 '24 21:03 gvanrossum

Makes sense. Let's go with consolidating them first then another follow-up PR for lifting. Lifting stuff out of loops requires _JUMP_ABSOLUTE and I can just point to the code from my old PR to copy. (Though the heuristics need someone else to work on it).

Mar 03 '24 21:03 Fidget-Spinner

Is anyone working on this? Otherwise I'd like to give it a go.

Mar 20 '24 16:03 gvanrossum

@gvanrossum I think @lazorchakp might be working on this, @lazorchakp do you mind if Guido takes over? If you have something with an implementation up but no tests that's okay, you can just submit a draft PR and Guido can probably take over if you don't mind.

Mar 20 '24 16:03 Fidget-Spinner

I'd also be happy to mentor @lazorchakp !

Mar 20 '24 16:03 gvanrossum

Sorry everyone - I've been busier than anticipated and haven't made concrete progress on this yet. If it's alright though, I'd still like to try it, and I should be able to get a draft up tonight if that works. @Fidget-Spinner has given me some guidance over email, but I'd gladly accept as much mentoring as you both are willing to provide 🙂

That being said, @gvanrossum if you'd rather go for it yourself, that's quite alright! Just let me know.

Mar 20 '24 17:03 lazorchakp

No, I'd love to have you do it!

My guess is that the critical trick should be to change the uop that checks for stack space to take a number indicating the amount of stack space needed, rather than digging that value out of the code object (which must be retrieved from the function object, which must be dug out of the stack at position -2-oparg, IIRC).

Maybe this number can be put in its operand field by the initial trace projection pass (which needs the code object anyway). Then a later dedicated pass can combine stack checks as long as they are within the same "scope" (i.e., _PUSH_FRAME/_POP_FRAME pair).

Mar 20 '24 17:03 gvanrossum

That sounds great. It makes sense to me that we'd now need _CHECK_STACK_SPACE to know up-front how much space it's looking for. I'll see if I can take your suggestion and modify the uop so that this information is available for the optimization pass.

Mar 20 '24 17:03 lazorchakp

I'm at the point where I have the operand for each _CHECK_STACK_SPACE being correctly assigned to its corresponding function's framesize (pushed a rough branch). Before I implement the actual optimization though, I want to make sure I'm reasoning about this correctly - I'm making some heavy assumptions about the underlying machinery. Are the following cases correct? (other ops removed for simplicity):

CASE 1 - sequential function calls:

_CHECK_STACK_SPACE A
_PUSH_FRAME
_POP_FRAME
_CHECK_STACK_SPACE B
_PUSH_FRAME
_POP_FRAME

proposed optimization:

_CHECK_STACK_SPACE max(A, B)
_PUSH_FRAME
_POP_FRAME
_PUSH_FRAME
_POP_FRAME

CASE 2 - nested function calls:

_CHECK_STACK_SPACE A
_PUSH_FRAME
_CHECK_STACK_SPACE B
_PUSH_FRAME
_POP_FRAME
_POP_FRAME

proposed optimization:

_CHECK_STACK_SPACE A + B
_PUSH_FRAME
_PUSH_FRAME
_POP_FRAME
_POP_FRAME

If these cases are correct, I should be able to combine everything into a single _CHECK_STACK_SPACE uop per trace, as @gvanrossum mentioned above.

Mar 21 '24 05:03 lazorchakp

Yes, both cases look right.

Mar 21 '24 15:03 gvanrossum

@gvanrossum / @Fidget-Spinner

I have this mostly working, but I'm stuck on one critical piece: actually using the new _CHECK_STACK_SPACE operand (the amount of space to check) at execution time. I think I need to update the definition of _CHECK_STACK_SPACE in bytecodes.c. My fear is that I'll only have the operand in tier2 after trace projection, so tier1 and tier2 behavior would have to differ.

Does anyone have ideas about how I might be able to modify _CHECK_STACK_SPACE so that for tier2 (executor_cases.c.h...?) it will use the operand populated during trace projection but for tier1 (generated_cases.c.h...?) it will check the value on the stack? If the behavior must be identical, do I need to modify the bytecode cache in the specializer too? (It's entirely possible that these questions don't make sense or that I have some fundamental misunderstanding)

Mar 22 '24 08:03 lazorchakp

@lazorchakp if the instruction will not fit the format in tier 1, you can add a new uop _CHECK_STACK_SPACE_COMBINED or something. Then manually swap out for that uop in the optimizer analysis.

Mar 22 '24 10:03 Fidget-Spinner

Tier 1 should be unaffected. IIRC there are #ifdefs you can check for.

Mar 22 '24 16:03 gvanrossum

I don't know what the "eyes" reaction means. Do you need more info? If so please ask another question. :-)

Mar 22 '24 23:03 gvanrossum

Ah sorry, didn't want to generate too much noise - I'll be more clear in the future. I just restarted work on this for the evening and these comments are enough to unblock me for now. I won't hesitate to ask another question if I get stuck somewhere!

Mar 22 '24 23:03 lazorchakp

I'm getting ready to open the PR here and am wrapping up some final tests. One of my new tests fails on main though; is opening a new issue appropriate? I haven't had time to investigate yet.

This is it, if anyone's interested (under test_opt.TestUopsOptimziation)

    def test_many_nested(self)
        # overflow the trace_stack
        def dummy_a(x):
            return x
        def dummy_b(x):
            return dummy_a(x)
        def dummy_c(x):
            return dummy_b(x)
        def dummy_d(x):
            return dummy_c(x)
        def dummy_e(x):
            return dummy_d(x)
        def dummy_f(x):
            return dummy_e(x)
        def dummy_g(x):
            return dummy_f(x)
        def dummy_h(x):
            return dummy_g(x)
        def testfunc(n):
            a = 0
            for _ in range(n):
                a += dummy_h(n)
            return a

        self._run_with_optimizer(testfunc, 32)

error:

Assertion failed: ((this_instr + 2)->opcode == _PUSH_FRAME), function optimize_uops, file optimizer_cases.c.h, line 1600.
Fatal Python error: Aborted

Current thread 0x000000020140a100 (most recent call first):
  File "..../cpython/Lib/test/test_capi/test_opt.py", line 978 in testfunc
  File "..../cpython/Lib/test/test_capi/test_opt.py", line 588 in _run_with_optimizer
  ...

Mar 23 '24 05:03 lazorchakp

Seems like a bug. Please open an issue with the repro. I will fix it. Thanks!

Mar 23 '24 06:03 Fidget-Spinner

@Fidget-Spinner I opened #117180. Thanks, and lmk if you need more details.

Mar 23 '24 13:03 lazorchakp

Still have to iterate with the pipeline a bit in my draft PR. I'll move it to Open and send an update here when it's officially ready for review.

Mar 26 '24 04:03 lazorchakp

Just opened #117242 for review

Mar 28 '24 04:03 lazorchakp

This is done -- thanks again!

Apr 04 '24 18:04 gvanrossum

Tier2 peephole optimization: remove extraneous _CHECK_STACK_SPACE ops