
gh-118926: Deferred reference counting GC changes for free threading

Open Fidget-Spinner opened this issue 1 year ago • 6 comments

This PR mainly introduces changes to the free-threading GC to support deferred reference counting in the future.

To get this to work, new stack references must be written to the stack immediately, with no Py_DECREF or other escaping code between their creation and the store to the stack. This ensures they are visible to the GC.

This PR also NULLs out the unused portion of the stack, because the GC scans the entire stack: the stack pointer may be inconsistent between escaping calls (including calls to Py_DECREF), so the whole stack must be scanned.

This PR removes the temporary immortalization introduced previously in #117783. I wanted to do this in a separate PR, but the only way to test this properly is to remove that hack. So it has to be bundled in this PR.

Finally, this PR fixes a few bugs in steals and borrows. This was only caught by the GC changes, not by the debugger that I was working on. Since these are untestable without the GC changes, I bundled them in.

Temporary perf regressions (6 threads):

object_cfunction PASSED: 3.6x faster
cmodule_function FAILED: 2.1x faster (expected: LOAD_ATTR from dict)
generator PASSED: 3.7x faster
pymethod FAILED: 2.5x faster (expected: LOAD_ATTR from pytype_lookup)
pyfunction FAILED: 2.9x faster (expected: LOAD_GLOBAL)
module_function FAILED: 2.9x faster (expected: LOAD_ATTR from module dict)
load_string_const PASSED: 4.0x faster
load_tuple_const PASSED: 3.8x faster
create_closure FAILED: 2.3x faster (unsure why)
create_pyobject FAILED: 1.5x faster (expected: LOAD_GLOBAL)
  • Issue: gh-118926

Fidget-Spinner avatar Jul 03 '24 10:07 Fidget-Spinner

!buildbot nogil

Fidget-Spinner avatar Jul 03 '24 10:07 Fidget-Spinner

:robot: New build scheduled with the buildbot fleet by @Fidget-Spinner for commit 8cb139f4ad78cce6e34b63f441cbc18bb5e40da4 :robot:

The command will test the builders whose names match following regular expression: nogil

The builders matched are:

  • AMD64 Ubuntu NoGIL PR
  • aarch64 Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel ASAN NoGIL PR
  • AMD64 Ubuntu NoGIL Refleaks PR
  • AMD64 Fedora Rawhide NoGIL refleaks PR
  • aarch64 Fedora Rawhide NoGIL refleaks PR
  • PPC64LE Fedora Rawhide NoGIL refleaks PR
  • AMD64 Windows Server 2022 NoGIL PR
  • ARM64 MacOS M1 Refleaks NoGIL PR
  • PPC64LE Fedora Rawhide NoGIL PR
  • x86-64 MacOS Intel NoGIL PR
  • ARM64 MacOS M1 NoGIL PR
  • AMD64 Fedora Rawhide NoGIL PR

bedevere-bot avatar Jul 03 '24 10:07 bedevere-bot

!buildbot nogil

Fidget-Spinner avatar Jul 03 '24 12:07 Fidget-Spinner

:robot: New build scheduled with the buildbot fleet by @Fidget-Spinner for commit 700c2fdc0708df987a3d524831dfdd34c0ddeb0f :robot:

The command will test the builders whose names match following regular expression: nogil. The builders matched are the same nogil builders as in the first build above.

bedevere-bot avatar Jul 03 '24 12:07 bedevere-bot

!buildbot nogil

Fidget-Spinner avatar Jul 03 '24 13:07 Fidget-Spinner

:robot: New build scheduled with the buildbot fleet by @Fidget-Spinner for commit cde931dd6cab4d17c7d61ad1d00a959b57b62ebd :robot:

The command will test the builders whose names match following regular expression: nogil. The builders matched are the same nogil builders as in the first build above.

bedevere-bot avatar Jul 03 '24 13:07 bedevere-bot

This was only caught by the GC changes, not by the debugger that I was working on.

OOI, do you know why this was?

markshannon avatar Jul 08 '24 10:07 markshannon

Why not use the approach prototyped in #119875? (But, rather than add explicit SAVE_SP -- LOAD_SP pairs, wrap the calls in an ESCAPING_CALL macro)

I would like to, but we'd need to wrap anything with Py_DECREF too, and that's too annoying.

Fidget-Spinner avatar Jul 08 '24 11:07 Fidget-Spinner

It doesn't matter what is "annoying". It needs to be correct.

Py_DECREF doesn't escape; it is Py_Dealloc that escapes. So the ESCAPING_CALL can be embedded in an interpreter-only version of the Py_DECREF macro.

markshannon avatar Jul 08 '24 13:07 markshannon

an interpreter-only version of the Py_DECREF macro.

Sorry, I don't quite understand what you mean here. Do you mean we can re-define Py_DECREF for the interpreter loop (via the code generator or otherwise)? If so, that sounds good to me; I can work on that in this PR if you'd like.

Fidget-Spinner avatar Jul 08 '24 13:07 Fidget-Spinner

Something like:

#define INTERPRETER_DECREF(op) \
    do { \
        if (--(op)->ob_refcnt == 0) { \
            ESCAPING_CALL(Py_Dealloc(op)); \
        } \
    } while (0)

(plus debug and free-threading variants)

markshannon avatar Jul 08 '24 14:07 markshannon

Alright. I can work on porting your PR over to the new branch. Would you like me to?

Fidget-Spinner avatar Jul 08 '24 14:07 Fidget-Spinner

This was only caught by the GC changes, not by the debugger that I was working on.

OOI, do you know why this was?

Because these values escape to PyObject, which we don't track.

Fidget-Spinner avatar Jul 09 '24 10:07 Fidget-Spinner

@markshannon I addressed the DSL changes. Do you have any other comments? This PR does not depend on spilling the stack pointer. That would be useful for other reasons, but I would like it to be a separate PR.

Fidget-Spinner avatar Jul 10 '24 14:07 Fidget-Spinner

I don't like this change. Having to add magical [1] suffixes, for reasons that aren't at all clear without knowing the internals of the nogil GC, is going to be a major maintenance headache.

Looking at inst(UNPACK_SEQUENCE_TWO_TUPLE, (unused/1, seq -- val1[1], val0[1])) as a maintainer, I would want to clean it up to inst(UNPACK_SEQUENCE_TWO_TUPLE, (unused/1, seq -- val1, val0)). If I did that, the only clue as to what went wrong would be mysterious crashes in the nogil build.

bytecodes.c is used as input to two interpreter generators, the JIT generator, at least one abstract interpreter, and various metadata generators. It needs to be clear. All necessary information should be expressed explicitly, and, ideally, be checkable.

And please don't mark comments as "resolved" when they aren't.

markshannon avatar Jul 15 '24 09:07 markshannon

We don't want a pointer to a specific stack location.

Why do you say that? We want a pointer to the location of where value (the result) is stored on the stack.

Because it is the job of the code generators to manage the stack. We shouldn't be guessing where the code generator is physically storing a particular value.

What are you referring to here? The stack is currently always in memory.

If that were the case, there wouldn't be a need for the [1] suffixes. The top of the stack is kept in registers (C locals) between uops in tier 1, and will soon be kept in registers between uops in tier 2.

markshannon avatar Jul 15 '24 09:07 markshannon

And please don't mark comments as "resolved" when they aren't.

Sorry my bad, I thought they were.

Ok I kind of get the problem you're explaining. So we would need a solution that:

  1. Is clear in the bytecode DSL.
  2. Does not require exposing a pointer to the instruction body at all. So no *val0 = thing. Preferably you want val0 = thing instead. This is to ensure compatibility with TOS caching when those values are kept in explicit registers.

In that case I propose the following solution:

  1. A new annotation, either write_to_stack val0 or &val0 or flush val0.
  2. The code generator for tier 1 will see the annotation, and for every assignment to the value, it will also write it directly to the stack after, via parsing the instruction body. E.g. if we have the following code:
foo(bar1 -- flush val0) {
    val0 = op(bar1);
    Py_DECREF(bar1);
}

the codegen will produce for tier 1:

TARGET(foo) {
    PyObject *bar1, *val0;
    bar1 = stack_pointer[-1];
    val0 = op(bar1);
    stack_pointer[-1] = val0;
    Py_DECREF(bar1);
}

For tier 2 register version, it will produce:

TARGET(foo) {
    PyObject *reg1, *reg0;
    reg1 = stack_pointer[-1];
    reg0 = op(reg1);
    stack_pointer[-1] = reg0;
    Py_DECREF(reg1);
}

So both tier 1 and tier 2 are happy. Does that sound good to you?

Fidget-Spinner avatar Jul 15 '24 10:07 Fidget-Spinner

A new annotation, either write_to_stack val0 or &val0 or flush val0.

There are two problems with explicitly flushing values:

  • It isn't clear why the value needs to be flushed.
  • There is no way to check that the annotations are in the correct place.

Flushing cached values to the stack, or not, should be the job of the code generator(s). The instruction/uop definitions should declare the semantics, not the desired generated code, if possible.

It might be quite a lot of work, but I really think we should mark any escaping calls, so that the code generator knows what escapes. DECREF_INPUTS is already handled by the code generator, so no changes to the code should be necessary in bytecodes.c regarding reference count operations.

markshannon avatar Jul 15 '24 13:07 markshannon

One possibility to keep things moving for the free-threading build is to generate a different generated_cases.h for the free-threading interpreter. For that case, you could write all values to the stack before any decrefs or escaping calls.

markshannon avatar Jul 15 '24 13:07 markshannon

It might be quite a lot of work, but I really think we should mark any escaping calls, so that the code generator knows what escapes. DECREF_INPUTS is already handled by the code generator, so no changes to the code should be necessary in bytecodes.c regarding reference count operations.

Ok I can port your old PR over to this one. I presume we're flushing the values to the stack before every escaping call?

Fidget-Spinner avatar Jul 15 '24 13:07 Fidget-Spinner

@markshannon, writing all values to the stack is going to introduce an unnecessary performance regression.

This is holding up the free-threaded work for what seems like small aesthetic complaints. If the concern is about maintainer confusion, we can add comments or otherwise document the dozen or so places where this is used.

colesbury avatar Jul 15 '24 14:07 colesbury

This is not just aesthetics. Maintainability is important. By degrading maintainability, you are making more work for others. Particularly for my team.

You say that "writing all values to the stack will be slow". Only for those parts of the stack that are in memory. Which is why it is important to leave the choice of which parts of the stack to spill to memory and which parts to keep in registers up to the code generator, as it will differ for different tiers and different platforms.

the dozen or so places where this is used.

There are a hundred or more places where execution can escape from the interpreter, not counting DECREFs which add hundreds more. It seems unlikely that this PR correctly identifies all cases where a garbage collection can occur and correctly spills all the necessary values to the stack memory.

Even if it is correct, it still lays traps for the unwary. For example, _BINARY_OP contains the code:

    DECREF_INPUTS();
    ERROR_IF(res_o == NULL, error);
    res = PyStackRef_FromPyObjectSteal(res_o);

If this were changed to

    res = PyStackRef_FromPyObjectSteal(res_o);
    DECREF_INPUTS();
    ERROR_IF(PyStackRef_IsNull(res), error);

then it would be unsafe, yet there would be no warning or error from any tools.

markshannon avatar Jul 15 '24 15:07 markshannon

The rewritten _BINARY_OP example is safe because it uses PyStackRef_FromPyObjectSteal. The concern is with calls to PyStackRef_FromPyObjectNew.

There are a hundred or more places where execution can escape from the interpreter

We should not be thinking about this in terms of where calls escape from the interpreter. We should be thinking about this in terms of where PyStackRef_FromPyObjectNew() calls occur, and ensure that those are always written to the stack.

If you want to refactor PyStackRef_FromPyObjectNew so that it takes a pointer or similar that's fine.

colesbury avatar Jul 15 '24 16:07 colesbury

Ok I can port your old PR over to this one. I presume we're flushing the values to the stack before every escaping call?

Sorry I forgot that this may introduce another perf regression on the free-threaded build, even if I limit it just to that. So I'm putting a pause on this. I think we should discuss this on Wednesday.

Fidget-Spinner avatar Jul 15 '24 17:07 Fidget-Spinner

In general, stackrefs must be spilled to the in-memory stack around any escaping call. To do this automatically, we need to:

  1. Identify all escaping calls
  2. Track assignments to output values
  3. Raise an error if an escaping call occurs unless all or none of the output values have been assigned
  4. Generate the spill around the call.
    • If none of the output values have been defined, we just spill the stack pointer after popping the inputs.
    • If all of the output values have been defined, we save the outputs to memory and save the stack pointer before making the escaping call.

Identifying all escaping calls.

We have a whitelist; all other calls are escaping. The list needs updating, but it's mostly right.

Track assignments to output values

Assignments are easy to spot: name = .... Then all we need to do is check whether name is an output variable. Any assignments in nested code, or multiple assignments, should be treated as an error.

It is easy enough to change the code to move assignments out of branches, and we don't want the code generator to have to do flow analysis.

Generating spills around calls.

Once we have identified the escaping call, we will need to walk backwards and forwards around the call to identify the whole statement. Once we've done that, we need to emit the spill before and the reload after (if any reload is needed).

markshannon avatar Jul 19 '24 16:07 markshannon

In case that sounds too expensive, don't worry. It shouldn't be.

Spilling the stack pointer is cheap, as registers often need to be spilled across calls anyway. We need to save the output values to memory in most cases anyway; we are just doing it a little earlier.

Although escaping calls are common in bytecodes.c, they are not so common dynamically. Dynamically, only about 10% of instructions include an escaping call, and for many of those the additional cost is negligible for the reasons given above.

markshannon avatar Jul 19 '24 16:07 markshannon

@markshannon, that is not what we discussed and agreed to ~~yesterday~~ Wednesday. What we discussed was tracking PyStackRef_FromPyObjectNew calls from the code generator and spilling those immediately.

  • We do not need or want to track stackrefs in general for the GC, just PyStackRef_FromPyObjectNew
  • There's no point in looking for spilling calls and waiting to write the result of PyStackRef_FromPyObjectNew in multiple places.

colesbury avatar Jul 19 '24 16:07 colesbury

Tracking just PyStackRef_FromPyObjectNew will solve your immediate problem, but it doesn't help with a broader deferred reference counting implementation nor with top-of-stack caching.

I think what I proposed handles all the use cases, without a significant performance impact. Even if it turns out the performance impact is too high, the additional analysis will help make any solution more robust.

There's no point in looking for spilling calls and waiting to write the result of PyStackRef_FromPyObjectNew in multiple places.

Why would anything get written multiple times?

markshannon avatar Jul 20 '24 09:07 markshannon

I think the original goal was to not block TOS caching and full deferred refcounting. With the latest changes, this is now true. The added goal of laying the foundations for the two should be left to another PR. The responsibility of this PR, IMO, is to not burden stack caching and full deferred refcounting (which it should have achieved with the latest commit). Supporting more features is out of scope.

Fidget-Spinner avatar Jul 21 '24 12:07 Fidget-Spinner

@markshannon - I'm most concerned because I thought we were on the same page on this approach after our last meeting.

I don't think starting with this approach precludes future changes, like what you've outlined above. This PR is a bottleneck for a lot of the free-threading deferred reference counting work, so I'd appreciate it if we can figure out how to unblock it.

colesbury avatar Jul 24 '24 18:07 colesbury