
copy.copy and copy.deepcopy scale poorly with free-threading

Open eendebakpt opened this issue 9 months ago • 12 comments

The copy.copy and copy.deepcopy operations running on different objects in different threads should be able to run independently, but the scaling is poor. Output of ftscalingbench.py with added benchmarks:

Running benchmarks with 8 threads
shallow_copy               4.6x slower
deepcopy                   2.3x slower
object_cfunction           5.9x faster
cmodule_function           5.9x faster
object_lookup_special      5.5x faster
mult_constant              5.6x faster
generator                  4.9x faster
pymethod                   5.2x faster
pyfunction                 4.5x faster
module_function            5.0x faster
load_string_const          5.6x faster
load_tuple_const           5.9x faster
create_pyobject            2.3x faster
create_closure             7.1x faster
create_dict                3.3x faster
thread_local_read          3.3x faster
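For reference, the shape of such a scaling benchmark can be sketched in a few lines (this is not the actual ftscalingbench.py; the workload and counts are illustrative): each thread runs an identical, independent workload, and the multi-thread wall-clock time is compared against the single-thread time.

```python
import copy
import threading
import time

def workload(iterations):
    # Each thread copies its own objects; in principle there is no shared
    # state, so N threads should take about as long as one thread.
    obj = {"a": [1, 2, 3], "b": "text"}
    for _ in range(iterations):
        copy.copy(obj)

def timed_run(num_threads, iterations=10_000):
    threads = [threading.Thread(target=workload, args=(iterations,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

baseline = timed_run(1)
parallel = timed_run(8)
# On a GIL build the threads serialize; on a free-threaded build they can
# run in parallel, and parallel / baseline measures how well they scale.
```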

There are at least two reasons:

  • The copy module uses module-level variables in the copy.copy and copy.deepcopy functions.
  • _copy_atomic_types (and some similar data structures) is a set, and membership testing on a set requires locking.
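The fast path in question looks roughly like this (a simplified sketch of copy.copy, not the real copy.py source): every call does a membership test against a shared module-level set, so on the free-threaded build all threads end up touching the same objects.

```python
# Simplified sketch of copy.copy's fast path (not the actual copy.py source).
_copy_atomic_types = {int, float, bool, complex, bytes, str, type,
                      range, property}

def copy_sketch(x):
    cls = type(x)
    # Membership test on a shared module-level set: this line is where the
    # cross-thread locking and refcount contention comes from.
    if cls in _copy_atomic_types:
        return x  # atomic/immutable: returning the object itself is a copy
    if cls in (list, dict, set, bytearray):
        return cls.copy(x)  # builtin containers: use the type's copy method
    raise TypeError(f"sketch does not handle {cls.__name__}")
```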

Linked PRs

  • gh-132658
  • gh-132659
  • gh-136107
  • gh-132290
  • gh-138429
  • gh-141183
  • gh-141772
  • gh-141773
  • gh-142733
  • gh-142735
  • gh-142843

eendebakpt avatar Apr 17 '25 19:04 eendebakpt

I have no good approach yet for improving the performance of the module-level variable accesses. This diff

diff --git a/Lib/copy.py b/Lib/copy.py
index c64fc076179..084b2873b65 100644
--- a/Lib/copy.py
+++ b/Lib/copy.py
@@ -67,7 +67,8 @@ def copy(x):

     cls = type(x)

-    if cls in _copy_atomic_types:
+    _local_atomic_types =  {int, float, bool, complex, bytes, str, type, range, property}
+    if cls in _local_atomic_types or cls in _atomic_types:
         return x
     if cls in _copy_builtin_containers:
         return cls.copy(x)

improves the scaling of copy.deepcopy a lot, but the solution is not very elegant and does not work for all test cases.
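A more conventional variant of the same trick is binding the lookup table once as a default argument, so the hot path reads a fast local instead of a module global and the set is not rebuilt on every call (again only a sketch; the names are illustrative):

```python
def copy_fast(x, _atomic=frozenset({int, float, bool, complex, bytes, str,
                                    type, range, property})):
    # _atomic is evaluated once at function-definition time and read as a
    # local, avoiding both the per-call set construction in the diff above
    # and the module-global lookup in the original code.
    if type(x) in _atomic:
        return x
    raise TypeError(f"sketch only handles atomic types, got {type(x).__name__}")
```

Note this only removes the module-namespace lookup; the frozenset itself is still shared between threads, so reference-count contention on the set object can remain.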

eendebakpt avatar Apr 17 '25 20:04 eendebakpt

The copy module uses module level variables in the copy.copy and copy.deepcopy methods

We've discussed making module level variables use deferred reference counting, which would address this problem, but haven't decided exactly how we want to implement it. (All global variables? Only some frequently accessed ones?)

colesbury avatar Apr 18 '25 14:04 colesbury

Ok, I will leave this open until deferred reference counting is in place. I tried making the relevant variables (e.g. copy._copy_atomic_types) immortal, but that did not seem to improve performance.

eendebakpt avatar Apr 21 '25 20:04 eendebakpt

There is a bit more going on than just refcount contention. The code

s = frozenset({1, 2, 3})
z in s

scales poorly in the FT build.

  • The first reason is that z in s is specialized to _CONTAINS_OP_SET, which calls _PySet_Contains (here), which takes a lock

https://github.com/python/cpython/blob/0d76dccc3b4376ba075a1737f58809e3d83aaaa3/Objects/setobject.c#L2235-L2242

  • A second reason is possible refcount contention in set_lookkey

https://github.com/python/cpython/blob/0d76dccc3b4376ba075a1737f58809e3d83aaaa3/Objects/setobject.c#L103-L110

The increment is there because startkey is a borrowed reference and the rich compare might remove startkey from the set. For a frozenset this cannot happen, so we can skip the incref/decref. For a frozenset we can also skip the guard if (table != so->table || entry->key != startkey).
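The hazard the incref and guard protect against can be shown in pure Python: a key's rich comparison may mutate the set mid-lookup, so the entry being examined can stop being owned by the set. (The class below is purely illustrative.)

```python
class Chaotic:
    """An object whose __eq__ empties the set it is being looked up in."""

    def __init__(self, target):
        self.target = target

    def __hash__(self):
        return hash(1)  # collide with the key 1 so the rich compare runs

    def __eq__(self, other):
        self.target.clear()  # mutate the set during the membership test
        return other == 1

s = {1, 2, 3}
probe = Chaotic(s)
result = probe in s  # the interpreter's guard restarts the lookup safely...
assert len(s) == 0   # ...but the set was emptied mid-lookup

# A frozenset has no mutating methods, so its entries cannot disappear while
# a lookup's rich comparison runs: that is why the incref/decref and the
# re-check guard are unnecessary for frozensets.
```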

To address the issues we can either

  • Make dedicated methods for frozenset (this requires some duplication of code and a new _CONTAINS_OP_FROZENSET bytecode)
  • Add a quick check using PyFrozenSet_CheckExact at the appropriate places and act accordingly.

@colesbury @markshannon Any opinion on which way to go?

eendebakpt avatar Jun 26 '25 14:06 eendebakpt

I think adding a code path for PyFrozenSet_CheckExact that avoids the locking and incref/decref makes sense.

colesbury avatar Jun 26 '25 15:06 colesbury

The copy module uses module level variables in the copy.copy and copy.deepcopy methods

We've discussed making module level variables use deferred reference counting, which would address this problem, but haven't decided exactly how we want to implement it. (All global variables? Only some frequently accessed ones?)

@colesbury Are there any downsides to making module level attributes use deferred reference counting? With deferred reference counting objects can take longer to be garbage collected, but that does not seem an issue for module level attributes (as long as a module is not reloaded, the module attributes will remain in memory).
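The trade-off being asked about can be illustrated with the analogous situation that exists today for reference cycles: an object whose last reference is dropped dies immediately under plain reference counting, while an object kept alive by a cycle is only reclaimed by a collector pass, which is the same delayed-reclamation behavior a deferred-refcount module attribute would have after being replaced.

```python
import gc
import weakref

class Node:
    pass

# Plain reference counting: the object dies as soon as the last reference goes.
obj = Node()
ref = weakref.ref(obj)
del obj
assert ref() is None

# A reference cycle: reclamation waits for the garbage collector, much as a
# deferred-refcount object would wait for a GC pass after being overwritten.
a, b = Node(), Node()
a.other, b.other = b, a
cyc = weakref.ref(a)
del a, b
assert cyc() is not None  # still alive: refcounts are nonzero
gc.collect()
assert cyc() is None      # reclaimed only by the collector
```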

eendebakpt avatar Nov 15 '25 21:11 eendebakpt

The copy module uses module level variables in the copy.copy and copy.deepcopy methods

We've discussed making module level variables use deferred reference counting, which would address this problem, but haven't decided exactly how we want to implement it. (All global variables? Only some frequently accessed ones?)

Module-level functions now have deferred reference counting by default (I could not find the PR that enabled this), but module-level attributes do not.

Test script:

import copy
from _testinternalcapi import has_deferred_refcount
from _testcapi import is_immortal

def dummy():
    pass

print(f'{dummy}: deferred_refcount {has_deferred_refcount(dummy)} immortal {is_immortal(dummy)}' )

for item in [copy.copy, copy.deepcopy, copy._deepcopy_list, copy._deepcopy_dispatch]:
    print(f'{item}: deferred_refcount {has_deferred_refcount(item)} immortal {is_immortal(item)}' )

Output

<function dummy at 0x3968ca50780>: deferred_refcount True immortal False
<function copy at 0x3968ca55a00>: deferred_refcount True immortal False
<function deepcopy at 0x3968ca54e00>: deferred_refcount True immortal False
<function _deepcopy_list at 0x3968ca50240>: deferred_refcount True immortal False
{<class 'list'>: <function _deepcopy_list at 0x3968ca50240>, <class 'tuple'>: <function _deepcopy_tuple at 0x3968ca55b80>, <class 'dict'>: <function _deepcopy_dict at 0x3968ca55c40>, <class 'method'>: <function _deepcopy_method at 0x3968ca55d00>}: deferred_refcount False immortal False

@colesbury Is there any reason not to make module-level attributes use deferred reference counting? Situations where module-level variables are replaced frequently seem rare.

By adding these lines to copy.py the FT scaling benchmark for copy.copy improves significantly:

import _testcapi
_testcapi.pyobject_enable_deferred_refcount(_copy_atomic_types)
_testcapi.pyobject_enable_deferred_refcount(_copy_builtin_containers)

Main:

shallow_copy               1.1x faster
shallow_copy_atomic        1.1x faster

With extra lines

shallow_copy               5.1x faster
shallow_copy_atomic        4.0x faster

There already is sys.is_immortal; we could add sys.enable_deferred_refcount/sys.has_deferred_refcount (or private versions) and use that to give copy._copy_atomic_types deferred reference counting.

eendebakpt avatar Nov 16 '25 20:11 eendebakpt

@colesbury Perhaps this can be closed now? The results below are for 8 threads after merging https://github.com/python/cpython/pull/132290. The modified ftscaling script is linked from that PR. Maybe we want to revise the benchmark before closing this.

FT scaling benchmark       3.14.2t       main
dict_contains              7.6x faster   7.9x faster
tuple_contains             7.7x faster   7.8x faster
list_contains              7.6x faster   7.6x faster
frozenset_contains         7.5x faster   7.8x faster
frozenset_contains_dunder  7.7x faster   7.8x faster
set_contains               4.8x slower   7.5x faster
set_contains_dunder        3.4x slower   7.7x faster
shallow_copy               2.3x slower   1.2x faster
deepcopy                   1.6x slower   3.0x faster

nascheme avatar Dec 15 '25 19:12 nascheme

@nascheme - that seems improved, but it looks like there are still scaling bottlenecks. I guess they're from reference count contention when loading the same global variable in multiple threads?

colesbury avatar Dec 15 '25 19:12 colesbury

@nascheme @colesbury There is still contention on the module level attributes. By addressing that the scaling can be improved:

Main:

shallow_copy_atomic_type   1.6x slower
shallow_copy_list          1.2x faster
deepcopy                   2.6x faster
deepcopy_dataclass         3.3x faster

PR

shallow_copy_atomic_type   5.8x faster
shallow_copy_list          4.7x faster
deepcopy                   4.4x faster
deepcopy_dataclass         3.0x faster

This improvement is by adding the following 5 lines to the copy module:

import _testcapi
_testcapi.pyobject_enable_deferred_refcount(_copy_atomic_types)
_testcapi.pyobject_enable_deferred_refcount(_copy_builtin_containers)
_testcapi.pyobject_enable_deferred_refcount(_atomic_types)
_testcapi.pyobject_enable_deferred_refcount(_deepcopy_dispatch)

I do not really like the solution, as we could end up with many of these statements in the codebase. (I would be in favor of adding pyobject_enable_deferred_refcount to sys, but there is already issue https://github.com/python/cpython/issues/134819 for that.)

We could also make all module level attributes use deferred reference counting. A branch to test this is

https://github.com/python/cpython/compare/main...eendebakpt:cpython:module_attributes_deferred_ref_counting

(the implementation needs some work though)

eendebakpt avatar Dec 15 '25 22:12 eendebakpt

Maybe a crazy idea, but what about something like the following: when we specialize to LOAD_GLOBAL_MODULE, we also enable deferred reference counting for the value.

--- a/Python/specialize.c
+++ b/Python/specialize.c
@@ -1305,6 +1305,13 @@ specialize_load_global_lock_held(
             SPECIALIZATION_FAIL(LOAD_GLOBAL, SPEC_FAIL_OUT_OF_RANGE);
             goto fail;
         }
+#ifdef Py_GIL_DISABLED
+        PyObject *value;
+        if (PyDict_GetItemRef(globals, name, &value) == 1) {
+            PyUnstable_Object_EnableDeferredRefcount(value);
+            Py_DECREF(value);
+        }
+#endif
         cache->index = (uint16_t)index;
         cache->module_keys_version = (uint16_t)keys_version;
         specialize(instr, LOAD_GLOBAL_MODULE);

With this change, the copy/deepcopy benchmarks scale quite a bit better:

shallow_copy               5.6x faster
deepcopy                   6.2x faster

A better change, still quick and dirty: https://github.com/python/cpython/pull/142843. It seems we only need to do it for frozensets, at least to help the copy scaling.

nascheme avatar Dec 16 '25 19:12 nascheme

I think something like that makes sense. The downside is that it means that the value won't get freed immediately if someone overwrites the global variable -- it will only get freed during GC. We might want to limit it to globals that are accessed by another thread.

@nascheme - do you want to open a PR that we can use as a basis for discussion and investigation?

colesbury avatar Dec 16 '25 22:12 colesbury