copy.copy and copy.deepcopy scale poorly with free-threading
The `copy.copy` and `copy.deepcopy` operations, when running on different objects in different threads, should be able to run independently, but the scaling is poor. Output of ftscalingbench.py with the added benchmarks:
```
Running benchmarks with 8 threads
shallow_copy 4.6x slower
deepcopy 2.3x slower
object_cfunction 5.9x faster
cmodule_function 5.9x faster
object_lookup_special 5.5x faster
mult_constant 5.6x faster
generator 4.9x faster
pymethod 5.2x faster
pyfunction 4.5x faster
module_function 5.0x faster
load_string_const 5.6x faster
load_tuple_const 5.9x faster
create_pyobject 2.3x faster
create_closure 7.1x faster
create_dict 3.3x faster
thread_local_read 3.3x faster
```
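For context, a scaling benchmark of this kind runs the same workload in N threads and compares against a single thread. A minimal sketch (hypothetical; the real ftscalingbench.py is more elaborate and pins iteration counts per thread):

```python
# Minimal sketch of a free-threading scaling micro-benchmark.
# On a free-threaded build with perfect scaling, the 8-thread run takes
# roughly the same wall time as the 1-thread run; "4.6x slower" means the
# multi-threaded run takes much longer per thread instead.
import copy
import threading
import time

WORK_ITEMS = 10_000

def shallow_copy_work():
    obj = [1, 2.0, "three"]
    for _ in range(WORK_ITEMS):
        copy.copy(obj)

def run(num_threads):
    threads = [threading.Thread(target=shallow_copy_work)
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

single = run(1)
multi = run(8)
print(f"1 thread: {single:.3f}s, 8 threads: {multi:.3f}s")
```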
There are at least two reasons:
- The `copy` module uses module-level variables in the `copy.copy` and `copy.deepcopy` methods.
- `_copy_atomic_types` (and some similar data structures) is a `set`, which requires locking for membership testing.
Linked PRs
- gh-132658
- gh-132659
- gh-136107
- gh-132290
- gh-138429
- gh-141183
- gh-141772
- gh-141773
- gh-142733
- gh-142735
- gh-142843
I have no good approach yet to improve performance of the module level variables. This diff
```diff
diff --git a/Lib/copy.py b/Lib/copy.py
index c64fc076179..084b2873b65 100644
--- a/Lib/copy.py
+++ b/Lib/copy.py
@@ -67,7 +67,8 @@ def copy(x):
     cls = type(x)
-    if cls in _copy_atomic_types:
+    _local_atomic_types = {int, float, bool, complex, bytes, str, type, range, property}
+    if cls in _local_atomic_types or cls in _atomic_types:
         return x
     if cls in _copy_builtin_containers:
         return cls.copy(x)
```
improves the scaling of copy.deepcopy a lot, but the solution is not very elegant and does not work for all test cases.
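The idea behind that diff can be illustrated outside the copy module: a lookup table created inside the function is not a shared object, so membership tests on it do not contend on one set's per-object state. A toy sketch (`copy_fast` and its table are made up for illustration; they are not part of the copy module):

```python
import copy

def copy_fast(x):
    # Hypothetical variant of copy.copy: the atomic-type table is a
    # function-local set (as in the diff above), so concurrent threads do
    # not hammer a single shared module-level set object.
    _local_atomic_types = {int, float, bool, complex, bytes, str,
                           type, range, property}
    cls = type(x)
    if cls in _local_atomic_types:
        # Atomic/immutable types: returning the object itself is a valid copy.
        return x
    # Fall back to the real implementation for everything else.
    return copy.copy(x)
```

The downside, as noted above, is duplicating the table and rebuilding it on every call, which is why this is not an elegant general fix.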
> The copy module uses module level variables in the copy.copy and copy.deepcopy methods
>
> We've discussed making module level variables use deferred reference counting, which would address this problem, but haven't decided exactly how we want to implement it. (All global variables? Only some frequently accessed ones?)
Ok, I will leave this open until deferred reference counting is in place. I tried making the relevant variables (e.g. `copy._copy_atomic_types`) immortal, but that did not seem to improve performance.
There is a bit more going on than just refcount contention. The code
```python
s = frozenset({1, 2, 3})
z in s
```
scales poorly in the FT build.
- The reason is that `z in s` is specialized to `_CONTAINS_OP_SET`, which calls `_PySet_Contains`, which takes a lock:
  https://github.com/python/cpython/blob/0d76dccc3b4376ba075a1737f58809e3d83aaaa3/Objects/setobject.c#L2235-L2242
- A second reason is that there might be refcount contention in `set_lookkey`:
  https://github.com/python/cpython/blob/0d76dccc3b4376ba075a1737f58809e3d83aaaa3/Objects/setobject.c#L103-L110
  The increment is there because `startkey` is a borrowed reference and the rich compare might remove `startkey` from the set. For a frozenset this cannot happen, so we can skip the incref/decref. For a frozenset we can also skip the guard `if (table != so->table || entry->key != startkey)`.
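A pure-Python illustration of the hazard that guard protects against: a user-defined `__eq__` can mutate the set while `set_lookkey` is in the middle of probing it (the class name here is made up for illustration). A frozenset can never be mutated, so its lookups cannot be invalidated this way, which is why the guard and the protective incref/decref are unnecessary for frozensets:

```python
class MutatesDuringLookup:
    """An object whose __eq__ mutates the set it is being probed against."""

    def __init__(self, victim):
        self.victim = victim

    def __hash__(self):
        return hash(1)  # collide with the stored key 1

    def __eq__(self, other):
        self.victim.clear()  # mutate the set mid-lookup
        return other == 1

s = {1, 2, 3}
probe = MutatesDuringLookup(s)
# The membership test must not crash even though __eq__ emptied the set
# while set_lookkey held a borrowed reference to the stored key.
result = probe in s
print(result, len(s))
```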
To address the issues we can either:

- Make dedicated methods for frozenset (this requires some duplication of code and a new `_CONTAINS_OP_FROZENSET` bytecode), or
- Add a quick check using `PyFrozenSet_CheckExact` at the appropriate places and act accordingly.
@colesbury @markshannon Any opinion on which way to go?
I think adding a code path for PyFrozenSet_CheckExact that avoids the locking and incref/decref makes sense.
> The copy module uses module level variables in the copy.copy and copy.deepcopy methods
>
> We've discussed making module level variables use deferred reference counting, which would address this problem, but haven't decided exactly how we want to implement it. (All global variables? Only some frequently accessed ones?)
@colesbury Are there any downsides to making module level attributes use deferred reference counting? With deferred reference counting objects can take longer to be garbage collected, but that does not seem an issue for module level attributes (as long as a module is not reloaded, the module attributes will remain in memory).
> The copy module uses module level variables in the copy.copy and copy.deepcopy methods
>
> We've discussed making module level variables use deferred reference counting, which would address this problem, but haven't decided exactly how we want to implement it. (All global variables? Only some frequently accessed ones?)
Module-level functions now have deferred reference counting by default (I could not find the PR enabling this), but other module-level attributes are not deferred by default.
Test script:
```python
import copy
from _testinternalcapi import has_deferred_refcount
from _testcapi import is_immortal

def dummy():
    pass

print(f'{dummy}: deferred_refcount {has_deferred_refcount(dummy)} immortal {is_immortal(dummy)}')

for item in [copy.copy, copy.deepcopy, copy._deepcopy_list, copy._deepcopy_dispatch]:
    print(f'{item}: deferred_refcount {has_deferred_refcount(item)} immortal {is_immortal(item)}')
```
Output
```
<function dummy at 0x3968ca50780>: deferred_refcount True immortal False
<function copy at 0x3968ca55a00>: deferred_refcount True immortal False
<function deepcopy at 0x3968ca54e00>: deferred_refcount True immortal False
<function _deepcopy_list at 0x3968ca50240>: deferred_refcount True immortal False
{<class 'list'>: <function _deepcopy_list at 0x3968ca50240>, <class 'tuple'>: <function _deepcopy_tuple at 0x3968ca55b80>, <class 'dict'>: <function _deepcopy_dict at 0x3968ca55c40>, <class 'method'>: <function _deepcopy_method at 0x3968ca55d00>}: deferred_refcount False immortal False
```
@colesbury Is there any reason for not making the module level attributes use deferred reference counting? Situations where global level variables are replaced often seem rare.
By adding these lines to copy.py the FT scaling benchmark for copy.copy improves significantly:
```python
import _testcapi
_testcapi.pyobject_enable_deferred_refcount(_copy_atomic_types)
_testcapi.pyobject_enable_deferred_refcount(_copy_builtin_containers)
```
Main:

```
shallow_copy 1.1x faster
shallow_copy_atomic 1.1x faster
```

With the extra lines:

```
shallow_copy 5.1x faster
shallow_copy_atomic 4.0x faster
```
There already is `sys.is_immortal`; we could add `sys.enable_deferred_refcount`/`sys.has_deferred_refcount` (or private versions) and use those to give `copy._copy_atomic_types` deferred reference counting.
@colesbury Perhaps this can be closed now? Results below are for 8 threads after merging https://github.com/python/cpython/pull/132290 . The modified ftscaling script is linked from that PR. Maybe we want to revise the benchmark before closing this.
| FT scaling benchmark | 3.14.2t | main |
|---|---|---|
| dict_contains | 7.6x faster | 7.9x faster |
| tuple_contains | 7.7x faster | 7.8x faster |
| list_contains | 7.6x faster | 7.6x faster |
| frozenset_contains | 7.5x faster | 7.8x faster |
| frozenset_contains_dunder | 7.7x faster | 7.8x faster |
| set_contains | 4.8x slower | 7.5x faster |
| set_contains_dunder | 3.4x slower | 7.7x faster |
| shallow_copy | 2.3x slower | 1.2x faster |
| deepcopy | 1.6x slower | 3.0x faster |
@nascheme - that seems improved, but it looks like there are still scaling bottlenecks. I guess they're from reference count contention when loading the same global variable in multiple threads?
@nascheme @colesbury There is still contention on the module level attributes. By addressing that the scaling can be improved:
Main:

```
shallow_copy_atomic_type 1.6x slower
shallow_copy_list 1.2x faster
deepcopy 2.6x faster
deepcopy_dataclass 3.3x faster
```

PR:

```
shallow_copy_atomic_type 5.8x faster
shallow_copy_list 4.7x faster
deepcopy 4.4x faster
deepcopy_dataclass 3.0x faster
```
This improvement comes from adding the following five lines to the copy module:

```python
import _testcapi
_testcapi.pyobject_enable_deferred_refcount(_copy_atomic_types)
_testcapi.pyobject_enable_deferred_refcount(_copy_builtin_containers)
_testcapi.pyobject_enable_deferred_refcount(_atomic_types)
_testcapi.pyobject_enable_deferred_refcount(_deepcopy_dispatch)
```
I do not really like the solution, as we could end up with many of these statements in the codebase. (I would be in favor of adding `pyobject_enable_deferred_refcount` to `sys`, but there is already issue https://github.com/python/cpython/issues/134819 for that.)
We could also make all module level attributes use deferred reference counting. A branch to test this is
https://github.com/python/cpython/compare/main...eendebakpt:cpython:module_attributes_deferred_ref_counting
(the implementation needs some work though)
Maybe a crazy idea, but what about something like the following: when we specialize to `LOAD_GLOBAL_MODULE`, we also enable deferred reference counting for the value.
```diff
--- a/Python/specialize.c
+++ b/Python/specialize.c
@@ -1305,6 +1305,13 @@ specialize_load_global_lock_held(
         SPECIALIZATION_FAIL(LOAD_GLOBAL, SPEC_FAIL_OUT_OF_RANGE);
         goto fail;
     }
+#ifdef Py_GIL_DISABLED
+    PyObject *value;
+    if (PyDict_GetItemRef(globals, name, &value) == 1) {
+        PyUnstable_Object_EnableDeferredRefcount(value);
+        Py_DECREF(value);
+    }
+#endif
     cache->index = (uint16_t)index;
     cache->module_keys_version = (uint16_t)keys_version;
     specialize(instr, LOAD_GLOBAL_MODULE);
```
With this change, the copy/deepcopy benchmarks scale quite a bit better:
```
shallow_copy 5.6x faster
deepcopy 6.2x faster
```
Better change, still quick and dirty:
https://github.com/python/cpython/pull/142843
It seems we only need to do it for frozensets, at least to help the copy scaling.
I think something like that makes sense. The downside is that the value won't get freed immediately if someone overwrites the global variable; it will only get freed during GC. We might want to limit it to globals that are accessed by another thread.
@nascheme - do you want to open a PR that we can use as a basis for discussion and investigation?