Introduce PyType_GetBaseByToken function
Reference implementation of the following C-API functinons:
PyType_GetBaseByToken()- ~
PyType_GetToken()~
Discussion: https://discuss.python.org/t/55598
📚 Documentation preview 📚: https://cpython-previews--121079.org.readthedocs.build/
Regarding PyType_GetToken() vs PyType_GetSlot(), the following show performance differences on the experiment commit at 4fb9cf80a9e71539e0c01b579c4218c9149f78f1 (now removed), which were noisy:
from timeit import timeit
setup = (
'import _testcapi, _datetime, _decimal;'
'A = _testcapi.create_type_with_token("_testcapi.A", 0);'
'm = (_datetime, _decimal)[0]'
)
depth = 10 # test with ncall chain rather than nloop in C level
t1 = timeit(stmt1 := f'm.test_perf_gettoken(A, {depth})', setup)
t2 = timeit(stmt2 := f'm.test_perf_getslot (A, {depth})', setup)
print(stmt1, t1)
print(stmt2, t2, t2 / t1)
- Win non-debug
| module | func | inline | result |
|---|---|---|---|
| _datetime (builtin) | GetToken | yes | 1.00 |
| _datetime (builtin) | GetSlot | - | 1.14x slower |
| _decimal | GetToken | - | 1.12x slower |
| _decimal | GetSlot | - | 1.16x slower |
- Win PGO
| module | func | inline | result |
|---|---|---|---|
| _datetime (builtin) | GetToken | yes | 1.00 |
| _datetime (builtin) | GetSlot | yes | 1.02x slower |
| _decimal | GetToken | - | 1.19x slower |
| _decimal | GetSlot | - | 1.22x slower |
The tp_bases can be used in a fallback implementation. I checked the overhead, adding a repeat function temporarily (100972db793a2cb545600289aa91fb856d862fb8):
from timeit import timeit
setup = f"""if 1:
import _testcapi
A = _testcapi.create_type_with_token("_testcapi.A", 0)
tokenA = _testcapi.get_tp_token(A)
class B(A): pass
class C(B): pass
class D(C): pass
class E(D): pass
getbase = _testcapi.repeat_getbasebytoken
"""
c_repeat = 10 # py_repeat = timeit default (1000000)
mro = timeit(s1 := f'getbase(C, tokenA, {c_repeat}, True)', setup)
bases = timeit(s2 := f'getbase(C, tokenA, {c_repeat}, False)', setup)
print(s1, mro)
print(s2, bases, bases / mro)
Win non-debug: (the higher, the slower)
| find A from |
starts with |
run once in C |
repeat 10 in C |
|---|---|---|---|
| mro | class A | 1.00 | 1.00 |
| bases | class A | 1.00x | 1.00x |
| mro | class B | 1.00x | 1.05x |
| bases | class B | 1.01x | 1.14x |
| mro | class C | 1.01x | 1.10x |
| bases | class C | 1.04x | 1.40x |
| mro | class E | 1.02x | 1.16x |
| bases | class E | 1.08x | 1.50x |
The commit 2e59344a90a65c65ec6bba43a79ed5138d0da427 added repeat functions to compare the performance with PyType_GetModuleByDef().
Find superclass A:
from timeit import timeit
setup = f"""if 1:
import _testcapi
A = _testcapi.create_type_with_token("_testcapi.A", 0)
tokenA = _testcapi.get_tp_token(A)
class B(A): pass
class C(B): pass
getbase = _testcapi.repeat_getbasebytoken
getslot = _testcapi.repeat_getslot
getmodonce = _testcapi.repeat_getmodonce
getmodeach = _testcapi.repeat_getmodeach
"""
c_repeat = 10
r0 = timeit(s0 := f'getbase (A, tokenA, {c_repeat})', setup)
r1 = timeit(s1 := f'getslot (A, tokenA, {c_repeat})', setup)
r2 = timeit(s2 := f'getmodonce(A, A, {c_repeat})', setup)
r3 = timeit(s3 := f'getmodeach(A, A, {c_repeat})', setup)
r4 = timeit(s4 := f'getbase (C, tokenA, {c_repeat})', setup)
r5 = timeit(s5 := f'getmodonce(C, A, {c_repeat})', setup)
r6 = timeit(s6 := f'getmodeach(C, A, {c_repeat})', setup)
print(s0, r0)
print(s1, r1, r1 / r0)
print(s2, r2, r2 / r0)
print(s3, r3, r3 / r0)
print(s4, r4, r4 / r0)
print(s5, r5, r5 / r0)
print(s6, r6, r6 / r0)
Win non-debug, non-PGO (the higher, the slower)
| subclass: A | GetModuleByDef() |
no repeat in C |
repeat 10 in C |
|---|---|---|---|
GetBaseByToken(subclass, tokenA) |
- | 1.00 | 1.00 |
GetSlot(subclass) == tokenA |
- | 1.00x | 1.00x |
subclass == mod_state->A |
once | 0.97x | 0.76x |
| for each | (ditto) | 1.26x | |
| subclass: C | |||
GetBaseByToken(subclass, tokenA) |
- | 1.01x (1.00 ) |
1.16x (1.00 ) |
IsSubtype(subclass, A) |
once | 1.03x (1.01x) |
1.04x (0.91x) |
| for each | (ditto) | 1.55x (1.35x) |
UPDATE: The results ware a bit unfair because the overhead of PyLong_AsVoidPtr() was not added to PyType_GetModuleByDef's tests.
I left some suggestions as a PR: https://github.com/neonene/cpython/pull/1
Tested the performance again, which now takes into account the overhead of PyLong_AsVoidPtr(): 7735ef1fe2fa5f94215dbb6a56ccd1a4be5b0f9d[^1].
script (expand)
from timeit import timeit
setup = """if 1:
import _testcapi
A = _testcapi.create_type_with_token("_testcapi.A", 0)
tokenA = _testcapi.get_tp_token(A)
class B(A): pass
class C(B): pass
getbase = _testcapi.repeat_getbasebytoken # NULL
getslot = _testcapi.repeat_getslot
getmodonce = _testcapi.repeat_getmodonce
getmodeach = _testcapi.repeat_getmodeach
issubtype = _testcapi.repeat_issubtype
"""
c_repeat = 10
r0 = timeit(s0 := f'getbase (A, tokenA, A, {c_repeat})', setup)
r1 = timeit(s1 := f'getslot (A, tokenA, A, {c_repeat})', setup)
r2 = timeit(s2 := f'getmodonce(A, tokenA, A, {c_repeat})', setup)
r3 = timeit(s3 := f'getmodeach(A, tokenA, A, {c_repeat})', setup)
r4 = timeit(s4 := f'issubtype (A, tokenA, A, {c_repeat})', setup)
r5 = timeit(s5 := f'getbase (C, tokenA, A, {c_repeat})', setup)
r6 = timeit(s6 := f'getmodonce(C, tokenA, A, {c_repeat})', setup)
r7 = timeit(s7 := f'getmodeach(C, tokenA, A, {c_repeat})', setup)
r8 = timeit(s8 := f'issubtype (C, tokenA, A, {c_repeat})', setup)
print(s0, r0 / r0)
print(s1, r1 / r0)
print(s2, r2 / r0)
print(s3, r3 / r0)
print(s4, r4 / r0)
print(s5, r5 / r0)
print(s6, r6 / r0)
print(s7, r7 / r0)
print(s8, r8 / r0)
Repeat n-times in a C function (Windows PGO, the higher, the slower):
| find super-class A | n = 1 | n = 10 | n = 1 | n = 10 |
|---|---|---|---|---|
| subclass A | ||||
GetBaseByToken * n |
(base) | (base) | ||
GetSlot * n |
1.00x | 1.01x | ||
GetModuleByDef + CheckExact * n |
1.01x | 0.92x | ||
(GetModuleByDef + CheckExact) * n |
1.00x | 1.06x | ||
IsSubtype * n |
0.99x | 0.97x | ||
| subclass C | ||||
GetBaseByToken * n |
1.00x | 1.13x | (base) | (base) |
GetModuleByDef + IsSubtype * n |
1.02x | 1.08x | 1.04x | 0.97x |
(GetModuleByDef + IsSubtype) * n |
1.02x | 1.34x | 1.04x | 1.21x |
IsSubtype * n |
1.01x | 1.03x | 0.99x | 0.95x |
[^1]: UPDATE: || may be better in "if (!token && !super)", but both should be OK unless token or super are optimized away before reaching the branch.