cpython Introduce PyType_GetBaseByToken function

Reference implementation of the following C-API functinons:

PyType_GetBaseByToken()
~PyType_GetToken()~

Discussion: https://discuss.python.org/t/55598

📚 Documentation preview 📚: https://cpython-previews--121079.org.readthedocs.build/

Jun 27 '24 08:06 neonene

Regarding PyType_GetToken() vs PyType_GetSlot(), the following show performance differences on the experiment commit at 4fb9cf80a9e71539e0c01b579c4218c9149f78f1 (now removed), which were noisy:

from timeit import timeit
setup = (
    'import _testcapi, _datetime, _decimal;'
    'A = _testcapi.create_type_with_token("_testcapi.A", 0);'
    'm = (_datetime, _decimal)[0]'
)
depth = 10  # test with ncall chain rather than nloop in C level

t1 = timeit(stmt1 := f'm.test_perf_gettoken(A, {depth})', setup)
t2 = timeit(stmt2 := f'm.test_perf_getslot (A, {depth})', setup)
print(stmt1, t1)
print(stmt2, t2,  t2 / t1)

Win non-debug

module	func	inline	result
_datetime (builtin)	GetToken	yes	1.00
_datetime (builtin)	GetSlot	-	1.14x slower
_decimal	GetToken	-	1.12x slower
_decimal	GetSlot	-	1.16x slower

Win PGO

module	func	inline	result
_datetime (builtin)	GetToken	yes	1.00
_datetime (builtin)	GetSlot	yes	1.02x slower
_decimal	GetToken	-	1.19x slower
_decimal	GetSlot	-	1.22x slower

Jun 29 '24 00:06 neonene

The tp_bases can be used in a fallback implementation. I checked the overhead, adding a repeat function temporarily (100972db793a2cb545600289aa91fb856d862fb8):

from timeit import timeit
setup = f"""if 1:
    import _testcapi
    A = _testcapi.create_type_with_token("_testcapi.A", 0)
    tokenA = _testcapi.get_tp_token(A)
    class B(A): pass
    class C(B): pass
    class D(C): pass
    class E(D): pass
    getbase = _testcapi.repeat_getbasebytoken
"""
c_repeat = 10  # py_repeat = timeit default (1000000)
mro   = timeit(s1 := f'getbase(C, tokenA, {c_repeat}, True)', setup)
bases = timeit(s2 := f'getbase(C, tokenA, {c_repeat}, False)', setup)
print(s1, mro)
print(s2, bases,  bases / mro)

Win non-debug: (the higher, the slower)

find A from	starts with	run once in C	repeat 10 in C
mro	class A	1.00	1.00
bases	class A	1.00x	1.00x
mro	class B	1.00x	1.05x
bases	class B	1.01x	1.14x
mro	class C	1.01x	1.10x
bases	class C	1.04x	1.40x
mro	class E	1.02x	1.16x
bases	class E	1.08x	1.50x

Jul 03 '24 12:07 neonene

The commit 2e59344a90a65c65ec6bba43a79ed5138d0da427 added repeat functions to compare the performance with PyType_GetModuleByDef().

Find superclass A:

from timeit import timeit
setup = f"""if 1:
    import _testcapi
    A = _testcapi.create_type_with_token("_testcapi.A", 0)
    tokenA = _testcapi.get_tp_token(A)
    class B(A): pass
    class C(B): pass
    getbase    = _testcapi.repeat_getbasebytoken
    getslot    = _testcapi.repeat_getslot
    getmodonce = _testcapi.repeat_getmodonce
    getmodeach = _testcapi.repeat_getmodeach
"""
c_repeat = 10
r0 = timeit(s0 := f'getbase   (A, tokenA, {c_repeat})', setup)
r1 = timeit(s1 := f'getslot   (A, tokenA, {c_repeat})', setup)
r2 = timeit(s2 := f'getmodonce(A,      A, {c_repeat})', setup)
r3 = timeit(s3 := f'getmodeach(A,      A, {c_repeat})', setup)
r4 = timeit(s4 := f'getbase   (C, tokenA, {c_repeat})', setup)
r5 = timeit(s5 := f'getmodonce(C,      A, {c_repeat})', setup)
r6 = timeit(s6 := f'getmodeach(C,      A, {c_repeat})', setup)
print(s0, r0)
print(s1, r1,  r1 / r0)
print(s2, r2,  r2 / r0)
print(s3, r3,  r3 / r0)
print(s4, r4,  r4 / r0)
print(s5, r5,  r5 / r0)
print(s6, r6,  r6 / r0)

Win non-debug, non-PGO (the higher, the slower)

subclass: A	`GetModule` `ByDef()`	no repeat in C	repeat 10 in C
`GetBaseByToken(subclass, tokenA)`	-	1.00	1.00
`GetSlot(subclass) == tokenA`	-	1.00x	1.00x
`subclass == mod_state->A`	once	0.97x	0.76x
	for each	(ditto)	1.26x
subclass: C
`GetBaseByToken(subclass, tokenA)`	-	1.01x (1.00 )	1.16x (1.00 )
`IsSubtype(subclass, A)`	once	1.03x (1.01x)	1.04x (0.91x)
	for each	(ditto)	1.55x (1.35x)

UPDATE: The results ware a bit unfair because the overhead of PyLong_AsVoidPtr() was not added to PyType_GetModuleByDef's tests.

Jul 05 '24 09:07 neonene

I left some suggestions as a PR: https://github.com/neonene/cpython/pull/1

Jul 22 '24 12:07 encukou

Tested the performance again, which now takes into account the overhead of PyLong_AsVoidPtr(): 7735ef1fe2fa5f94215dbb6a56ccd1a4be5b0f9d[^1].

script (expand)

from timeit import timeit
setup = """if 1:
    import _testcapi
    A = _testcapi.create_type_with_token("_testcapi.A", 0)
    tokenA = _testcapi.get_tp_token(A)
    class B(A): pass
    class C(B): pass
    getbase    = _testcapi.repeat_getbasebytoken  # NULL
    getslot    = _testcapi.repeat_getslot
    getmodonce = _testcapi.repeat_getmodonce
    getmodeach = _testcapi.repeat_getmodeach
    issubtype  = _testcapi.repeat_issubtype
"""
c_repeat = 10
r0 = timeit(s0 := f'getbase   (A, tokenA, A, {c_repeat})', setup)
r1 = timeit(s1 := f'getslot   (A, tokenA, A, {c_repeat})', setup)
r2 = timeit(s2 := f'getmodonce(A, tokenA, A, {c_repeat})', setup)
r3 = timeit(s3 := f'getmodeach(A, tokenA, A, {c_repeat})', setup)
r4 = timeit(s4 := f'issubtype (A, tokenA, A, {c_repeat})', setup)
r5 = timeit(s5 := f'getbase   (C, tokenA, A, {c_repeat})', setup)
r6 = timeit(s6 := f'getmodonce(C, tokenA, A, {c_repeat})', setup)
r7 = timeit(s7 := f'getmodeach(C, tokenA, A, {c_repeat})', setup)
r8 = timeit(s8 := f'issubtype (C, tokenA, A, {c_repeat})', setup)
print(s0, r0 / r0)
print(s1, r1 / r0)
print(s2, r2 / r0)
print(s3, r3 / r0)
print(s4, r4 / r0)
print(s5, r5 / r0)
print(s6, r6 / r0)
print(s7, r7 / r0)
print(s8, r8 / r0)

Repeat n-times in a C function (Windows PGO, the higher, the slower):

find super-class A	n = 1	n = 10	n = 1	n = 10
subclass A
`GetBaseByToken` * n	(base)	(base)
`GetSlot` * n	1.00x	1.01x
`GetModuleByDef` + `CheckExact` * n	1.01x	0.92x
(`GetModuleByDef` + `CheckExact`) * n	1.00x	1.06x
`IsSubtype` * n	0.99x	0.97x
subclass C
`GetBaseByToken` * n	1.00x	1.13x	(base)	(base)
`GetModuleByDef` + `IsSubtype` * n	1.02x	1.08x	1.04x	0.97x
(`GetModuleByDef` + `IsSubtype`) * n	1.02x	1.34x	1.04x	1.21x
`IsSubtype` * n	1.01x	1.03x	0.99x	0.95x

[^1]: UPDATE: || may be better in "if (!token && !super)", but both should be OK unless token or super are optimized away before reaching the branch.

Aug 27 '24 08:08 neonene