cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Introduce PyType_GetBaseByToken function

Open neonene opened this issue 1 year ago • 2 comments

Reference implementation of the following C-API functinons:

  • PyType_GetBaseByToken()
  • ~PyType_GetToken()~

Discussion: https://discuss.python.org/t/55598


📚 Documentation preview 📚: https://cpython-previews--121079.org.readthedocs.build/

neonene avatar Jun 27 '24 08:06 neonene

Regarding PyType_GetToken() vs PyType_GetSlot(), the following show performance differences on the experiment commit at 4fb9cf80a9e71539e0c01b579c4218c9149f78f1 (now removed), which were noisy:

from timeit import timeit
setup = (
    'import _testcapi, _datetime, _decimal;'
    'A = _testcapi.create_type_with_token("_testcapi.A", 0);'
    'm = (_datetime, _decimal)[0]'
)
depth = 10  # test with ncall chain rather than nloop in C level

t1 = timeit(stmt1 := f'm.test_perf_gettoken(A, {depth})', setup)
t2 = timeit(stmt2 := f'm.test_perf_getslot (A, {depth})', setup)
print(stmt1, t1)
print(stmt2, t2,  t2 / t1)
  • Win non-debug
module func inline result
_datetime (builtin) GetToken yes 1.00
_datetime (builtin) GetSlot - 1.14x slower
_decimal GetToken - 1.12x slower
_decimal GetSlot - 1.16x slower
  • Win PGO
module func inline result
_datetime (builtin) GetToken yes 1.00
_datetime (builtin) GetSlot yes 1.02x slower
_decimal GetToken - 1.19x slower
_decimal GetSlot - 1.22x slower

neonene avatar Jun 29 '24 00:06 neonene

The tp_bases can be used in a fallback implementation. I checked the overhead, adding a repeat function temporarily (100972db793a2cb545600289aa91fb856d862fb8):

from timeit import timeit
setup = f"""if 1:
    import _testcapi
    A = _testcapi.create_type_with_token("_testcapi.A", 0)
    tokenA = _testcapi.get_tp_token(A)
    class B(A): pass
    class C(B): pass
    class D(C): pass
    class E(D): pass
    getbase = _testcapi.repeat_getbasebytoken
"""
c_repeat = 10  # py_repeat = timeit default (1000000)
mro   = timeit(s1 := f'getbase(C, tokenA, {c_repeat}, True)', setup)
bases = timeit(s2 := f'getbase(C, tokenA, {c_repeat}, False)', setup)
print(s1, mro)
print(s2, bases,  bases / mro)

Win non-debug: (the higher, the slower)

find A
from
starts
with
run once
in C
repeat 10
in C
mro class A 1.00 1.00
bases class A 1.00x 1.00x
mro class B 1.00x 1.05x
bases class B 1.01x 1.14x
mro class C 1.01x 1.10x
bases class C 1.04x 1.40x
mro class E 1.02x 1.16x
bases class E 1.08x 1.50x

neonene avatar Jul 03 '24 12:07 neonene

The commit 2e59344a90a65c65ec6bba43a79ed5138d0da427 added repeat functions to compare the performance with PyType_GetModuleByDef().

Find superclass A:

from timeit import timeit
setup = f"""if 1:
    import _testcapi
    A = _testcapi.create_type_with_token("_testcapi.A", 0)
    tokenA = _testcapi.get_tp_token(A)
    class B(A): pass
    class C(B): pass
    getbase    = _testcapi.repeat_getbasebytoken
    getslot    = _testcapi.repeat_getslot
    getmodonce = _testcapi.repeat_getmodonce
    getmodeach = _testcapi.repeat_getmodeach
"""
c_repeat = 10
r0 = timeit(s0 := f'getbase   (A, tokenA, {c_repeat})', setup)
r1 = timeit(s1 := f'getslot   (A, tokenA, {c_repeat})', setup)
r2 = timeit(s2 := f'getmodonce(A,      A, {c_repeat})', setup)
r3 = timeit(s3 := f'getmodeach(A,      A, {c_repeat})', setup)
r4 = timeit(s4 := f'getbase   (C, tokenA, {c_repeat})', setup)
r5 = timeit(s5 := f'getmodonce(C,      A, {c_repeat})', setup)
r6 = timeit(s6 := f'getmodeach(C,      A, {c_repeat})', setup)
print(s0, r0)
print(s1, r1,  r1 / r0)
print(s2, r2,  r2 / r0)
print(s3, r3,  r3 / r0)
print(s4, r4,  r4 / r0)
print(s5, r5,  r5 / r0)
print(s6, r6,  r6 / r0)

Win non-debug, non-PGO (the higher, the slower)

subclass: A GetModule
ByDef()
no repeat
in C
repeat 10
in C
GetBaseByToken(subclass, tokenA) - 1.00 1.00
GetSlot(subclass) == tokenA - 1.00x 1.00x
subclass == mod_state->A once 0.97x 0.76x
for each (ditto) 1.26x
subclass: C
GetBaseByToken(subclass, tokenA) - 1.01x
(1.00 )
1.16x
(1.00 )
IsSubtype(subclass, A) once 1.03x
(1.01x)
1.04x
(0.91x)
for each (ditto) 1.55x
(1.35x)

UPDATE: The results ware a bit unfair because the overhead of PyLong_AsVoidPtr() was not added to PyType_GetModuleByDef's tests.

neonene avatar Jul 05 '24 09:07 neonene

I left some suggestions as a PR: https://github.com/neonene/cpython/pull/1

encukou avatar Jul 22 '24 12:07 encukou

Tested the performance again, which now takes into account the overhead of PyLong_AsVoidPtr(): 7735ef1fe2fa5f94215dbb6a56ccd1a4be5b0f9d[^1].

script (expand)
from timeit import timeit
setup = """if 1:
    import _testcapi
    A = _testcapi.create_type_with_token("_testcapi.A", 0)
    tokenA = _testcapi.get_tp_token(A)
    class B(A): pass
    class C(B): pass
    getbase    = _testcapi.repeat_getbasebytoken  # NULL
    getslot    = _testcapi.repeat_getslot
    getmodonce = _testcapi.repeat_getmodonce
    getmodeach = _testcapi.repeat_getmodeach
    issubtype  = _testcapi.repeat_issubtype
"""
c_repeat = 10
r0 = timeit(s0 := f'getbase   (A, tokenA, A, {c_repeat})', setup)
r1 = timeit(s1 := f'getslot   (A, tokenA, A, {c_repeat})', setup)
r2 = timeit(s2 := f'getmodonce(A, tokenA, A, {c_repeat})', setup)
r3 = timeit(s3 := f'getmodeach(A, tokenA, A, {c_repeat})', setup)
r4 = timeit(s4 := f'issubtype (A, tokenA, A, {c_repeat})', setup)
r5 = timeit(s5 := f'getbase   (C, tokenA, A, {c_repeat})', setup)
r6 = timeit(s6 := f'getmodonce(C, tokenA, A, {c_repeat})', setup)
r7 = timeit(s7 := f'getmodeach(C, tokenA, A, {c_repeat})', setup)
r8 = timeit(s8 := f'issubtype (C, tokenA, A, {c_repeat})', setup)
print(s0, r0 / r0)
print(s1, r1 / r0)
print(s2, r2 / r0)
print(s3, r3 / r0)
print(s4, r4 / r0)
print(s5, r5 / r0)
print(s6, r6 / r0)
print(s7, r7 / r0)
print(s8, r8 / r0)

Repeat n-times in a C function (Windows PGO, the higher, the slower):

find super-class A n = 1 n = 10 n = 1 n = 10
subclass A
GetBaseByToken * n (base) (base)
GetSlot * n 1.00x 1.01x
GetModuleByDef + CheckExact * n 1.01x 0.92x
(GetModuleByDef + CheckExact) * n 1.00x 1.06x
IsSubtype * n 0.99x 0.97x
subclass C
GetBaseByToken * n 1.00x 1.13x (base) (base)
GetModuleByDef + IsSubtype * n 1.02x 1.08x 1.04x 0.97x
(GetModuleByDef + IsSubtype) * n 1.02x 1.34x 1.04x 1.21x
IsSubtype * n 1.01x 1.03x 0.99x 0.95x

[^1]: UPDATE: || may be better in "if (!token && !super)", but both should be OK unless token or super are optimized away before reaching the branch.

neonene avatar Aug 27 '24 08:08 neonene