[BUG] performance regressions in 3.1
Describe the bug
I ran the pyperformance benchmark bm_async_generators in Cython 3.0.x and master.
https://github.com/python/pyperformance/blob/56d12a8fd7cc1432835965d374929bfa7f6f7a07/pyperformance/data-files/benchmarks/bm_async_generators/run_benchmark.py
Cython 3.1 / master: $ python3.12 -m timeit -s 'import asyncio, run_benchmark as rb' 'asyncio.run(rb.bench_async_generators())' 1 loop, best of 5: 385 msec per loop
Cython 3.0.x: $ python3.12 -m timeit -s 'import asyncio, run_benchmark as rb' 'asyncio.run(rb.bench_async_generators())' 1 loop, best of 5: 247 msec per loop
This is a very serious regression. perf shows me this:
3.0.x:
7,76% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_asend_send
5,96% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_gb_13run_benchmark_4Tree_4generator
7,76% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_asend_send
5,96% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_gb_13run_benchmark_4Tree_4generator
4,95% python python3.12 [.] _PyEval_EvalFrameDefault
4,60% python python3.12 [.] PyTuple_New
4,55% python python3.12 [.] PyObject_Free
3,46% python python3.12 [.] PyType_GenericAlloc
3,35% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_Coroutine_Send
3,19% python python3.12 [.] PyObject_Malloc
2,52% python python3.12 [.] 0x0000000000154710
2,46% python python3.12 [.] PyObject_GC_UnTrack
2,10% python python3.12 [.] PyObject_GC_Track
2,06% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_PyGen__FetchStopIterationValue.constpro
1,95% python python3.12 [.] _Py_NewReference
1,78% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_Coroutine_SendEx
1,29% python libc.so.6 [.] __memset_evex_unaligned_erms
1,25% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx__PyAsyncGenValueWrapperNew
1,01% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_pw_13run_benchmark_1tree
0,97% python python3.12 [.] PyObject_GC_Del
0,97% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_init_hooks
0,93% python python3.12 [.] _PyTrash_end
0,89% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_asend_dealloc
0,88% python python3.12 [.] _PyErr_SetObject
0,85% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_wrapped_val_dealloc
0,80% python python3.12 [.] _PyThreadState_UncheckedGet
0,79% python python3.12 [.] _PyTrash_begin
0,78% python python3.12 [.] _PyObject_GenericGetAttrWithDict
0,78% python python3.12 [.] _Py_CheckFunctionResult
0,72% python python3.12 [.] _PyType_Lookup
0,65% python python3.12 [.] _Py_Dealloc
0,56% python python3.12 [.] 0x0000000000178f00
0,54% python python3.12 [.] PyObject_GenericSetAttr
0,53% python python3.12 [.] PyObject_RichCompareBool
0,50% python python3.12 [.] 0x000000000015477e
0,42% python python3.12 [.] _PyObject_MakeTpCall
0,39% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_Coroutine_clear.isra.0
0,39% python python3.12 [.] _PySlice_GetLongIndices
0,37% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] _Py_Dealloc@plt
0,36% python python3.12 [.] PySlice_New
0,32% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] PyObject_GC_Track@plt
0,32% python python3.12 [.] 0x000000000021c4ff
0,32% python python3.12 [.] 0x00000000001914d1
0,30% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] _Py_NewReference@plt
0,30% python python3.12 [.] PyObject_IsTrue
0,29% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_pw_13run_benchmark_4Tree_3__aiter__
master:
6,16% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_asend_send
5,40% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_gb_13run_benchmark_4Tree_4generator
5,19% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx__ReturnWithStopIteration
6,16% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_asend_send
5,40% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_gb_13run_benchmark_4Tree_4generator
5,19% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx__ReturnWithStopIteration
4,96% python python3.12 [.] _PyEval_EvalFrameDefault
3,62% python python3.12 [.] PyType_GenericAlloc
3,42% python python3.12 [.] PyObject_Free
3,13% python python3.12 [.] PyObject_Malloc
2,97% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_PyGen__FetchStopIterationValue
2,62% python python3.12 [.] _PyObject_MakeTpCall
2,50% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_Coroutine_SendEx
2,47% python python3.12 [.] PyObject_GC_Track
2,41% python python3.12 [.] PyObject_GC_UnTrack
2,10% python python3.12 [.] 0x0000000000154710
1,78% python python3.12 [.] _Py_NewReference
1,31% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_asend_dealloc
1,19% python python3.12 [.] PyObject_VectorcallDict
1,02% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx__PyAsyncGenValueWrapperNew
0,91% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_wrapped_val_dealloc
0,90% python python3.12 [.] PyObject_GC_Del
0,89% python python3.12 [.] _PyErr_SetObject
0,84% python libc.so.6 [.] __memset_evex_unaligned_erms
0,84% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_pw_13run_benchmark_1tree
0,75% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __Pyx_async_gen_init_hooks
0,73% python python3.12 [.] _PyObject_GenericGetAttrWithDict
0,70% python python3.12 [.] _Py_Dealloc
0,69% python python3.12 [.] _PyTrash_end
0,64% python python3.12 [.] _PyType_Lookup
0,63% python python3.12 [.] _PyTrash_begin
0,59% python python3.12 [.] _Py_CheckFunctionResult
0,59% python python3.12 [.] _PyThreadState_UncheckedGet
0,53% python python3.12 [.] 0x000000000015477e
0,50% python python3.12 [.] PyObject_GenericSetAttr
0,48% python python3.12 [.] 0x0000000000178f00
0,47% python python3.12 [.] PyObject_RichCompareBool
0,44% python python3.12 [.] _PySlice_GetLongIndices
0,41% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] _Py_Dealloc@plt
0,40% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_gb_13run_benchmark_4generator1
0,37% python python3.12 [.] 0x0000000000199b25
0,35% python python3.12 [.] PyNumber_Add
0,35% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] _PyThreadState_UncheckedGet@plt
0,33% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_pw_13run_benchmark_4Tree_3__aiter__
0,33% python python3.12 [.] 0x00000000001909e2
0,32% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] PyObject_GC_Track@plt
0,30% python python3.12 [.] 0x00000000001914d1
0,28% python python3.12 [.] 0x0000000000190ac1
0,28% python python3.12 [.] PyObject_IsTrue
0,28% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] PyObject_GC_UnTrack@plt
0,28% python python3.12 [.] _PyTrash_cond
0,27% python run_benchmark.cpython-312-x86_64-linux-gnu.so [.] __pyx_pw_13run_benchmark_4Tree_1__init__
So, it would seem that we lost the ability to return from generators without setting a StopIteration, which now eats lots of runtime.
Code to reproduce the behaviour:
No response
Expected behaviour
No response
OS
No response
Python version
3.12.3
Cython version
master
Additional context
No response
After trying a couple of more benchmarks, it seems that there are some more regressions. Most benchmarks I tried are slightly to visibly slower with the master branch than with 3.0.x. Most of the pyperformance benchmarks are difficult to run with Cython, but I tried "nbody", "raytrace", "comprehensions", "async_generators", all of them single file benchmarks that allow easy importing as extension modules.
This suggests that we lost performance in several areas.
I think I've fixed the degradation in raytrace in https://github.com/cython/cython/pull/6730. I'd assume this fix makes a mild difference to most code (but haven't tested in on the others).
I added a benchmark runner to the CI jobs. Here's the latest run:
2025-03-16 18:01:08,682 ### Benchmark 'bm_async_generators' (min/median/max):
2025-03-16 18:01:08,682 origin/3.0.x = 2.286 msec, 2.301 msec, 2.607 msec
2025-03-16 18:01:08,682 origin/master = 2.402 msec, 2.430 msec, 2.463 msec
2025-03-16 18:01:08,683 Python = 3.128 msec, 3.150 msec, 3.176 msec
2025-03-16 18:01:08,683 ### Benchmark 'bm_chaos' (min/median/max):
2025-03-16 18:01:08,683 origin/3.0.x = 27.370 msec, 27.527 msec, 28.077 msec
2025-03-16 18:01:08,683 origin/master = 29.354 msec, 29.498 msec, 30.740 msec
2025-03-16 18:01:08,683 Python = 73.219 msec, 73.573 msec, 74.483 msec
2025-03-16 18:01:08,683 ### Benchmark 'bm_coroutines' (min/median/max):
2025-03-16 18:01:08,683 origin/3.0.x = 171.114 msec, 171.783 msec, 173.794 msec
2025-03-16 18:01:08,683 origin/master = 93.204 msec, 94.040 msec, 98.909 msec
2025-03-16 18:01:08,683 Python = 113.180 msec, 113.359 msec, 114.323 msec
2025-03-16 18:01:08,683 ### Benchmark 'bm_fib' (min/median/max):
2025-03-16 18:01:08,683 origin/3.0.x = 22.046 msec, 22.076 msec, 22.120 msec
2025-03-16 18:01:08,683 origin/master = 21.862 msec, 21.920 msec, 22.044 msec
2025-03-16 18:01:08,683 Python = 1.215 sec, 1.218 sec, 1.224 sec
2025-03-16 18:01:08,683 ### Benchmark 'bm_generators' (min/median/max):
2025-03-16 18:01:08,683 origin/3.0.x = 2.142 msec, 2.200 msec, 2.627 msec
2025-03-16 18:01:08,683 origin/master = 2.265 msec, 2.280 msec, 2.428 msec
2025-03-16 18:01:08,683 Python = 6.757 msec, 6.789 msec, 7.314 msec
2025-03-16 18:01:08,683 ### Benchmark 'bm_meteor_contest' (min/median/max):
2025-03-16 18:01:08,683 origin/3.0.x = 63.879 msec, 64.197 msec, 65.194 msec
2025-03-16 18:01:08,683 origin/master = 62.812 msec, 63.578 msec, 63.834 msec
2025-03-16 18:01:08,683 Python = 79.620 msec, 79.710 msec, 80.021 msec
2025-03-16 18:01:08,683 ### Benchmark 'bm_nbody' (min/median/max):
2025-03-16 18:01:08,683 origin/3.0.x = 100.562 msec, 100.662 msec, 100.831 msec
2025-03-16 18:01:08,683 origin/master = 66.390 msec, 66.773 msec, 67.106 msec
2025-03-16 18:01:08,683 Python = 115.132 msec, 116.452 msec, 117.587 msec
2025-03-16 18:01:08,683 ### Benchmark 'bm_nqueens' (min/median/max):
2025-03-16 18:01:08,684 origin/3.0.x = 32.170 msec, 32.293 msec, 32.388 msec
2025-03-16 18:01:08,684 origin/master = 32.699 msec, 32.763 msec, 33.076 msec
2025-03-16 18:01:08,684 Python = 66.299 msec, 66.556 msec, 66.804 msec
2025-03-16 18:01:08,684 ### Benchmark 'bm_raytrace' (min/median/max):
2025-03-16 18:01:08,684 origin/3.0.x = 1.520 sec, 1.521 sec, 1.532 sec
2025-03-16 18:01:08,684 origin/master = 1.592 sec, 1.600 sec, 1.609 sec
2025-03-16 18:01:08,684 Python = 966.737 msec, 970.437 msec, 986.784 msec
2025-03-16 18:01:08,684 ### Benchmark 'bm_unpack_sequence' (min/median/max):
2025-03-16 18:01:08,684 origin/3.0.x = 2.484 msec, 2.499 msec, 2.582 msec
2025-03-16 18:01:08,684 origin/master = 2.326 msec, 2.327 msec, 2.339 msec
2025-03-16 18:01:08,684 Python = 9.379 msec, 9.407 msec, 9.466 msec
So, for most of these (simple) benchmarks, it doesn't look that bad. The raytrace benchmark seems worth looking into, although that's not a regression, at least. Might also be a pessimisation due to static typing. But there are also a few others that got slower since 3.0.x. We should make sure that we didn't do something stupid here. Users shouldn't have a reason to stick with 3.0.x.
The raytrace benchmark is dominated by attribute lookups:
7,93% python3 python3.12 [.] _PyObject_GenericGetAttrWithDict
4,26% python3 python3.12 [.] _PyType_Lookup
I'm running these with CPython 3.12/3.13 which received a lot of internal optimisations for all sorts of standard object operations, clearly including attribute lookups. We should see if we can catch up somehow.
According to callgrind, 10% of the runtime of the bm_async_generators benchmark are spent in 10% of the calls to __Pyx_async_gen_anext() that set up the asyncio hooks with __Pyx_async_gen_init_hooks() (which executes a simple Python function). That seems worth optimising in asyncio, not in Cython. :-/
It also appears that StopIteration tends to have a traceback, which is costly and useless in almost all relevant cases. Worth investigating if that's us adding one or CPython. (EDIT: According to gdb, __Pyx_AddTraceback() is never called.)
Furthermore, __Pyx__ReturnWithStopIteration() tries to look up a vectorcall function for the StopIteration instantiation and fails, falling back to PyObject_VectorcallDict() in __Pyx_PyObject_FastCallDict(). That seems costly. Maybe we can speed up the exception creation here? At least, trying to do a vectorcall that never succeeds is unnecessarily inefficient. It might even be faster to call PyErr_SetObject(exc_type, return_value) – ~although CPython seems to do the same thing there (trying vectorcall first)~. Edit: Ah, CPython 3.14 implements vectorcall for exceptions. 3.12 doesn't yet. That makes a difference, because we always try vectorcall first, and 3.12/13 end up losing performance.
__Pyx__ReturnWithStopIteration()tries to look up a vectorcall function for theStopIterationinstantiation and fails
https://github.com/cython/cython/pull/6738
It's interesting to see that our fstrings implementation apparently got slower between Python 3.10 and 3.12:
2025-03-18 16:10:46 ### Benchmark 'bm_fstrings' (min/median/max):
2025-03-18 16:10:46 Cython 'origin/3.0.x' = 1.121 msec, 1.124 msec, 1.135 msec
2025-03-18 16:10:46 Cython 'origin/master' = 1.144 msec, 1.147 msec, 1.287 msec
2025-03-18 16:10:46 Python 3.10.16 = 1.420 msec, 1.431 msec, 1.439 msec
2025-03-18 16:14:51 ### Benchmark 'bm_fstrings' (min/median/max):
2025-03-18 16:14:51 Cython 'origin/3.0.x' = 1.803 msec, 1.866 msec, 2.408 msec
2025-03-18 16:14:51 Cython 'origin/master' = 1.742 msec, 1.753 msec, 1.778 msec
2025-03-18 16:14:51 Python 3.12.9 = 1.614 msec, 1.631 msec, 1.705 msec
2025-03-18 16:22:46 ### Benchmark 'bm_fstrings' (min/median/max):
2025-03-18 16:22:46 Cython 'origin/3.0.x' = 1.804 msec, 1.815 msec, 1.834 msec
2025-03-18 16:22:46 Cython 'origin/master' = 1.865 msec, 1.884 msec, 1.913 msec
2025-03-18 16:22:46 Python 3.14.0 = 1.662 msec, 1.681 msec, 1.728 msec
It looks different if we switch to formatting C numbers, that's much faster than letting Python do it. The above is mostly formatting Python object values. (Timings are from https://github.com/cython/cython/commit/826d852716af77ed407a7e1cfda6d516a13e784b)
I included the Limited C-API in the benchmark runner to compare against normal Cython. It's currently only enabled for Py3.14.
Turns out that we disabled vectorcall for extension types and builtins back when they didn't support it, and didn't enable it when they started to support it. I found this when looking at the bm_chaos benchmark that uses a couple of extension types. https://github.com/cython/cython/pull/6744
is dis fixed
is [th]is fixed [?]
Not all of it. But you should do your own benchmarks to see how Cython 3.1 performs for your specific code.