cython icon indicating copy to clipboard operation
cython copied to clipboard

[BUG] performance regressions in 3.1

Open scoder opened this issue 9 months ago • 11 comments

Describe the bug

I ran the pyperformance benchmark bm_async_generators in Cython 3.0.x and master.

https://github.com/python/pyperformance/blob/56d12a8fd7cc1432835965d374929bfa7f6f7a07/pyperformance/data-files/benchmarks/bm_async_generators/run_benchmark.py

Cython 3.1 / master: $ python3.12 -m timeit -s 'import asyncio, run_benchmark as rb' 'asyncio.run(rb.bench_async_generators())' 1 loop, best of 5: 385 msec per loop

Cython 3.0.x: $ python3.12 -m timeit -s 'import asyncio, run_benchmark as rb' 'asyncio.run(rb.bench_async_generators())' 1 loop, best of 5: 247 msec per loop

This is a very serious regression. perf shows me this: 3.0.x:

   7,76%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_asend_send
   5,96%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_gb_13run_benchmark_4Tree_4generator
   7,76%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_asend_send                   
   5,96%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_gb_13run_benchmark_4Tree_4generator    
   4,95%  python   python3.12                                     [.] _PyEval_EvalFrameDefault                     
   4,60%  python   python3.12                                     [.] PyTuple_New                                  
   4,55%  python   python3.12                                     [.] PyObject_Free                                
   3,46%  python   python3.12                                     [.] PyType_GenericAlloc                          
   3,35%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_Coroutine_Send                         
   3,19%  python   python3.12                                     [.] PyObject_Malloc                              
   2,52%  python   python3.12                                     [.] 0x0000000000154710                           
   2,46%  python   python3.12                                     [.] PyObject_GC_UnTrack                          
   2,10%  python   python3.12                                     [.] PyObject_GC_Track                            
   2,06%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_PyGen__FetchStopIterationValue.constpro
   1,95%  python   python3.12                                     [.] _Py_NewReference                             
   1,78%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_Coroutine_SendEx                       
   1,29%  python   libc.so.6                                      [.] __memset_evex_unaligned_erms                 
   1,25%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx__PyAsyncGenValueWrapperNew             
   1,01%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_pw_13run_benchmark_1tree               
   0,97%  python   python3.12                                     [.] PyObject_GC_Del                              
   0,97%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_init_hooks                   
   0,93%  python   python3.12                                     [.] _PyTrash_end                                 
   0,89%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_asend_dealloc                
   0,88%  python   python3.12                                     [.] _PyErr_SetObject                             
   0,85%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_wrapped_val_dealloc          
   0,80%  python   python3.12                                     [.] _PyThreadState_UncheckedGet                  
   0,79%  python   python3.12                                     [.] _PyTrash_begin                               
   0,78%  python   python3.12                                     [.] _PyObject_GenericGetAttrWithDict             
   0,78%  python   python3.12                                     [.] _Py_CheckFunctionResult                      
   0,72%  python   python3.12                                     [.] _PyType_Lookup                               
   0,65%  python   python3.12                                     [.] _Py_Dealloc                                  
   0,56%  python   python3.12                                     [.] 0x0000000000178f00                           
   0,54%  python   python3.12                                     [.] PyObject_GenericSetAttr                      
   0,53%  python   python3.12                                     [.] PyObject_RichCompareBool                     
   0,50%  python   python3.12                                     [.] 0x000000000015477e                           
   0,42%  python   python3.12                                     [.] _PyObject_MakeTpCall                         
   0,39%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_Coroutine_clear.isra.0                 
   0,39%  python   python3.12                                     [.] _PySlice_GetLongIndices                      
   0,37%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] _Py_Dealloc@plt                              
   0,36%  python   python3.12                                     [.] PySlice_New                                  
   0,32%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] PyObject_GC_Track@plt                        
   0,32%  python   python3.12                                     [.] 0x000000000021c4ff                           
   0,32%  python   python3.12                                     [.] 0x00000000001914d1                           
   0,30%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] _Py_NewReference@plt                         
   0,30%  python   python3.12                                     [.] PyObject_IsTrue                              
   0,29%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_pw_13run_benchmark_4Tree_3__aiter__    

master:

   6,16%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_asend_send
   5,40%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_gb_13run_benchmark_4Tree_4generator
   5,19%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx__ReturnWithStopIteration
   6,16%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_asend_send                   
   5,40%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_gb_13run_benchmark_4Tree_4generator    
   5,19%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx__ReturnWithStopIteration               
   4,96%  python   python3.12                                     [.] _PyEval_EvalFrameDefault                     
   3,62%  python   python3.12                                     [.] PyType_GenericAlloc                          
   3,42%  python   python3.12                                     [.] PyObject_Free                                
   3,13%  python   python3.12                                     [.] PyObject_Malloc                              
   2,97%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_PyGen__FetchStopIterationValue         
   2,62%  python   python3.12                                     [.] _PyObject_MakeTpCall                         
   2,50%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_Coroutine_SendEx                       
   2,47%  python   python3.12                                     [.] PyObject_GC_Track                            
   2,41%  python   python3.12                                     [.] PyObject_GC_UnTrack                          
   2,10%  python   python3.12                                     [.] 0x0000000000154710                           
   1,78%  python   python3.12                                     [.] _Py_NewReference                             
   1,31%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_asend_dealloc                
   1,19%  python   python3.12                                     [.] PyObject_VectorcallDict                      
   1,02%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx__PyAsyncGenValueWrapperNew             
   0,91%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_wrapped_val_dealloc          
   0,90%  python   python3.12                                     [.] PyObject_GC_Del                              
   0,89%  python   python3.12                                     [.] _PyErr_SetObject                             
   0,84%  python   libc.so.6                                      [.] __memset_evex_unaligned_erms                 
   0,84%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_pw_13run_benchmark_1tree               
   0,75%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __Pyx_async_gen_init_hooks                   
   0,73%  python   python3.12                                     [.] _PyObject_GenericGetAttrWithDict             
   0,70%  python   python3.12                                     [.] _Py_Dealloc                                  
   0,69%  python   python3.12                                     [.] _PyTrash_end                                 
   0,64%  python   python3.12                                     [.] _PyType_Lookup                               
   0,63%  python   python3.12                                     [.] _PyTrash_begin                               
   0,59%  python   python3.12                                     [.] _Py_CheckFunctionResult                      
   0,59%  python   python3.12                                     [.] _PyThreadState_UncheckedGet                  
   0,53%  python   python3.12                                     [.] 0x000000000015477e                           
   0,50%  python   python3.12                                     [.] PyObject_GenericSetAttr                      
   0,48%  python   python3.12                                     [.] 0x0000000000178f00                           
   0,47%  python   python3.12                                     [.] PyObject_RichCompareBool                     
   0,44%  python   python3.12                                     [.] _PySlice_GetLongIndices                      
   0,41%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] _Py_Dealloc@plt                              
   0,40%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_gb_13run_benchmark_4generator1         
   0,37%  python   python3.12                                     [.] 0x0000000000199b25                           
   0,35%  python   python3.12                                     [.] PyNumber_Add                                 
   0,35%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] _PyThreadState_UncheckedGet@plt              
   0,33%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_pw_13run_benchmark_4Tree_3__aiter__    
   0,33%  python   python3.12                                     [.] 0x00000000001909e2                           
   0,32%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] PyObject_GC_Track@plt                        
   0,30%  python   python3.12                                     [.] 0x00000000001914d1                           
   0,28%  python   python3.12                                     [.] 0x0000000000190ac1                           
   0,28%  python   python3.12                                     [.] PyObject_IsTrue                              
   0,28%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] PyObject_GC_UnTrack@plt                      
   0,28%  python   python3.12                                     [.] _PyTrash_cond                                
   0,27%  python   run_benchmark.cpython-312-x86_64-linux-gnu.so  [.] __pyx_pw_13run_benchmark_4Tree_1__init__     

So, it would seem that we lost the ability to return from generators without setting a StopIteration, which now eats lots of runtime.

Code to reproduce the behaviour:

No response

Expected behaviour

No response

OS

No response

Python version

3.12.3

Cython version

master

Additional context

No response

scoder avatar Mar 09 '25 18:03 scoder

After trying a couple of more benchmarks, it seems that there are some more regressions. Most benchmarks I tried are slightly to visibly slower with the master branch than with 3.0.x. Most of the pyperformance benchmarks are difficult to run with Cython, but I tried "nbody", "raytrace", "comprehensions", "async_generators", all of them single file benchmarks that allow easy importing as extension modules.

This suggests that we lost performance in several areas.

scoder avatar Mar 10 '25 08:03 scoder

I think I've fixed the degradation in raytrace in https://github.com/cython/cython/pull/6730. I'd assume this fix makes a mild difference to most code (but haven't tested in on the others).

da-woods avatar Mar 15 '25 11:03 da-woods

I added a benchmark runner to the CI jobs. Here's the latest run:

2025-03-16 18:01:08,682  ### Benchmark 'bm_async_generators' (min/median/max):
2025-03-16 18:01:08,682      origin/3.0.x         =   2.286 msec,   2.301 msec,   2.607 msec
2025-03-16 18:01:08,682      origin/master        =   2.402 msec,   2.430 msec,   2.463 msec
2025-03-16 18:01:08,683      Python               =   3.128 msec,   3.150 msec,   3.176 msec
2025-03-16 18:01:08,683  ### Benchmark 'bm_chaos' (min/median/max):
2025-03-16 18:01:08,683      origin/3.0.x         =  27.370 msec,  27.527 msec,  28.077 msec
2025-03-16 18:01:08,683      origin/master        =  29.354 msec,  29.498 msec,  30.740 msec
2025-03-16 18:01:08,683      Python               =  73.219 msec,  73.573 msec,  74.483 msec
2025-03-16 18:01:08,683  ### Benchmark 'bm_coroutines' (min/median/max):
2025-03-16 18:01:08,683      origin/3.0.x         = 171.114 msec, 171.783 msec, 173.794 msec
2025-03-16 18:01:08,683      origin/master        =  93.204 msec,  94.040 msec,  98.909 msec
2025-03-16 18:01:08,683      Python               = 113.180 msec, 113.359 msec, 114.323 msec
2025-03-16 18:01:08,683  ### Benchmark 'bm_fib' (min/median/max):
2025-03-16 18:01:08,683      origin/3.0.x         =  22.046 msec,  22.076 msec,  22.120 msec
2025-03-16 18:01:08,683      origin/master        =  21.862 msec,  21.920 msec,  22.044 msec
2025-03-16 18:01:08,683      Python               =    1.215 sec,    1.218 sec,    1.224 sec
2025-03-16 18:01:08,683  ### Benchmark 'bm_generators' (min/median/max):
2025-03-16 18:01:08,683      origin/3.0.x         =   2.142 msec,   2.200 msec,   2.627 msec
2025-03-16 18:01:08,683      origin/master        =   2.265 msec,   2.280 msec,   2.428 msec
2025-03-16 18:01:08,683      Python               =   6.757 msec,   6.789 msec,   7.314 msec
2025-03-16 18:01:08,683  ### Benchmark 'bm_meteor_contest' (min/median/max):
2025-03-16 18:01:08,683      origin/3.0.x         =  63.879 msec,  64.197 msec,  65.194 msec
2025-03-16 18:01:08,683      origin/master        =  62.812 msec,  63.578 msec,  63.834 msec
2025-03-16 18:01:08,683      Python               =  79.620 msec,  79.710 msec,  80.021 msec
2025-03-16 18:01:08,683  ### Benchmark 'bm_nbody' (min/median/max):
2025-03-16 18:01:08,683      origin/3.0.x         = 100.562 msec, 100.662 msec, 100.831 msec
2025-03-16 18:01:08,683      origin/master        =  66.390 msec,  66.773 msec,  67.106 msec
2025-03-16 18:01:08,683      Python               = 115.132 msec, 116.452 msec, 117.587 msec
2025-03-16 18:01:08,683  ### Benchmark 'bm_nqueens' (min/median/max):
2025-03-16 18:01:08,684      origin/3.0.x         =  32.170 msec,  32.293 msec,  32.388 msec
2025-03-16 18:01:08,684      origin/master        =  32.699 msec,  32.763 msec,  33.076 msec
2025-03-16 18:01:08,684      Python               =  66.299 msec,  66.556 msec,  66.804 msec
2025-03-16 18:01:08,684  ### Benchmark 'bm_raytrace' (min/median/max):
2025-03-16 18:01:08,684      origin/3.0.x         =    1.520 sec,    1.521 sec,    1.532 sec
2025-03-16 18:01:08,684      origin/master        =    1.592 sec,    1.600 sec,    1.609 sec
2025-03-16 18:01:08,684      Python               = 966.737 msec, 970.437 msec, 986.784 msec
2025-03-16 18:01:08,684  ### Benchmark 'bm_unpack_sequence' (min/median/max):
2025-03-16 18:01:08,684      origin/3.0.x         =   2.484 msec,   2.499 msec,   2.582 msec
2025-03-16 18:01:08,684      origin/master        =   2.326 msec,   2.327 msec,   2.339 msec
2025-03-16 18:01:08,684      Python               =   9.379 msec,   9.407 msec,   9.466 msec

So, for most of these (simple) benchmarks, it doesn't look that bad. The raytrace benchmark seems worth looking into, although that's not a regression, at least. Might also be a pessimisation due to static typing. But there are also a few others that got slower since 3.0.x. We should make sure that we didn't do something stupid here. Users shouldn't have a reason to stick with 3.0.x.

scoder avatar Mar 16 '25 18:03 scoder

The raytrace benchmark is dominated by attribute lookups:

   7,93%  python3          python3.12                                   [.] _PyObject_GenericGetAttrWithDict
   4,26%  python3          python3.12                                   [.] _PyType_Lookup

I'm running these with CPython 3.12/3.13 which received a lot of internal optimisations for all sorts of standard object operations, clearly including attribute lookups. We should see if we can catch up somehow.

scoder avatar Mar 16 '25 19:03 scoder

According to callgrind, 10% of the runtime of the bm_async_generators benchmark are spent in 10% of the calls to __Pyx_async_gen_anext() that set up the asyncio hooks with __Pyx_async_gen_init_hooks() (which executes a simple Python function). That seems worth optimising in asyncio, not in Cython. :-/

It also appears that StopIteration tends to have a traceback, which is costly and useless in almost all relevant cases. Worth investigating if that's us adding one or CPython. (EDIT: According to gdb, __Pyx_AddTraceback() is never called.)

Furthermore, __Pyx__ReturnWithStopIteration() tries to look up a vectorcall function for the StopIteration instantiation and fails, falling back to PyObject_VectorcallDict() in __Pyx_PyObject_FastCallDict(). That seems costly. Maybe we can speed up the exception creation here? At least, trying to do a vectorcall that never succeeds is unnecessarily inefficient. It might even be faster to call PyErr_SetObject(exc_type, return_value) – ~although CPython seems to do the same thing there (trying vectorcall first)~. Edit: Ah, CPython 3.14 implements vectorcall for exceptions. 3.12 doesn't yet. That makes a difference, because we always try vectorcall first, and 3.12/13 end up losing performance.

scoder avatar Mar 17 '25 20:03 scoder

__Pyx__ReturnWithStopIteration() tries to look up a vectorcall function for the StopIteration instantiation and fails

https://github.com/cython/cython/pull/6738

scoder avatar Mar 18 '25 04:03 scoder

It's interesting to see that our fstrings implementation apparently got slower between Python 3.10 and 3.12:

2025-03-18 16:10:46  ### Benchmark 'bm_fstrings' (min/median/max):
2025-03-18 16:10:46      Cython 'origin/3.0.x'     =   1.121 msec,   1.124 msec,   1.135 msec
2025-03-18 16:10:46      Cython 'origin/master'    =   1.144 msec,   1.147 msec,   1.287 msec
2025-03-18 16:10:46      Python 3.10.16            =   1.420 msec,   1.431 msec,   1.439 msec

2025-03-18 16:14:51  ### Benchmark 'bm_fstrings' (min/median/max):
2025-03-18 16:14:51      Cython 'origin/3.0.x'     =   1.803 msec,   1.866 msec,   2.408 msec
2025-03-18 16:14:51      Cython 'origin/master'    =   1.742 msec,   1.753 msec,   1.778 msec
2025-03-18 16:14:51      Python 3.12.9             =   1.614 msec,   1.631 msec,   1.705 msec

2025-03-18 16:22:46  ### Benchmark 'bm_fstrings' (min/median/max):
2025-03-18 16:22:46      Cython 'origin/3.0.x'     =   1.804 msec,   1.815 msec,   1.834 msec
2025-03-18 16:22:46      Cython 'origin/master'    =   1.865 msec,   1.884 msec,   1.913 msec
2025-03-18 16:22:46      Python 3.14.0             =   1.662 msec,   1.681 msec,   1.728 msec

It looks different if we switch to formatting C numbers, that's much faster than letting Python do it. The above is mostly formatting Python object values. (Timings are from https://github.com/cython/cython/commit/826d852716af77ed407a7e1cfda6d516a13e784b)

scoder avatar Mar 18 '25 17:03 scoder

I included the Limited C-API in the benchmark runner to compare against normal Cython. It's currently only enabled for Py3.14.

scoder avatar Mar 19 '25 13:03 scoder

Turns out that we disabled vectorcall for extension types and builtins back when they didn't support it, and didn't enable it when they started to support it. I found this when looking at the bm_chaos benchmark that uses a couple of extension types. https://github.com/cython/cython/pull/6744

scoder avatar Mar 20 '25 10:03 scoder

is dis fixed

tooptoop4 avatar May 11 '25 01:05 tooptoop4

is [th]is fixed [?]

Not all of it. But you should do your own benchmarks to see how Cython 3.1 performs for your specific code.

scoder avatar May 14 '25 07:05 scoder