wasmtime-py icon indicating copy to clipboard operation
wasmtime-py copied to clipboard

performance: calling python from wasm is too slow

Open muayyad-alsadi opened this issue 1 year ago • 1 comments

as mentioned in parent performance ticket https://github.com/bytecodealliance/wasmtime-py/issues/96#issuecomment-1497336349

we have a bottleneck when python is called from inside wasm

here is a minimum reproducible example

unlike #96 which is to be solved by #139 the loop here lives inside wasm and the functions are in python, C (python's math) and wasm this issue is not solved by #139

experiment setup

(module
  (func $log_i (import "env" "log_i") (param i32))
  (func $math_gcd (import "env" "math_gcd") (param i32 i32) (result i32))
  (func $python_gcd (import "env" "python_gcd") (param i32 i32) (result i32))
  (func $gcd (export "gcd") (param i32 i32) (result i32)
  ;; ...
  )
  (func $wasm_gcd_loop (export "wasm_gcd_loop") (param $n i32) (param $a i32) (param $b i32) (result i32)
  ;; call wasm_gcd(a,b) n times
  )
  (func $math_gcd_loop (export "math_gcd_loop") (param $n i32) (param $a i32) (param $b i32) (result i32)
  ;; call math_gcd(a,b) n times
  )
  (func $python_gcd_loop (export "python_gcd_loop") (param $n i32) (param $a i32) (param $b i32) (result i32)
  ;; call python_gcd(a,b) n times
  )
)

I've logged i and logged gcd to make sure loop is working fine using log_i() I've removed the logging I've imported C gcd from python's math package I've imported python gcd

and did a loop inside wasm calling the 3 versions

print("*** wasm loop benchmark: ")
for name in "wasm_gcd_loop", "math_gcd_loop", "python_gcd_loop":
    gcdf = by_name[name]
    start_time = time.perf_counter()
    g = gcdf(N, a, b)
    total_time = time.perf_counter() - start_time
    print(total_time, "\t\t", name)

results

9.899199358187616e-05            wasm_gcd_loop
0.01052837198949419              math_gcd_loop
0.012672090000705793             python_gcd_loop

conclusion

when wasm loop call's wasm function it was too fast, it was 100x faster than calling C's gcd since C's gcd is faster than wasm gcd as shown in other benchmark, then this bottleneck is wrapping python part.

profiling

here is how we profile calling C from wasm

pr=Profile()
pr.enable()
print(math_gcd_loop(N, a, b))
pr.disable()
pr.print_stats()
pr.print_stats('cumulative')
pr.print_stats('calls')

using the production 7.0.0, the obvious bootleneck is in:

  • _func.py:172(trampoline)
  • _value.py:129(_convert)
  Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7024    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     3019    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
     2006    0.001    0.000    0.001    0.000 _bindings.py:178(wasm_valtype_kind)
     2005    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
     2001    0.001    0.000    0.001    0.000 _value.py:167(_value)
     1009    0.000    0.000    0.000    0.000 {built-in method _ctypes.POINTER}
     1008    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x7fc0d01430a0}
     1007    0.001    0.000    0.001    0.000 _types.py:45(_from_ptr)
     1007    0.001    0.000    0.002    0.000 _types.py:86(__del__)
     1004    0.000    0.000    0.000    0.000 _value.py:109(__init__)
     1004    0.000    0.000    0.000    0.000 _value.py:117(__del__)
     1004    0.000    0.000    0.000    0.000 _value.py:161(_unwrap_raw)
     1003    0.001    0.000    0.001    0.000 _bindings.py:172(wasm_valtype_new)
     1003    0.002    0.000    0.002    0.000 _value.py:39(i32)
     1003    0.002    0.000    0.012    0.000 _value.py:129(_convert)
     1003    0.001    0.000    0.003    0.000 _types.py:12(i32)
     1003    0.001    0.000    0.002    0.000 _types.py:54(__eq__)
     1003    0.001    0.000    0.001    0.000 _bindings.py:120(wasm_valtype_delete)
     1000    0.000    0.000    0.000    0.000 _func.py:121(__init__)
     1000    0.005    0.000    0.020    0.000 _func.py:172(trampoline)
     1000    0.000    0.000    0.001    0.000 _func.py:245(get)
     1000    0.000    0.000    0.001    0.000 _value.py:156(_into_raw)
     1000    0.000    0.000    0.000    0.000 typing.py:1737(cast)
     1000    0.000    0.000    0.000    0.000 {built-in method math.gcd}
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.delattr}

by time

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.022    0.022 _func.py:59(__call__)
        1    0.002    0.002    0.022    0.022 _bindings.py:2491(wasmtime_func_call)
     1000    0.005    0.000    0.020    0.000 _func.py:172(trampoline)
     1003    0.002    0.000    0.012    0.000 _value.py:129(_convert)
     1003    0.001    0.000    0.003    0.000 _types.py:12(i32)
     1003    0.001    0.000    0.002    0.000 _types.py:54(__eq__)
     1003    0.002    0.000    0.002    0.000 _value.py:39(i32)
     1007    0.001    0.000    0.002    0.000 _types.py:86(__del__)
     2006    0.001    0.000    0.001    0.000 _bindings.py:178(wasm_valtype_kind)
     1007    0.001    0.000    0.001    0.000 _types.py:45(_from_ptr)
     2001    0.001    0.000    0.001    0.000 _value.py:167(_value)
     1000    0.000    0.000    0.001    0.000 _value.py:156(_into_raw)
     1003    0.001    0.000    0.001    0.000 _bindings.py:120(wasm_valtype_delete)
     1003    0.001    0.000    0.001    0.000 _bindings.py:172(wasm_valtype_new)
     7024    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     1000    0.000    0.000    0.001    0.000 _func.py:245(get)

muayyad-alsadi avatar Apr 05 '23 11:04 muayyad-alsadi