pyo3 Performance: calling overhead

Continued from conversation on long-dead #1607.

I've still seeing a significant overhead when calling

#[pyfunction]
pub fn slow_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
    obj.len()
}

vs. a more "baremetal" implementation.

Timings:

In [2]: %timeit slow_len((1, 2, 3, 4))
33.5 ns ± 0.147 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: %timeit fast_len((1, 2, 3, 4))
12.6 ns ± 0.0159 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

I see this 20-40ns overhead in calling PyO3 functions in many scenarios.

Also my "baremetal" code is still using _PyCFunctionFastWithKeywords, not _PyCFunctionFast as there don't seem to be methods available to call that. I assume there might be further performance improvements available over my fast_len method if _PyCFunctionFast could be used?

Code

use pyo3::ffi;
use pyo3::ffi::{PyLong_FromSize_t, PyObject_Size};
use pyo3::prelude::*;
use pyo3::types::PyCFunction;

unsafe extern "C" fn fast_len(
    _self: *mut ffi::PyObject,
    args: *const *mut ffi::PyObject,
    nargs: ffi::Py_ssize_t,
    _kwnames: *mut ffi::PyObject,
) -> *mut ffi::PyObject {
    if nargs != 1 {
        ffi::PyErr_SetString(ffi::PyExc_TypeError, "expected one positional argument".as_ptr().cast());
        return std::ptr::null_mut();
    }
    let len = PyObject_Size(*args);
    if len == -1 {
        ffi::PyErr_SetString(ffi::PyExc_TypeError, "expected a sequence".as_ptr().cast());
        std::ptr::null_mut()
    } else {
        PyLong_FromSize_t(len as usize)
    }
}

static FAST_LEN: pyo3::PyMethodDef = pyo3::PyMethodDef::fastcall_cfunction_with_keywords(
    "fast_len",
    pyo3::methods::PyCFunctionFastWithKeywords(fast_len),
    "",
);

#[pyfunction]
pub fn slow_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
    obj.len()
}

#[pymodule]
fn py_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(PyCFunction::internal_new(&FAST_LEN, m.into())?)?;
    m.add_function(wrap_pyfunction!(slow_len, m)?)?;
    Ok(())
}

I'm installing pyo3 with

pyo3 = {git = "https://github.com/davidhewitt/pyo3", branch="complete-py2"}

So I can use the new Bound API.

Feb 11 '24 23:02 samuelcolvin

Thanks! I'll see to break this down further in the morning 👍

Feb 11 '24 23:02 davidhewitt

Thanks!

My actual use case was struct/class methods where I think there's the same overhead, but I don't know how to implement t them without #[pyclass].

Feb 11 '24 23:02 samuelcolvin

We still construct a GIL Pool and poke and prod at it, right? It's just that it's always empty, in principle?

Feb 12 '24 00:02 alex

Note that we should be able to get there in 0.22, i.e. the gil-refs feature will not just silence deprecations of the old API but the old API and the pool backing it will be fully removed when that feature is disabled as we already experimented on in #3685.

(You should be able to see the effects already by rebasing that patch and adjust the feature name. It just is nothing we can ship because it is still possible to reach the unimplemented!() via the old API.

Also note we do still have the global reference count pool, i.e. handling calls to Py::clone without the GIL being held. This will not go away just via the new API and might still add overhead. I think it can go away in a nogil Python build since the more expensive cross-thread reference counting will then just be the general case for the Python reference counting itself.)

Feb 12 '24 06:02 adamreichold

I just took a quick look at this. The timings on my machine come out very similarly to @samuelcolvin with the original code, so I'll assume my new measurement is comparable.

I applied the same patch as in the bottom section of https://github.com/PyO3/pyo3/issues/3787#issuecomment-1918721168 and reran with this analysis. I also made sure LTO was enabled to get inlining to be aggressive as possible to make the generated code more similar to the "baremetal" snippet.

With that done, I measure ~17.5ns for the "slow" code above, with no changes on the user's end. So we can make a significant dent of the difference with a GIL Pool removed and also with the global Py reference counting replaced by nogil.

Regardless, ~17.5ns does imply a little bit of extra framework overhead over the "baremetal" still, but that's still cut the overall function execution by 50% from 33.5ns, so we've already made a huge impact. It's worth remembering that there's a couple of extra pieces of work which PyO3 does which are somewhat fundamental:

We wrap code in panic::catch_unwind to translate panics at the function boundary into Python exceptions.
We allow keyword arguments (you can add signature = (obj, /) above to disable them, but the underlying framework will still check there are no keywords).

Once the main bulk of the overheads are gone and we're into this 17.5ns regime it would be interesting to see if we can optimize further, but I'd be surprised if there was much more to be won.

I also wonder whether we could already move the global Py reference counting to a dedicated thread which attempts to wake and update counts at intervals, rather than doing this on every function call? In a Python with the GIL it will affect singlethreaded throughput a little, but once nogil is present that thread shouldn't affect throughput (and as @adamreichold says it might not be necessary at all).

Feb 13 '24 20:02 davidhewitt

I also wonder whether we could already move the global Py reference counting to a dedicated thread which attempts to wake and update counts at intervals, rather than doing this on every function call?

IIRC, we had soundness issues when those updates were missed before resuming GIL-dependent Rust code, e.g. https://github.com/PyO3/pyo3/commit/83f5fa2902a8376906362f2ddc18ad2fb867047c

Feb 13 '24 20:02 adamreichold

Right, yes. That would make it very difficult to do anything other than what we already do, at least until nogil comes along 👍

Feb 13 '24 20:02 davidhewitt

pyo3 pyo3 copied to clipboard

Performance: calling overhead

pyo3
pyo3 copied to clipboard