pyo3 icon indicating copy to clipboard operation
pyo3 copied to clipboard

Add performance suggestions to docs

Open samuelcolvin opened this issue 2 years ago • 9 comments

Hi, thanks so much for pyo3 - it's wonderful, I'm using it a lot (see rtoml and watchfiles, I might be using it pydantic soon too).

I wonder if it would be worth adding a "Performance Suggestions" or similar section to the docs, perhaps the FAQs?

Examples:

cast_as vs extract

I spent a lot of the morning very worried that pyo3 rust making heavy use of python types was actually slower than python, until by chance I tried cast_as instead of extract.

// fast
let list: &PyList = py_any.cast_as()?;
// slow
let list: &PyList = py_any.extract()?;

This change in my case made a ~4x improvement to performance.

Optimised compilation

While investigating the above, I noticed this SO question referenced on #1470 didn't mention that the performance is massively effected by debug/release compilation. I've added an answer to the question with suggestions.


These are both no doubt very obvious to the maintainers of this library and sessioned users, but to those new to rust and pyo3, they might be less obvious.

I'm sure there are other things which can have a dramatic effect on performance which I'm not ware of - I've love to hear any other general suggestions?


Update:

Other things to suggest when this section is written:

  • interning strings where they're being used a lot, a good example is if you're creating a dict with common keys, performance can be significantly improved by reusing PyStrings rather than rust strings

samuelcolvin avatar Apr 06 '22 10:04 samuelcolvin

I spent a lot of the morning very worried that pyo3 rust making heavy use of python types was actually slower than python, until by chance I tried cast_as instead of extract.

I am surprised by this. Shouldn't both cast_as and extract for a native type end up calling PyTryFrom::try_from? cast_as directly and extract via pyobject_native_type_extract! which should both end up here.

Is it possible to show that performance difference using a self-contained reproducer?

adamreichold avatar Apr 06 '22 10:04 adamreichold

I tried the following benchmark

diff --git a/benches/bench_frompyobject.rs b/benches/bench_frompyobject.rs
index 8cf8a4da7..ca4b80bb2 100644
--- a/benches/bench_frompyobject.rs
+++ b/benches/bench_frompyobject.rs
@@ -1,6 +1,6 @@
-use criterion::{criterion_group, criterion_main, Bencher, Criterion};
+use criterion::{black_box, criterion_group, criterion_main, Bencher, Criterion};
 
-use pyo3::{prelude::*, types::PyString};
+use pyo3::{prelude::*, types::{PyList, PyString}};
 
 #[derive(FromPyObject)]
 enum ManyTypes {
@@ -18,8 +18,30 @@ fn enum_from_pyobject(b: &mut Bencher<'_>) {
     })
 }
 
+fn list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).cast_as().unwrap();
+        });
+    })
+}
+
+fn list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).extract().unwrap();
+        });
+    })
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("enum_from_pyobject", enum_from_pyobject);
+    c.bench_function("list_via_cast_as", list_via_cast_as);
+    c.bench_function("list_via_extract", list_via_extract);
 }
 
 criterion_group!(benches, criterion_benchmark);

and found no measurable difference:

list_via_cast_as        time:   [1.1823 ns 1.1823 ns 1.1823 ns]                              
list_via_extract        time:   [1.8916 ns 1.8918 ns 1.8920 ns]                              

Could it be that something besides using extract or cast_as changed?

adamreichold avatar Apr 06 '22 11:04 adamreichold

Code is currently closed source to avoid confusion, but if I can make it work it will end up being OS.

I'll invite you as a collaborator so you can see what I mean, feel free to ignore the invitation if perfer.

samuelcolvin avatar Apr 06 '22 11:04 samuelcolvin

I'll invite you as a collaborator so you can see what I mean, feel free to ignore the invitation if perfer.

Since the commits always change multiple things (like PyObject and <&PyAny>::extract to just &PyAny) or returning an error instead of raising an exception, I would suggest rolling back the mechanical replacements of extract by cast_as and checking whether it still makes a difference. (I expect it to not make a difference.)

EDIT: Just noticed the PR you opened doing exactly that.

adamreichold avatar Apr 06 '22 11:04 adamreichold

I wonder if it would be worth adding a "Performance Suggestions" or similar section to the docs, perhaps the FAQs?

Regardless of the outcome of this issue I think this is a great idea 👍

mejrs avatar Apr 06 '22 11:04 mejrs

So I think the problem is the following: The benchmarked code is basically the following

if let Ok(list) = any.extract::<&PyList>() { ... } else { ... }

where the else branch is always taken. extract returns PyResult<T> instead of Result<T, PyDowncastError> which means into must call From<PyDowncastError> for PyErr which will turn the PyDowncastError into a string here.

The difference is indeed significant as show by the benchmark results:

list_via_cast_as        time:   [3.0937 ns 3.2319 ns 3.3556 ns]                            
list_via_extract        time:   [3.0724 ns 3.0798 ns 3.0897 ns]                              
not_a_list_via_cast_as  time:   [3.4965 ns 3.4974 ns 3.4982 ns]           
not_a_list_via_extract  time:   [292.06 ns 292.80 ns 293.61 ns]                                   

given by the diff

diff --git a/benches/bench_frompyobject.rs b/benches/bench_frompyobject.rs
index 8cf8a4da7..86d2047df 100644
--- a/benches/bench_frompyobject.rs
+++ b/benches/bench_frompyobject.rs
@@ -1,6 +1,6 @@
-use criterion::{criterion_group, criterion_main, Bencher, Criterion};
+use criterion::{black_box, criterion_group, criterion_main, Bencher, Criterion};
 
-use pyo3::{prelude::*, types::PyString};
+use pyo3::{prelude::*, types::{PyList, PyString}};
 
 #[derive(FromPyObject)]
 enum ManyTypes {
@@ -18,8 +18,52 @@ fn enum_from_pyobject(b: &mut Bencher<'_>) {
     })
 }
 
+fn list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).cast_as().unwrap();
+        });
+    })
+}
+
+fn list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).extract().unwrap();
+        });
+    })
+}
+
+fn not_a_list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyString::new(py, "foobar").into();
+
+        b.iter(|| {
+            black_box(any).cast_as::<PyList>().unwrap_err();
+        });
+    })
+}
+
+fn not_a_list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyString::new(py, "foobar").into();
+
+        b.iter(|| {
+            black_box(any).extract::<&PyList>().unwrap_err();
+        });
+    })
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("enum_from_pyobject", enum_from_pyobject);
+    c.bench_function("list_via_cast_as", list_via_cast_as);
+    c.bench_function("list_via_extract", list_via_extract);
+    c.bench_function("not_a_list_via_cast_as", not_a_list_via_cast_as);
+    c.bench_function("not_a_list_via_extract", not_a_list_via_extract);
 }
 
 criterion_group!(benches, criterion_benchmark);

This only affects the Err path, not the Ok. But then again, if-else-if-chains to determine the correct type are probably common in Rust-code-called-from-Python-code.

EDIT: I also tried replacing err.to_string() by String::new and this is indeed almost all of the slow down. Only a small part is due to PyTypeError::new_err itself.)

adamreichold avatar Apr 06 '22 11:04 adamreichold

Using something like

#[derive(FromPyObject)]
enum ListOrDictOrAny<'a> {
    List(&'a PyList),
    Dict(&'a PyDict),
    Any(&'a PyAny),
}

also does not help as the generated implementation is probably also an if-else-if chain of calls to extract.

adamreichold avatar Apr 06 '22 11:04 adamreichold

This is really interesting, thank you so much.

Stuff like this scares me a little, the explanation is beyond my knowledge. If I hadn't happened upon this fix, I would probably have abandoned the entire project assuming the performance gains I had hoped for weren't possible.

samuelcolvin avatar Apr 06 '22 11:04 samuelcolvin

Agreed, I've wanted to write some notes along these lines for a long time. I think the main thing holding me off has been lack of time, and I'd kind of been hoping I could make progress in #1308 first (which would reduce a few known PyO3 overheads and likely change a number of performance best practices).

davidhewitt avatar Apr 10 '22 09:04 davidhewitt

Another point for the performance section:

as per https://github.com/pydantic/pydantic-core/pull/501,

PyErr::new::<CustomError, _>(([args]))

is significantly slower than

let custom_error = CustomError::new([args]);
match Py::new(py, validation_error) {
    Ok(err) => PyErr::from_value(err.into_ref(py)),
    Err(err) => err,
}

For some errors, I think because PyErr::new converts args to python objects and back again.

That change (found by chance) improved some of our benchmarks by 30%.

samuelcolvin avatar Apr 01 '23 07:04 samuelcolvin

Superseded by #3310

davidhewitt avatar Jul 17 '23 21:07 davidhewitt