pyo3 icon indicating copy to clipboard operation
pyo3 copied to clipboard

Implement `ToPyObject` for `bytes::Bytes`

Open Eric-Arellano opened this issue 2 years ago • 13 comments

The bytes crate from Tokio is really useful for operating on bytes in a performant way: https://docs.rs/crate/bytes/1.1.0.

We use it in Pants. I'd love to convert a #[pyfunction] we have to be able to stop acquiring the GIL inside of a future so that we can better use py.allow_threads(||). But lifetimes mean that we only have access to &[u8] in the future, so we would need to convert into Vec<u8>, which is more copying than we can tolerate, especially because iirc that Vec<u8> would then get copied again into the PyBytes?. https://github.com/pantsbuild/pants/blob/d62a3cd87c114e379748edbfa279f6fa11da6fd5/src/rust/engine/src/externs/interface.rs#L1399

Instead, I think I'm hoping to have that future convert the &[u8] into Bytes (which does zero copying). And then PyO3 will convert Bytes into PyBytes automatically outside of my #[pyfunction] - allowing me to use py.allow_threads() the whole time. Does that make sense?

I imagine this would be a new Cargo feature.

I'm very happy to contribute this!

Eric-Arellano avatar Nov 15 '21 04:11 Eric-Arellano

convert the &[u8] into Bytes (which does zero copying)

Only if the slice has a static lifetime. Otherwise the slice has to be copied regardless.

mejrs avatar Nov 15 '21 07:11 mejrs

This would be very welcome IMO! I've wondered about adding an optional bytes dependency myself in the past but never got around to it. There's a bunch of optional deps like num-bigint with conversions added in src/conversions - you could add a new bytes.rs file following those examples. 👍

davidhewitt avatar Nov 15 '21 07:11 davidhewitt

Only if the slice has a static lifetime. Otherwise the slice has to be copied regardless.

Oh, hm, yeah you're right: https://docs.rs/bytes/1.1.0/bytes/struct.Bytes.html#method.copy_from_slice Good point. We can possibly change that internal API to return Bytes though rather than &[u8], though, so this feature request would still be really helpful to Pants.

you could add a new bytes.rs file following those examples. 👍

Sweet! Thanks.

Eric-Arellano avatar Nov 15 '21 07:11 Eric-Arellano

I don't think we need anything new for this. Converting Bytes into PyBytes works today (specifically, Bytes autodereferences to &[u8]):

use pyo3::prelude::*;
use pyo3::types::PyBytes;
use bytes::Bytes;

fn main() {
    let bytes = Bytes::from_static(&[1,2,3,4]);
    
    Python::with_gil(|py|{
        let pb = PyBytes::new(py, &bytes);
        dbg!(pb);
    })
}

Looking at the capi and Bytes' implementation it doesn't look like we can do something clever here.

mejrs avatar Nov 15 '21 08:11 mejrs

Bytes may not work as a return value from #[pyfunction] without an IntoPy<PyObject> trait implementation? (And similarly as a #[pyfunction] argument.)

davidhewitt avatar Nov 15 '21 08:11 davidhewitt

That is true, and it would be more convenient. But it has some footgunny behavior - I can totally see someone starting with a Vec<u8>, converting it to a Bytes (clone one), which then gets converted (second clone) into a PyBytes. While it would in fact be more performant to just use PyBytes::new(py, &my_vec) (cloning only once).

I don't think it would be a disaster or anything, but I think it would be a mistake to offer this api if we can't implement it in a clever way.

Perhaps this is best offered in a container-agnostic way, like impl IntoPy<Py<PyBytes>> for AsRef<[u8]> (would that work? I haven't tried), which also requires an explicit conversion.

mejrs avatar Nov 15 '21 09:11 mejrs

I believe Bytes::from(vec) just takes ownership of the underlying buffer rather than having to copy the memory (but I could be wrong).

impl IntoPy<Py<PyBytes>> for AsRef<[u8]>

Implementations like this are possible, though I kinda don't like them because I think each Rust type should map to only one Python type. In the future I would like to remove the generic parameter from IntoPy and instead have an associated type, but last time I looked at that it needs min_specialization to be able to deal with some edge cases like () mapping to Python tuple except in return values where it should map to None.

davidhewitt avatar Nov 15 '21 09:11 davidhewitt

I believe Bytes::from(vec) just takes ownership of the underlying buffer rather than having to copy the memory (but I could be wrong).

It goes through a boxed slice, which can reallocate if there is excess capacity.

mejrs avatar Nov 15 '21 10:11 mejrs

It goes through a boxed slice, which can reallocate if there is excess capacity.

Ah interesting, I wasn't aware of that. I guess that's one of those impossible API questions where different users want different things 🤷

EDIT: actually thinking about it, not reallocating and continuing to hold the extra capacity is a perf optimization that would confuse a lot of users seeing excess memory consumption, so std probably makes the right choice.

davidhewitt avatar Nov 15 '21 12:11 davidhewitt

I'm thinking the best way to do this is to create an object that wraps a Bytes and implements the buffer protocol?

mejrs avatar Nov 18 '21 08:11 mejrs

That would be very efficient!

Because of the complexity of #1444 I shy away from utility pyclasses in libraries at the moment. It's definitely something we need to figure out, however at the moment each extension module would get its own copy of such a type and they wouldn't compare equal etc.

davidhewitt avatar Nov 18 '21 08:11 davidhewitt

(However we could create an example for users to adapt in their own code? Also in #1884 I hope to offer a safe way to implement buffer protocol using #[pymethods] magic methods soon...)

davidhewitt avatar Nov 18 '21 08:11 davidhewitt

An example sounds totally reasonable to me!

I definitely would not want adding this convenience to impede future foundational improvements to PyO3.

Eric-Arellano avatar Nov 18 '21 08:11 Eric-Arellano