
Pickle Support

Open SerialDev opened this issue 7 years ago • 18 comments

As of right now it's not possible to pickle classes created by PyO3.
This feature would be invaluable in situations where some form of persistence is desirable.

As of right now it has trouble pickling after I call

    #[new]
    fn __new__(obj: &PyRawObject) -> PyResult<()>{

Otherwise the .__dict__ attributes are maintained prior to initialization with __new__

SerialDev avatar Dec 27 '17 13:12 SerialDev

Would this mean implementing __getstate__ and __setstate__ methods (cf https://docs.python.org/3/library/pickle.html#pickling-class-instances)?

For instance, the way pickling works for,

might provide some examples.
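As a rough illustration of that protocol on the Python side, here is a minimal pure-Python sketch. `Holder` is a hypothetical stand-in for an extension class, not anything PyO3 generates; it only shows which hooks pickle calls and when.

```python
import pickle


class Holder:
    """Plain-Python stand-in for an extension class (hypothetical name)."""

    def __init__(self, num=0):
        self.num = num

    def __getstate__(self):
        # Called by pickle.dumps: return a picklable description of the state.
        return self.num

    def __setstate__(self, state):
        # Called by pickle.loads on a freshly created instance;
        # __init__ is not run during unpickling.
        self.num = state


restored = pickle.loads(pickle.dumps(Holder(42)))
print(restored.num)
```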

rth avatar Dec 30 '18 10:12 rth

For instance, if we take the documentation example for MyClass,

# use pyo3::prelude::*;
# use pyo3::PyRawObject;
#[pyclass]
struct MyClass {
   num: i32,
}

#[pymethods]
impl MyClass {

     #[new]
     fn new(obj: &PyRawObject, num: i32) {
         obj.init({
             MyClass {
                 num,
             }
         });
     }
}

by default we get the following error when pickling this class,

        obj = MyClass()
    
>       pickle.dumps(obj)
E       TypeError: can't pickle MyClass objects

If we now add the __getstate__/ __setstate__ methods,

    fn __getstate__(&self) -> PyResult<i32> {
        Ok(self.num)
    }

    fn __setstate__(&mut self, state: i32) -> PyResult<()> {
        self.num = state;
        Ok(())
    }

we get another exception,

_pickle.PicklingError: Can't pickle <class 'MyClass'>: attribute lookup MyClass on builtins failed

There is some additional step I must be missing here.

rth avatar May 05 '19 21:05 rth

@rth : this may be related to the fact that PyO3 exposes all classes as part of the builtins module, because the import mechanism has not been properly implemented, so pickle tries to use builtins.MyClass and fails with the error you reported.
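The lookup failure can be reproduced in pure Python, which may help show why the module name matters. This is only a sketch: `MyClass` below is a plain Python class standing in for the extension class, and overwriting `__module__` by hand is an assumption used to mimic the situation described above.

```python
import pickle


class MyClass:
    """Plain-Python stand-in for the extension class (illustrative only)."""

    def __init__(self, num):
        self.num = num


# Pickle stores classes by reference, as "<__module__>.<__qualname__>".
# Pretend the class reports the wrong module.
real_module = MyClass.__module__
MyClass.__module__ = "builtins"
try:
    pickle.dumps(MyClass(1))
    failed = False
except pickle.PicklingError as exc:
    failed = True
    print("pickling failed:", exc)

# With the module that actually contains the class, the lookup succeeds.
MyClass.__module__ = real_module
restored = pickle.loads(pickle.dumps(MyClass(1)))
print(restored.num)
```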

althonos avatar May 06 '19 04:05 althonos

Thanks @althonos ! Opened a separate issue about it in #474

rth avatar May 06 '19 07:05 rth

So by subclassing , to set the __module__ correctly as suggested in https://github.com/PyO3/pyo3/issues/474#issuecomment-489521285, pickling seems to work.

Though, I get a segfault occasionally (i.e. it does seem to be random) at exit. For instance when running a pytest session where one test checks pickling,

gdb --args python3.7 -m pytest -k test_pickle
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
[...]
Reading symbols from /opt/_internal/cpython-3.7.1/bin/python3.7...(no debugging symbols found)...done.
(gdb) run
Starting program: /opt/_internal/cpython-3.7.1/bin/python3.7 -m pytest -k test_pickle
warning: Error disabling address space randomization: Operation not permitted
============================================================= test session starts =============================================================
platform linux -- Python 3.7.1, pytest-4.4.1, py-1.8.0, pluggy-0.9.0 -- /opt/_internal/cpython-3.7.1/bin/python3.7
cachedir: .pytest_cache
rootdir: /src/python
collected 1 items  / 1 selected                                                                                               

my_module/test_pickle.py::test_pickle PASSED

=================================================== 1 passed  in 0.13 seconds ===================================================
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
No stack.

and there is no backtrace. Will try to investigate it later.

rth avatar May 06 '19 07:05 rth

The segfault likely occurs because subclassing is broken

konstin avatar May 06 '19 10:05 konstin

How about trying dill? Pickle can't handle lots of pure python serialisation cases. https://pypi.org/project/dill/

gilescope avatar Nov 15 '19 06:11 gilescope

Not sure if it's interesting; this snippet just got shared on gitter. https://gist.github.com/ethanhs/fd4123487974c91c7e5960acc9aa2a77

davidhewitt avatar Feb 25 '20 07:02 davidhewitt

I've got a simple struct that I need to deepcopy. I'm trying to figure out how to pickle my struct (after getting the TypeError: cannot pickle error). The gist above shows how to do it for a single member, but I'm too much of a newb to see how to do this with multiple members.

I tried

pub fn __getstate__(&self, py: Python) -> PyResult<PyObject> {
        Ok(PyBytes::new(py, &serialize(&self.foo).unwrap()).to_object(py))
        Ok(PyBytes::new(py, &serialize(&self.bar).unwrap()).to_object(py))
    }

...but I get the error "expected one of ., ;, ?, }, or an operator" after the first Ok.

shaolo1 avatar Oct 19 '20 23:10 shaolo1

@shaolo1 I would just return the tuple of members:

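    // Needs `use pyo3::exceptions::PyValueError;` at module level.
    // Assumes `serialize` is `bincode::serialize`, as in the gist above.
    pub fn __getstate__(&self, py: Python) -> PyResult<PyObject> {
        // `?` needs a PyResult return type and an explicit error conversion.
        let to_err = |e: bincode::Error| PyValueError::new_err(e.to_string());
        Ok((
            PyBytes::new(py, &serialize(&self.foo).map_err(to_err)?),
            PyBytes::new(py, &serialize(&self.bar).map_err(to_err)?),
        ).to_object(py))
    }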

davidhewitt avatar Oct 24 '20 15:10 davidhewitt

@davidhewitt Thanks. I'll try that if I encounter it again. I got around the problem by just implementing deepcopy in the parent object and handling the copy there so that pickle support was not needed in my rust object.

shaolo1 avatar Oct 24 '20 16:10 shaolo1

I was able to enable pickling by writing the __getstate__, __setstate__, and __getnewargs__ magic methods in pymethods for a pure Rust project using bincode::{deserialize, serialize}. In __getnewargs__ you need to return a tuple of all the arguments __new__ will use on deserialization, otherwise you'll see something like TypeError: MyStruct.__new__() missing 2 required positional arguments: 'my_first_arg' and 'my_second_arg'.

Here is a generic example:

pub fn __setstate__(&mut self, state: Vec<u8>) -> PyResult<()> {
    *self = deserialize(&state).unwrap();
    Ok(())
}
pub fn __getstate__(&self) -> PyResult<Vec<u8>> {
    Ok(serialize(&self).unwrap())
}
pub fn __getnewargs__(&self) -> PyResult<(f64, f64)> {
    Ok((self.my_first_arg, self.my_second_arg))
}
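To see why __getnewargs__ is needed, here is a pure-Python sketch of the same situation. The class names are hypothetical stand-ins for a struct whose constructor has required arguments; the Rust methods above play the same roles.

```python
import pickle


class MyStruct:
    """Stand-in for a class whose constructor has required arguments."""

    def __new__(cls, my_first_arg, my_second_arg):
        self = super().__new__(cls)
        self.my_first_arg = my_first_arg
        self.my_second_arg = my_second_arg
        return self


class MyStructWithNewArgs(MyStruct):
    def __getnewargs__(self):
        # Arguments that pickle passes to __new__ when unpickling.
        return (self.my_first_arg, self.my_second_arg)


# Without __getnewargs__, dumping works but loading calls __new__
# with no arguments and fails.
payload = pickle.dumps(MyStruct(1.0, 2.0))
try:
    pickle.loads(payload)
    error = None
except TypeError as exc:
    error = exc
    print("unpickling failed:", exc)

restored = pickle.loads(pickle.dumps(MyStructWithNewArgs(1.0, 2.0)))
print(restored.my_first_arg, restored.my_second_arg)
```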

Also, here is a code example for the workaround @shaolo1 mentioned. Cloning for deepcopy may be faster than serializing & deserializing (which I guess is how Python deepcopies normally?), but I haven't tested that.

pub fn copy(&self) -> Self {self.clone()}
pub fn __copy__(&self) -> Self {self.clone()}
pub fn __deepcopy__(&self, _memo: &PyDict) -> Self {self.clone()}

That'll allow you to return a clone using copy.copy(), copy.deepcopy(), or by calling the .copy() method.
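On the Python side, the copy module simply dispatches to these hooks when they exist. When __deepcopy__ is missing, copy.deepcopy falls back to the same __reduce_ex__ machinery that pickle uses and rebuilds the object in memory, without producing a byte stream. A small sketch, with `Point` as a hypothetical stand-in for the Rust struct:

```python
import copy

calls = []  # records which hook the copy module invoked


class Point:
    """Stand-in for a cloneable extension class (hypothetical name)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __copy__(self):
        calls.append("__copy__")
        return Point(self.x, self.y)

    def __deepcopy__(self, memo):
        # memo maps id(original) -> copy; unused here because the
        # fields are plain numbers.
        calls.append("__deepcopy__")
        return Point(self.x, self.y)


p = Point(1, 2)
shallow = copy.copy(p)
deep = copy.deepcopy(p)
print(calls)
```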

Edits:

  • Also important to note I needed to change #[pyclass] to #[pyclass(module = "mymodulename")]
  • It seems like bincode is performing rather slowly, so I'm trying to figure out how to use serde_bytes to speed things up. Maybe in conjunction with PyBytes? Though I want to avoid the GIL wherever I possibly can.

kylecarow avatar Aug 18 '22 15:08 kylecarow

Yes, a Vec<u8> is converted into a Python list, one byte (as a Python int) at a time. I think you do need to use PyBytes here, and wanting to avoid the GIL is irrelevant because these are Python methods you're implementing.

I think you want something like this:

pub fn __setstate__(&mut self, state: &PyBytes) -> PyResult<()> {
    *self = deserialize(state.as_bytes()).unwrap();
    Ok(())
}
pub fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
    Ok(PyBytes::new(py, serialize(&self).unwrap()))
}
pub fn __getnewargs__(&self) -> PyResult<(f64, f64)> {
    Ok((self.my_first_arg, self.my_second_arg))
}

I would also strongly recommend you replace .unwrap() with conversion to actual PyResult errors :)
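The cost difference between a list of ints and a bytes object is visible from Python alone. The sketch below only compares pickled sizes for an arbitrary sample payload; it says nothing about the Rust-side conversion cost, which is an additional factor.

```python
import pickle

data = bytes(range(256)) * 4  # 1024 bytes of sample payload

as_list = pickle.dumps(list(data))  # state returned as a list of ints
as_bytes = pickle.dumps(data)       # state returned as a bytes object

# Each int in the list is pickled as a separate item, so the list
# form is noticeably larger than the raw bytes form.
print(len(as_list), len(as_bytes))
```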

davidhewitt avatar Aug 19 '22 13:08 davidhewitt

Woah, yeah that sped up my round trip serializing and deserializing benchmark by 100x. And thanks for the tip about PyResult errors. I did have to modify __getstate__ ever so slightly to add a reference:

pub fn __getstate__<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
    Ok(PyBytes::new(py, &serialize(&self).unwrap()))
}

I also did some benchmarking with my structs regarding the performance of cloning vs. roundtrip pickling and bincode serde, which might be useful to someone:

  • Having a __deepcopy__ pymethod that calls .clone() is by far the fastest way I've found of copying a pyo3 object. My benchmark took 1.38 usec
  • The next best thing is having bincode serde methods, whose roundtrip took 15.6 usec (before the PyBytes change it took 1.28 msec):
    pub fn to_bincode<'py>(&self, py: Python<'py>) -> PyResult<&'py PyBytes> {
        Ok(PyBytes::new(py, &serialize(&self).unwrap()))
    }
    #[classmethod]
    pub fn from_bincode(_cls: &PyType, encoded: &PyBytes) -> PyResult<Self> {
        Ok(deserialize(encoded.as_bytes()).unwrap())
    }
    
  • The least performant is pickling, as expected. I guess Python has a lot more overhead here. It took 439 usec roundtrip.
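A comparison of this kind can be set up with timeit. The sketch below uses a hypothetical pure-Python `Record` class rather than a PyO3 object, so the absolute numbers will not match the figures above; it only shows the shape of such a benchmark.

```python
import copy
import pickle
import timeit


class Record:
    """Stand-in for an extension class with a cheap clone (hypothetical)."""

    def __init__(self, values):
        self.values = values

    def __deepcopy__(self, memo):
        # Plays the role of calling .clone() on the Rust side.
        return Record(list(self.values))


record = Record(list(range(100)))

deepcopy_time = timeit.timeit(lambda: copy.deepcopy(record), number=1000)
pickle_time = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(record)), number=1000
)

print(f"deepcopy: {deepcopy_time:.6f}s, pickle roundtrip: {pickle_time:.6f}s")
```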

kylecarow avatar Aug 22 '22 15:08 kylecarow

~since __setstate__ requires a mutable reference is there a possibility to have a pickle support for a #[pyclass(frozen)] class?~

never mind, I've switched to the __reduce__ method

https://github.com/lycantropos/rithm/blob/765d1990800d47e169f84912b16a9857c0575fff/src/lib.rs#L441-L449

lycantropos avatar Jun 13 '23 03:06 lycantropos

You can also use __getnewargs__ or __getnewargs_ex__, which is the simplest option if you can pass all your state directly back to #[new] when unpickling (I would guess this is true for most frozen classes).
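For an immutable class, the __reduce__ approach can be sketched in pure Python as follows. `Ratio` is a hypothetical name and is not the linked code; the point is that unpickling goes through the constructor, so no __setstate__ (and no mutation) is required.

```python
import pickle


class Ratio:
    """Immutable stand-in for a frozen extension class (hypothetical)."""

    __slots__ = ("num", "den")

    def __new__(cls, num, den):
        self = super().__new__(cls)
        # Bypass the frozen __setattr__ below during construction.
        object.__setattr__(self, "num", num)
        object.__setattr__(self, "den", den)
        return self

    def __setattr__(self, name, value):
        raise AttributeError("Ratio is immutable")

    def __reduce__(self):
        # (callable, args): unpickling calls Ratio(num, den) again.
        return (type(self), (self.num, self.den))


restored = pickle.loads(pickle.dumps(Ratio(3, 4)))
print(restored.num, restored.den)
```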

davidhewitt avatar Jun 13 '23 06:06 davidhewitt