tantivy-py icon indicating copy to clipboard operation
tantivy-py copied to clipboard

Updated `pythonize` and `pyo3`

Open jamesbraza opened this issue 1 year ago • 12 comments

Closes https://github.com/quickwit-oss/tantivy-py/issues/371

jamesbraza avatar Dec 17 '24 04:12 jamesbraza

The reason the pickle test fails is because of something that is different in the pythonize package in the new version. When a Document object is deserialized, values in the json object that were i64, become u64 after deserialization:

Comparing:

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

Look carefully at the types in the json object. Before serialization, I64(1), but after deserialization, U64. This seems to only happen with the JSON object. If that is commented (.add_json) in the test then the test passes.

cjrh avatar Jan 08 '25 13:01 cjrh

Yeah I concur with your breakdown that the serialization/deserialization cycle is considering the JSON's a key to be unsigned now.

If that is commented (.add_json) in the test then the test passes.

Fwiw, going to assert orig.to_dict() == pickled.to_dict() leads to the test passing as well. So it's something within __eq__ specifically.


How are you getting that printed representation with I64/U64? Using repr or to_dict I don't see those values.

Also, Document has no __eq__ method, so I am not yet understanding how the __eq__ check is actually failing.

jamesbraza avatar Feb 17 '25 22:02 jamesbraza

If the assert .to_dict() change makes the test pass, I am ok to proceed with that. On the python side there is no difference between I64 and u64, nor in a json representation of those payloads if they ever get serialized for anything. What do you reckon?

cjrh avatar Feb 17 '25 23:02 cjrh

I think it's important for __eq__ to work across serialization/deserialization.

What do you think of implementing a Document.__eq__ to invoke to_dict:

class Document:
    ...

    def __eq__(self, other) -> bool:
        if not isinstance(other, type(self)):
            return NotImplemented
        return self.to_dict() == other.to_dict()

Also, it may be preferred to do this in document.rs with the Rust API, as opposed to in Python only.

jamesbraza avatar Feb 17 '25 23:02 jamesbraza

Overriding __eq__ makes me nervous.

I think it's important for eq to work across serialization/deserialization.

Good point. Perhaps we should make a reproducer of the issue using only pythonize, and ask from help from that project.

cjrh avatar Feb 17 '25 23:02 cjrh

I would be happy to do so, if you can point out how you printed the Document such that U64/I64 stuff showed up.

Also, just so I can understand, how do you know the issue is with pythonize vs pyo3?

jamesbraza avatar Feb 18 '25 00:02 jamesbraza

if you can point out how you printed the Document such that U64/I64 stuff showed up.

From memory I put print statements somewhere. Hopefully I still have that code somewhere. I will check tomorrow.

cjrh avatar Feb 18 '25 01:02 cjrh

To print out that data, I added a print inside __richcmp__ like this:

impl Document {
    fn __richcmp__<'py>(
        &self,
        other: &Self,
        op: CompareOp,
        py: Python<'py>
    ) -> PyResult<Bound<'py, PyAny>> {
        println!("\n\n\nComparing:\n\n{:?}\n\n{:?}", self.field_values, other.field_values);
        match op {
            CompareOp::Eq => {
                let v = (self == other).into_pyobject(py)?.to_owned().into_any();
                Ok(v)
            },
            CompareOp::Ne => {
                let v = (self != other).into_pyobject(py)?.to_owned().into_any();
                Ok(v)
            },
            _ => {
                let v = PyNotImplemented::get(py).to_owned().into_any();
                Ok(v)
            }
        }
    }
}

Tracing backwards from here, I also added some prints in the _internal_from_pythonized() method:

impl Document {
    #[staticmethod]
    fn _internal_from_pythonized(serialized: &Bound<PyAny>) -> PyResult<Self> {
        println!("\n\n\nDeserializing: {:?}", serialized);
        let out = pythonize::depythonize(serialized).map_err(to_pyerr);
        let out: Document = out.unwrap();
        println!("\n\n\nDeserialized: {:?}", out);
        println!("\n\n\nDeserialized: {:?}", out.field_values);
        Ok(out)
    }
}

So when I run the tests I get this stdout:

------------------------------------- Captured stdout call -------------------------------------



Deserializing: {'birth': [{'Date': 1565614805000000000}], 'bytes': [{'Bytes': [97, 98, 99]}], 'facet': [{'Facet': '/europe/france'}], 'float': [{'F64': 1.0}], 'integer': [{'I64': 5}], 'json': [{'Object': {'a': 1, 'b': 2}}], 'title': [{'Str': 'hello world!'}], 'unsigned': [{'U64': 1}]}



Deserialized: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])



Deserialized: {"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
orig: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])
pickled: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])



Comparing:

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}

cjrh avatar Feb 18 '25 10:02 cjrh

Okay I made https://github.com/jamesbraza/tantivy-py/pull/1, which:

  1. PR description states my current understanding of the problem (newer pythonize more aggressively making U64)
  2. Poses a somewhat hacky solution, basically checking if one can convert back to I64

I am not happy with this, but I am really in over my head here, so I made https://github.com/davidhewitt/pythonize/issues/80 basically as a SOS.

jamesbraza avatar Mar 02 '25 23:03 jamesbraza

Perhaps extract_value at https://github.com/quickwit-oss/tantivy-py/blob/0.22.0/src/document.rs#L37-L39 has something to do with it? I don't see a U64 case, more just talking out loud

jamesbraza avatar Mar 02 '25 23:03 jamesbraza

@jamesbraza It would be great if you could add a very short reproducer to your issue on pythonize, so that David can easily run it and see the change in type during roundtrip.

cjrh avatar Mar 03 '25 12:03 cjrh

and

but I am really in over my head here

Don't worry, we're all learning all the time. Thanks for your continued interest in helping here ❤️

cjrh avatar Mar 03 '25 12:03 cjrh

I pushed a commit with a new strategy: for the JSON type, deserialize all integers to I64. This solves our problem with the pickle test, and without breaking any handling for toplevel I64/U64 integers. This is probably the best we can do here.

I will also try to update this branch on top of current master.

cjrh avatar May 04 '25 11:05 cjrh

Successfully merged the main branch in.

cjrh avatar May 04 '25 12:05 cjrh

My apologies for the two month hiatus here, good to see you guys pushed this forward.

To catch myself up and concrete my understanding, just giving a tl;dr on this PR's story:

  1. With newer pythonize, we found I64 objects after deserialization become U64, breaking Document.__eq__.
  2. This happens is because newer pythonize::depythonize prioritizes deserialization to unsigned ints over signed ints, per https://github.com/davidhewitt/pythonize/pull/69.
  3. I posed https://github.com/jamesbraza/tantivy-py/pull/1, which involves expanding Document.__richcmp__ to recursively cast unsigned integers to signed integers.
  4. In https://github.com/davidhewitt/pythonize/issues/80, David poses adding a config option for depythonize to prefer signed ints over unsigned ints.
  5. Caleb just added a solution of, upon deserialization of Object, recursively casting numbers to I64 (if possible), otherwise F64.
    • This is more opinionated than https://github.com/jamesbraza/tantivy-py/pull/1 because it eliminates unsigned values in JSON. It's also more "tantivy-py native" by directly using Object (SerdeValue enum's item for JSON), instead of Document.__richcmp__.

In practice this change is backwards compatible ❤️ as far as I am concerned. The only incompatibility is round trip ser/de's an old Document will convert any unsigned ints to signed ints in JSON, but that doesn't matter. I am down to ship this 🚢 .


In other news, it looks like pythonize::depythonize_bound was just removed in pythonize version 0.24 in https://github.com/davidhewitt/pythonize/pull/83, so I just updated the code for that. This brings cargo build to pass.

jamesbraza avatar May 04 '25 19:05 jamesbraza