Updated `pythonize` and `pyo3`
Closes https://github.com/quickwit-oss/tantivy-py/issues/371
The reason the pickle test fails is because of something that is different in the pythonize package in the new version. When a Document object is deserialized, values in the json object that were i64, become u64 after deserialization:
Comparing:
{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": I64(-2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
Look carefully at the types in the json object. Before serialization, I64(1), but after deserialization, U64. This seems to only happen with the JSON object. If that is commented (.add_json) in the test then the test passes.
Yeah I concur with your breakdown that the serialization/deserialization cycle is considering the JSON's a key to be unsigned now.
If that is commented (
.add_json) in the test then the test passes.
Fwiw, going to assert orig.to_dict() == pickled.to_dict() leads to the test passing as well. So it's something within __eq__ specifically.
How are you getting that printed representation with I64/U64? Using repr or to_dict I don't see those values.
Also, Document has no __eq__ method, so I am not yet understanding how the __eq__ check is actually failing.
If the assert .to_dict() change makes the test pass, I am ok to proceed with that. On the python side there is no difference between I64 and u64, nor in a json representation of those payloads if they ever get serialized for anything. What do you reckon?
I think it's important for __eq__ to work across serialization/deserialization.
What do you think of implementing a Document.__eq__ to invoke to_dict:
class Document:
...
def __eq__(self, other) -> bool:
if not isinstance(other, type(self)):
return NotImplemented
return self.to_dict() == other.to_dict()
Also, it may be preferred to do this in document.rs with the Rust API, as opposed to in Python only.
Overriding __eq__ makes me nervous.
I think it's important for eq to work across serialization/deserialization.
Good point. Perhaps we should make a reproducer of the issue using only pythonize, and ask from help from that project.
I would be happy to do so, if you can point out how you printed the Document such that U64/I64 stuff showed up.
Also, just so I can understand, how do you know the issue is with pythonize vs pyo3?
if you can point out how you printed the Document such that U64/I64 stuff showed up.
From memory I put print statements somewhere. Hopefully I still have that code somewhere. I will check tomorrow.
To print out that data, I added a print inside __richcmp__ like this:
impl Document {
fn __richcmp__<'py>(
&self,
other: &Self,
op: CompareOp,
py: Python<'py>
) -> PyResult<Bound<'py, PyAny>> {
println!("\n\n\nComparing:\n\n{:?}\n\n{:?}", self.field_values, other.field_values);
match op {
CompareOp::Eq => {
let v = (self == other).into_pyobject(py)?.to_owned().into_any();
Ok(v)
},
CompareOp::Ne => {
let v = (self != other).into_pyobject(py)?.to_owned().into_any();
Ok(v)
},
_ => {
let v = PyNotImplemented::get(py).to_owned().into_any();
Ok(v)
}
}
}
}
Tracing backwards from here, I also added some prints in the _internal_from_pythonized() method:
impl Document {
#[staticmethod]
fn _internal_from_pythonized(serialized: &Bound<PyAny>) -> PyResult<Self> {
println!("\n\n\nDeserializing: {:?}", serialized);
let out = pythonize::depythonize(serialized).map_err(to_pyerr);
let out: Document = out.unwrap();
println!("\n\n\nDeserialized: {:?}", out);
println!("\n\n\nDeserialized: {:?}", out.field_values);
Ok(out)
}
}
So when I run the tests I get this stdout:
------------------------------------- Captured stdout call -------------------------------------
Deserializing: {'birth': [{'Date': 1565614805000000000}], 'bytes': [{'Bytes': [97, 98, 99]}], 'facet': [{'Facet': '/europe/france'}], 'float': [{'F64': 1.0}], 'integer': [{'I64': 5}], 'json': [{'Object': {'a': 1, 'b': 2}}], 'title': [{'Str': 'hello world!'}], 'unsigned': [{'U64': 1}]}
Deserialized: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])
Deserialized: {"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
orig: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])
pickled: Document(birth=[2019-08-12],bytes=[[97, 98, 9],facet=[/europe/fr],float=[1],integer=[5],json=[{"a":1,"b"],title=[hello worl],unsigned=[1])
Comparing:
{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": I64(1), "b": I64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
{"birth": [Date(2019-08-12T13:00:05Z)], "bytes": [Bytes([97, 98, 99])], "facet": [Facet(Facet(/europe/france))], "float": [F64(1.0)], "integer": [I64(5)], "json": [Object({"a": U64(1), "b": U64(2)})], "title": [Str("hello world!")], "unsigned": [U64(1)]}
Okay I made https://github.com/jamesbraza/tantivy-py/pull/1, which:
- PR description states my current understanding of the problem (newer
pythonizemore aggressively making U64) - Poses a somewhat hacky solution, basically checking if one can convert back to I64
I am not happy with this, but I am really in over my head here, so I made https://github.com/davidhewitt/pythonize/issues/80 basically as a SOS.
Perhaps extract_value at https://github.com/quickwit-oss/tantivy-py/blob/0.22.0/src/document.rs#L37-L39 has something to do with it? I don't see a U64 case, more just talking out loud
@jamesbraza It would be great if you could add a very short reproducer to your issue on pythonize, so that David can easily run it and see the change in type during roundtrip.
and
but I am really in over my head here
Don't worry, we're all learning all the time. Thanks for your continued interest in helping here ❤️
I pushed a commit with a new strategy: for the JSON type, deserialize all integers to I64. This solves our problem with the pickle test, and without breaking any handling for toplevel I64/U64 integers. This is probably the best we can do here.
I will also try to update this branch on top of current master.
Successfully merged the main branch in.
My apologies for the two month hiatus here, good to see you guys pushed this forward.
To catch myself up and concrete my understanding, just giving a tl;dr on this PR's story:
- With newer
pythonize, we found I64 objects after deserialization become U64, breakingDocument.__eq__. - This happens is because newer
pythonize::depythonizeprioritizes deserialization to unsigned ints over signed ints, per https://github.com/davidhewitt/pythonize/pull/69. - I posed https://github.com/jamesbraza/tantivy-py/pull/1, which involves expanding
Document.__richcmp__to recursively cast unsigned integers to signed integers. - In https://github.com/davidhewitt/pythonize/issues/80, David poses adding a config option for
depythonizeto prefer signed ints over unsigned ints. - Caleb just added a solution of, upon deserialization of
Object, recursively casting numbers to I64 (if possible), otherwise F64.- This is more opinionated than https://github.com/jamesbraza/tantivy-py/pull/1 because it eliminates unsigned values in JSON. It's also more "
tantivy-pynative" by directly usingObject(SerdeValueenum's item for JSON), instead ofDocument.__richcmp__.
- This is more opinionated than https://github.com/jamesbraza/tantivy-py/pull/1 because it eliminates unsigned values in JSON. It's also more "
In practice this change is backwards compatible ❤️ as far as I am concerned. The only incompatibility is round trip ser/de's an old Document will convert any unsigned ints to signed ints in JSON, but that doesn't matter. I am down to ship this 🚢 .
In other news, it looks like pythonize::depythonize_bound was just removed in pythonize version 0.24 in https://github.com/davidhewitt/pythonize/pull/83, so I just updated the code for that. This brings cargo build to pass.