ibis
feat: support stable semantics hash for ibis expr
We built this for our internal use of ibis, and it recently broke with a newer ibis version, so I'd like to discuss the possibility of adding this support to ibis itself.
Semantics Hash
The idea of a semantics hash is to compute a hash that represents the result of an ibis expr, so that results can be cached and communicated between different programs that use ibis exprs.
Example: (1) Program A passes an ibis expr to Program B as input. (2) Program B computes the semantics hash of the ibis expr. (3) Program B uses the semantics hash to check whether it already knows the result of that ibis expr, and potentially uses the cached result for the computation.
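The workflow in the example can be sketched roughly as follows. This is a toy illustration, not ibis code: `ToyExpr`, `ProgramB`, and `semantics_hash` are hypothetical names, and the hash here is taken over `repr` purely as a stand-in for whatever stable serialization is used.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for an ibis expression."""

    table: str
    predicate: str


def semantics_hash(expr: ToyExpr) -> str:
    # Stand-in: hash a deterministic text form of the expression.
    return hashlib.sha256(repr(expr).encode("utf-8")).hexdigest()


class ProgramB:
    """Hypothetical consumer that caches results by semantics hash."""

    def __init__(self):
        self._cache = {}
        self.executions = 0

    def _execute(self, expr: ToyExpr) -> str:
        # Placeholder for the expensive computation.
        self.executions += 1
        return f"rows of {expr.table} where {expr.predicate}"

    def run(self, expr: ToyExpr) -> str:
        key = semantics_hash(expr)
        if key not in self._cache:
            self._cache[key] = self._execute(expr)
        return self._cache[key]


program_b = ProgramB()
first = program_b.run(ToyExpr("events", "x > 1"))
second = program_b.run(ToyExpr("events", "x > 1"))  # equal expr, new object
```

The second call is served from the cache because the two expressions are equal and therefore hash identically, which is the property the rest of this issue is about.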
Internal Implementation
Internally, we have effectively implemented a version of semantics hash on ibis nodes. Our implementation basically uses cloudpickle to serialize the object and then hashlib.sha256 to hash the bytes. This has worked well for us, but because it is maintained outside of ibis, it breaks easily. It recently broke when _safe_name was changed from a property to a cached property, because the same object now produces different pickled bytes depending on whether _safe_name has been accessed.
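The failure mode described above can be reproduced in isolation. The sketch below uses the standard-library `pickle` in place of cloudpickle and a hypothetical `ToyNode` class in place of a real ibis node; only the mechanism (a cached property landing in the instance `__dict__`) is meant to match.

```python
import hashlib
import pickle
from functools import cached_property


class ToyNode:
    """Hypothetical stand-in for an ibis node."""

    def __init__(self, name):
        self.name = name

    @cached_property
    def safe_name(self):
        # On first access the result is stored in the instance __dict__.
        return self.name.replace(" ", "_")


def naive_hash(obj) -> str:
    # Serialize, then hash the bytes (pickle stands in for cloudpickle).
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


node = ToyNode("my column")
hash_before = naive_hash(node)
_ = node.safe_name  # populates the cache as a side effect
hash_after = naive_hash(node)
```

Here `hash_before` and `hash_after` differ even though the node is logically unchanged, because default pickling serializes the whole instance `__dict__`, cached value included.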
Proposed change in Ibis
I think it makes sense to add this feature to ibis, both for easier maintenance on our side and to benefit other projects that use ibis. The main change I'd like to propose is to support stable pickling for ibis exprs. By stable pickling I mean that ibis exprs that are equal should produce the same bytes when pickled (assuming the same cloudpickle version). Supporting semantics hashing is then relatively easy: just hash the bytes.
At a minimum, I'd like ibis to support stable pickling for exprs, meaning that cached properties and attrs (such as _cached_expr) don't affect the serialized bytes, so that we can build the semantics hash internally on top of that. Considering that ibis exprs/ops are already immutable and hashable, stable pickling or a semantics hash seems like a natural addition.
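One way stable pickling could look, as a minimal sketch rather than a proposal for the actual ibis implementation: the class excludes cached attributes from its pickled state. `StableNode` and its `_cached_attrs` tuple are hypothetical, and the standard-library `pickle` stands in for cloudpickle.

```python
import hashlib
import pickle
from functools import cached_property


class StableNode:
    """Hypothetical node whose pickled bytes ignore cached values."""

    # Assumption: the class knows which attributes are caches only.
    _cached_attrs = ("safe_name",)

    def __init__(self, name):
        self.name = name

    @cached_property
    def safe_name(self):
        return self.name.replace(" ", "_")

    def __getstate__(self):
        # Keep only real state, in sorted key order, dropping cached values.
        return {
            key: value
            for key, value in sorted(vars(self).items())
            if key not in self._cached_attrs
        }

    def __setstate__(self, state):
        # Caches are rebuilt lazily after unpickling.
        self.__dict__.update(state)


def semantics_hash(node) -> str:
    return hashlib.sha256(pickle.dumps(node)).hexdigest()


fresh = StableNode("my column")
warmed = StableNode("my column")
_ = warmed.safe_name  # populate the cache on one of the two equal nodes
```

With this in place `semantics_hash(fresh)` and `semantics_hash(warmed)` agree, and a node restored with `pickle.loads` recomputes `safe_name` on demand.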
We are happy to take on this work, since we already have pretty much all of this implemented internally.
Thoughts? @cpcloud
cc @emilyreff
A stable hash within a single Ibis version makes a lot of sense. Does it really have to be stable across Ibis versions though? That seems like it would be unnecessarily restrictive for many changes (and would not be guaranteed stable across major versions anyway).
It doesn't need to be stable across ibis versions (at least for our use cases). A hash is usually used for a few days up to 1-2 weeks. Even if the hash does change, it is not a big deal: we can always recompute, which wastes some computation but still gives the correct result. I think a semantics hash is best-effort rather than a guarantee.
What changes would need to be made to ibis to support this?
I don't think there will be a lot.
I think some changes need to be made to the ser/de of ibis exprs/operations (basically ignoring all cached attrs like _hash), making sure that things like the cached_property on _safe_name don't affect pickling, plus tests.
Would hashing Node.getstate() be sufficient?
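A rough sketch of what hashing node state directly might look like, bypassing pickle entirely. It assumes the state is the tuple of a node's declared field values (the thread does not spell out what getstate returns); `Column`, `Add`, `state_of`, and `hash_node` are hypothetical names, not ibis APIs.

```python
import hashlib
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical leaf node."""

    name: str


@dataclass(frozen=True)
class Add:
    """Hypothetical binary node with two child nodes."""

    left: object
    right: object


def state_of(node) -> tuple:
    # Assumed notion of state: declared field values only, no caches.
    return tuple(getattr(node, field.name) for field in fields(node))


def hash_node(node) -> str:
    digest = hashlib.sha256()
    # Include the node type so different ops with equal state differ.
    digest.update(type(node).__name__.encode("utf-8"))
    for value in state_of(node):
        digest.update(b"\x00")  # separator between fields
        if is_dataclass(value) and not isinstance(value, type):
            # Child node: fold in its hash recursively.
            digest.update(hash_node(value).encode("utf-8"))
        else:
            digest.update(repr(value).encode("utf-8"))
    return digest.hexdigest()


expr_a = Add(Column("a"), Column("b"))
expr_b = Add(Column("a"), Column("b"))
swapped = Add(Column("b"), Column("a"))
```

Equal trees such as `expr_a` and `expr_b` hash identically while `swapped` does not. Whether this is sufficient in practice would depend on every value reachable from the state having a deterministic representation.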
This could be interesting, but I'm not sure what needs to be done and the issue has been open without movement for quite a while.
Closing as stale.