rstar
rstar copied to clipboard
Make node constructors public, add RTree::new_from_root
- [x] I agree to follow the project's code of conduct.
- [x] I added an entry to
rstar/CHANGELOG.md
if knowledge of this change could be valuable to users.
This PR makes some internal APIs public to allow users to implement custom tree serialization and deserialization. I am currently using this functionality to save and load large rstar RTrees to a custom compact representation in order to deliver them over the network.
While there is no unsafe
code in this crate, should this document some of the expected invariants for these constructors which should be upheld to avoid unexpected panics down the road?
@adamreichold Thanks, completely forgot about that. I've added some warnings to the documentation of RTree::new_from_root
to dissuade users from using this method to load custom/modified/spliced trees. I've also adding a warning that the trees may not be compatible between different versions of this crate. This should prevent this method from becoming an additional maintenance burden.
I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?
I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?
I am happy with the state of the changes but I am still undecided whether we should merge this at all. We did never discuss alternatives, for example driving custom Serializer
and Deserializer
implementations using the existing serde
support. But at the end of the day, I don't want to stand in the way as we pressed for time as it is and I fear the above discussion will not happen just due to this.
In that case I'm happy to continue to let it bake for now.
I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?
I am happy with the state of the changes but I am still undecided whether we should merge this at all. We did never discuss alternatives, for example driving custom Serializer and Deserializer implementations using the existing serde support. But at the end of the day, I don't want to stand in the way as we pressed for time as it is and I fear the above discussion will not happen just due to this.
Sorry for the late reply.
I don't really understand how one would provide a custom implementation for Serialize
and Deserialize
without the changes in this PR, since creating a custom Deserialize
implementation for a wrapper type would still require a function to construct an RTree
from the deserialized root node, or am I missing something?
I'm also for merging this as is; it helps a reasonable use-case mentioned above.
@nickguletskii could you provide us more details on the compact representation that required custom serialization?
I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?
My comment was alluding to the serializer, not the Serialize
impl, i.e. if your custom format is implement as a Serde format, then you can just plug into the existing Serialize
impl of RTree
. This might not be possible but that is hard to say without knowing more about the format. Note that I do not suggest you implement a generic Serde format that could be published separately, just that implement enough of Serde to use the existing interfaces.
I'm also for merging this as is; it helps a reasonable use-case mentioned above.
@nickguletskii could you provide us more details on the compact representation that required custom serialization?
Sure! I'm storing leafs separately from the tree inside an arena, and serializing only the topology of the tree (naively or using a something like a Prüfer code), without the AABBs. During deserialization, I reconstruct the AABBs on-the-fly. This significantly reduces the size of the tree when transferred over the network, and reduces the CPU time and allocations necessary to load it in comparison to performing bulk_load
on the source data.
During deserialization, I reconstruct the AABBs on-the-fly.
This sounds like something we'd want to integrate into the generic Serde implementation.
I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?
My comment was alluding to the serializer, not the
Serialize
impl, i.e. if your custom format is implement as a Serde format, then you can just plug into the existingSerialize
impl ofRTree
. This might not be possible but that is hard to say without knowing more about the format. Note that I do not suggest you implement a generic Serde format that could be published separately, just that implement enough of Serde to use the existing interfaces.
I think it should be possible to implement a Deserializer
that only implements deserialize_struct
and deserialize_u64
to deserialize the RTreeStruct
. However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root
function is doing. Besides, what if someone is using something like rkyv instead of serde? serde support is behind a feature flag in this crate.
This sounds like something we'd want to integrate into the generic Serde implementation.
I'll try to find some time to contribute and benchmark a custom implementation of Serialize
and Deserialize
for RTree
that skip AABBs.
However, I would still appreciate new_from_root
being upstreamed because removing AABBs from the serialized trees is still only a part of what I'm doing. The other optimizations are probably too domain-specific and integrating them would be impractical and involve a lot of breaking changes.
However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing.
I can see merit to this argument and I also did not write that we must go for a custom deserializer, just that we should discuss alternatives before committing to an API which basically makes no guarantees and needs two warning notes.
Consider the description of your approach in https://github.com/georust/rstar/pull/109#issuecomment-1589814951, do you think it at all possible to implement a safe (in the sense domain correctness) API which takes the topology (in some suitable representation), recomputes the AABB of parent nodes and the fly and necessarily produce a valid r* tree or fails?
(As an example of the direction I am thinking about, you could probably pass a root node where all envelopes in all parent nodes are empty and recompute those envelopes in the constructor which would at least ensure that the envelopes stored in the tree are consistent without significant overhead compared to computing them upfront and the constructor relying on them being correct. Of course, I am not sure if it possible to validate topology in similar manner with reasonable efficiency.)
However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing.
I can see merit to this argument and I also did not write that we must go for a custom deserializer, just that we should discuss alternatives before committing to an API which basically makes no guarantees and needs two warning notes.
Consider the description of your approach in #109 (comment), do you think it at all possible to implement a safe (in the sense domain correctness) API which takes the topology (in some suitable representation), recomputes the AABB of parent nodes and the fly and necessarily produce a valid r* tree or fails?
(As an example of the direction I am thinking about, you could probably pass a root node where all envelopes in all parent nodes are empty and recompute those envelopes in the constructor which would at least ensure that the envelopes stored in the tree are consistent without significant overhead compared to computing them upfront and the constructor relying on them being correct. Of course, I am not sure if it possible to validate topology in similar manner with reasonable efficiency.)
The current API proposed in this PR already doesn't allow the user to provide bogus envelopes for parent nodes, since ParentNode::new_parent
calculates its envelope on its own. This part is actually safer than, for example, deserializing a tree from serde right now, because the deserializer copies the serialized, potentially bogus envelope without recalculating it from the children. The warning about AABBs in the documentation warns the user against changing the leaf envelopes, e.g. reusing the tree after rotating the geometry indexed by the RTree. There's no way to verify that this hasn't happened, just like there's no way to verify that the user has implemented RTreeObject::envelope
correctly (i.e. it remains immutable after the object is inserted into a tree).
The tricky part is checking the RTree invariants. We could always call something like the sanity check used in the tests (sans the envelope checks), but that would be an extra tree traversal.
Alternatively, we could make a node builder struct that would keep track of node subtree depth and verify the child counts. This is how I would probably do this if I wanted to allow the user to build their own RTrees manually. That, however, was not the intended purpose of the proposed APIs - I just wanted a way to load an R* tree generated by rstar from my own custom format with minimal overhead, akin to what is possible through serde, which is why the warnings mention that the tree must originate from the current version of rstar. If you want, I can probably find time to make a prototype for the node builder, but I think that adding it would imply that arbitrary trees satisfying the R* tree invariants are supported by this library.
That, however, was not the intended purpose of the proposed APIs - I just wanted a way to load an R* tree generated by rstar from my own custom format with minimal overhead, akin to what is possible through serde, which is why the warnings mention that the tree must originate from the current version of rstar.
I understand that, but we have to consider what happens when such an API is made available, e.g. what kind of bug reports we will be getting by opening this door. There are costs to you-are-holding-it-wrong API and hence I have a strong desire for safer alternatives.
Especially since thinking back when this was an unsafe
API, there are already ways to circumvent the type system which do not need upstream support:
struct MyRTreeLookalike {
root: rstar::ParentNode<MyObject>,
size: usize,
}
let tree = unsafe { std::mem::transmute::<rstar::RTree<MyObject, rstar::DefaultParams>>(MyRTreeLookalike { root, size }) };
(assuming that new_root
and new_parent
were public).
Of course, the ergonomics of this are horrible as you now have to ensure that MyRTreeLookalike
actually stays layout-compatible with the actual RTree
type, but the problem does stay local to the crate in question.
Of course, at the end of day, I will probably just throw in the towel and stop arguing against merging this, because as you point out even if we do add a safe API to build up arbitrary trees, then
If you want, I can probably find time to make a prototype for the node builder, but I think that adding it would imply that arbitrary trees satisfying the R* tree invariants are supported by this library.
which means that we still have to be very very careful if we ever write any unsafe code in this crate to rely on nothing but the invariants that we can actually check through these API which might still be too high a cost when taking into account that someone has to build them. (I think it would be unfair to ask that of you because this was not what you were trying to achieve.)
Stepping back, it looks like the std data-structs like btreemap are indeed being serialized as a sequence, and not the tree itself. Ideally we should switch to the packed serialization suggested in this PR. This would need:
- pre-order traversal of the leaf nodes, to make a vector
- store the split indices / equivalent info of the tree-hierarchy to fast-load
- provide pub interface to fast-load an r-tree from (Vec, split-indices)
@nickguletskii Do you think you can contribute this / something similar? Would be quite useful for many others I believe. Would also circumevent the issues pointed above with wanting to make internal functions pub.
On the pub issue, how about an unsafe pub
wrapper around an internal pub(crate)
; this way, we'll only need one unsafe
usage. May be we could try to just allow(unsafe_code)
in one module if that's possible.
Stepping back, it looks like the std data-structs like btreemap are indeed being serialized as a sequence, and not the tree itself. Ideally we should switch to the packed serialization suggested in this PR. This would need:
I think what you propose would certainly be nice as an efficency improvement to our internal serialization support, but I think it goes beyond the linked implementations as those just build up the tree structures upon deserialization using the public API which has considerable overhead. I think the equivalent would be using bulk_load
for deserialization which is not what is requested here.
Finally, the API taking the vector of leaves and split indices basically has the same issue as the one proposed here if made public. It is not clear if the resulting tree is a valid R* tree and therefore which invariants other code in this crate can assume when operating on this tree.
On the pub issue, how about an unsafe pub wrapper around an internal pub(crate); this way, we'll only need one unsafe usage. May be we could try to just allow(unsafe_code) in one module if that's possible.
I'll just withdraw my objections. As long as we have no unsafe
code elsewhere, the possible fallout of this is random panics or incorrect results when someone constructs an invalid tree using this API. When we want to use unsafe
code in the implementation, we should probably reconsider whether to mark this function unsafe
, but then we should also be able to explicitly describe the properties it must uphold so that our unsafe
code can rely on them.