rstar Make node constructors public, add RTree::new_from

[x] I agree to follow the project's code of conduct.
[x] I added an entry to rstar/CHANGELOG.md if knowledge of this change could be valuable to users.

This PR makes some internal APIs public to allow users to implement custom tree serialization and deserialization. I am currently using this functionality to save and load large rstar RTrees to a custom compact representation in order to deliver them over the network.

Jan 27 '23 11:01 nickguletskii

While there is no unsafe code in this crate, should this document some of the expected invariants for these constructors which should be upheld to avoid unexpected panics down the road?

Jan 27 '23 13:01 adamreichold

@adamreichold Thanks, completely forgot about that. I've added some warnings to the documentation of RTree::new_from_root to dissuade users from using this method to load custom/modified/spliced trees. I've also adding a warning that the trees may not be compatible between different versions of this crate. This should prevent this method from becoming an additional maintenance burden.

Jan 28 '23 18:01 nickguletskii

I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?

May 29 '23 22:05 urschrei

I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?

I am happy with the state of the changes but I am still undecided whether we should merge this at all. We did never discuss alternatives, for example driving custom Serializer and Deserializer implementations using the existing serde support. But at the end of the day, I don't want to stand in the way as we pressed for time as it is and I fear the above discussion will not happen just due to this.

May 30 '23 04:05 adamreichold

In that case I'm happy to continue to let it bake for now.

May 30 '23 09:05 urschrei

I'd like to merge this and then cut a new release this week. Are people happy with the state of this PR?

I am happy with the state of the changes but I am still undecided whether we should merge this at all. We did never discuss alternatives, for example driving custom Serializer and Deserializer implementations using the existing serde support. But at the end of the day, I don't want to stand in the way as we pressed for time as it is and I fear the above discussion will not happen just due to this.

Sorry for the late reply.

I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?

Jun 13 '23 16:06 nickguletskii

I'm also for merging this as is; it helps a reasonable use-case mentioned above.

@nickguletskii could you provide us more details on the compact representation that required custom serialization?

Jun 13 '23 17:06 rmanoka

I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?

My comment was alluding to the serializer, not the Serialize impl, i.e. if your custom format is implement as a Serde format, then you can just plug into the existing Serialize impl of RTree. This might not be possible but that is hard to say without knowing more about the format. Note that I do not suggest you implement a generic Serde format that could be published separately, just that implement enough of Serde to use the existing interfaces.

Jun 13 '23 17:06 adamreichold

I'm also for merging this as is; it helps a reasonable use-case mentioned above.

@nickguletskii could you provide us more details on the compact representation that required custom serialization?

Sure! I'm storing leafs separately from the tree inside an arena, and serializing only the topology of the tree (naively or using a something like a Prüfer code), without the AABBs. During deserialization, I reconstruct the AABBs on-the-fly. This significantly reduces the size of the tree when transferred over the network, and reduces the CPU time and allocations necessary to load it in comparison to performing bulk_load on the source data.

Jun 13 '23 18:06 nickguletskii

During deserialization, I reconstruct the AABBs on-the-fly.

This sounds like something we'd want to integrate into the generic Serde implementation.

Jun 13 '23 18:06 adamreichold

I don't really understand how one would provide a custom implementation for Serialize and Deserialize without the changes in this PR, since creating a custom Deserialize implementation for a wrapper type would still require a function to construct an RTree from the deserialized root node, or am I missing something?

My comment was alluding to the serializer, not the Serialize impl, i.e. if your custom format is implement as a Serde format, then you can just plug into the existing Serialize impl of RTree. This might not be possible but that is hard to say without knowing more about the format. Note that I do not suggest you implement a generic Serde format that could be published separately, just that implement enough of Serde to use the existing interfaces.

I think it should be possible to implement a Deserializer that only implements deserialize_struct and deserialize_u64 to deserialize the RTreeStruct. However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing. Besides, what if someone is using something like rkyv instead of serde? serde support is behind a feature flag in this crate.

This sounds like something we'd want to integrate into the generic Serde implementation.

I'll try to find some time to contribute and benchmark a custom implementation of Serialize and Deserialize for RTree that skip AABBs.

However, I would still appreciate new_from_root being upstreamed because removing AABBs from the serialized trees is still only a part of what I'm doing. The other optimizations are probably too domain-specific and integrating them would be impractical and involve a lot of breaking changes.

Jun 13 '23 20:06 nickguletskii

However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing.

I can see merit to this argument and I also did not write that we must go for a custom deserializer, just that we should discuss alternatives before committing to an API which basically makes no guarantees and needs two warning notes.

Consider the description of your approach in https://github.com/georust/rstar/pull/109#issuecomment-1589814951, do you think it at all possible to implement a safe (in the sense domain correctness) API which takes the topology (in some suitable representation), recomputes the AABB of parent nodes and the fly and necessarily produce a valid r* tree or fails?

(As an example of the direction I am thinking about, you could probably pass a root node where all envelopes in all parent nodes are empty and recompute those envelopes in the constructor which would at least ensure that the envelopes stored in the tree are consistent without significant overhead compared to computing them upfront and the constructor relying on them being correct. Of course, I am not sure if it possible to validate topology in similar manner with reasonable efficiency.)

Jun 13 '23 20:06 adamreichold

However, that just seems like a hacky and non-obvious way of doing the same thing as what the proposed RTree::new_from_root function is doing.

I can see merit to this argument and I also did not write that we must go for a custom deserializer, just that we should discuss alternatives before committing to an API which basically makes no guarantees and needs two warning notes.

Consider the description of your approach in #109 (comment), do you think it at all possible to implement a safe (in the sense domain correctness) API which takes the topology (in some suitable representation), recomputes the AABB of parent nodes and the fly and necessarily produce a valid r* tree or fails?

(As an example of the direction I am thinking about, you could probably pass a root node where all envelopes in all parent nodes are empty and recompute those envelopes in the constructor which would at least ensure that the envelopes stored in the tree are consistent without significant overhead compared to computing them upfront and the constructor relying on them being correct. Of course, I am not sure if it possible to validate topology in similar manner with reasonable efficiency.)

The current API proposed in this PR already doesn't allow the user to provide bogus envelopes for parent nodes, since ParentNode::new_parent calculates its envelope on its own. This part is actually safer than, for example, deserializing a tree from serde right now, because the deserializer copies the serialized, potentially bogus envelope without recalculating it from the children. The warning about AABBs in the documentation warns the user against changing the leaf envelopes, e.g. reusing the tree after rotating the geometry indexed by the RTree. There's no way to verify that this hasn't happened, just like there's no way to verify that the user has implemented RTreeObject::envelope correctly (i.e. it remains immutable after the object is inserted into a tree).

The tricky part is checking the RTree invariants. We could always call something like the sanity check used in the tests (sans the envelope checks), but that would be an extra tree traversal.

Alternatively, we could make a node builder struct that would keep track of node subtree depth and verify the child counts. This is how I would probably do this if I wanted to allow the user to build their own RTrees manually. That, however, was not the intended purpose of the proposed APIs - I just wanted a way to load an R* tree generated by rstar from my own custom format with minimal overhead, akin to what is possible through serde, which is why the warnings mention that the tree must originate from the current version of rstar. If you want, I can probably find time to make a prototype for the node builder, but I think that adding it would imply that arbitrary trees satisfying the R* tree invariants are supported by this library.

Jun 14 '23 18:06 nickguletskii

That, however, was not the intended purpose of the proposed APIs - I just wanted a way to load an R* tree generated by rstar from my own custom format with minimal overhead, akin to what is possible through serde, which is why the warnings mention that the tree must originate from the current version of rstar.

I understand that, but we have to consider what happens when such an API is made available, e.g. what kind of bug reports we will be getting by opening this door. There are costs to you-are-holding-it-wrong API and hence I have a strong desire for safer alternatives.

Especially since thinking back when this was an unsafe API, there are already ways to circumvent the type system which do not need upstream support:

struct MyRTreeLookalike {
 root: rstar::ParentNode<MyObject>,
 size: usize,
}

let tree = unsafe { std::mem::transmute::<rstar::RTree<MyObject, rstar::DefaultParams>>(MyRTreeLookalike { root, size }) };

(assuming that new_root and new_parent were public).

Of course, the ergonomics of this are horrible as you now have to ensure that MyRTreeLookalike actually stays layout-compatible with the actual RTree type, but the problem does stay local to the crate in question.

Of course, at the end of day, I will probably just throw in the towel and stop arguing against merging this, because as you point out even if we do add a safe API to build up arbitrary trees, then

If you want, I can probably find time to make a prototype for the node builder, but I think that adding it would imply that arbitrary trees satisfying the R* tree invariants are supported by this library.

which means that we still have to be very very careful if we ever write any unsafe code in this crate to rely on nothing but the invariants that we can actually check through these API which might still be too high a cost when taking into account that someone has to build them. (I think it would be unfair to ask that of you because this was not what you were trying to achieve.)

Jun 14 '23 18:06 adamreichold

Stepping back, it looks like the std data-structs like btreemap are indeed being serialized as a sequence, and not the tree itself. Ideally we should switch to the packed serialization suggested in this PR. This would need:

pre-order traversal of the leaf nodes, to make a vector
store the split indices / equivalent info of the tree-hierarchy to fast-load
provide pub interface to fast-load an r-tree from (Vec, split-indices)

@nickguletskii Do you think you can contribute this / something similar? Would be quite useful for many others I believe. Would also circumevent the issues pointed above with wanting to make internal functions pub.

On the pub issue, how about an unsafe pub wrapper around an internal pub(crate); this way, we'll only need one unsafe usage. May be we could try to just allow(unsafe_code) in one module if that's possible.

Jun 15 '23 03:06 rmanoka

Stepping back, it looks like the std data-structs like btreemap are indeed being serialized as a sequence, and not the tree itself. Ideally we should switch to the packed serialization suggested in this PR. This would need:

I think what you propose would certainly be nice as an efficency improvement to our internal serialization support, but I think it goes beyond the linked implementations as those just build up the tree structures upon deserialization using the public API which has considerable overhead. I think the equivalent would be using bulk_load for deserialization which is not what is requested here.

Finally, the API taking the vector of leaves and split indices basically has the same issue as the one proposed here if made public. It is not clear if the resulting tree is a valid R* tree and therefore which invariants other code in this crate can assume when operating on this tree.

On the pub issue, how about an unsafe pub wrapper around an internal pub(crate); this way, we'll only need one unsafe usage. May be we could try to just allow(unsafe_code) in one module if that's possible.

I'll just withdraw my objections. As long as we have no unsafe code elsewhere, the possible fallout of this is random panics or incorrect results when someone constructs an invalid tree using this API. When we want to use unsafe code in the implementation, we should probably reconsider whether to mark this function unsafe, but then we should also be able to explicitly describe the properties it must uphold so that our unsafe code can rely on them.

Jun 15 '23 06:06 adamreichold

rstar
rstar copied to clipboard

Make node constructors public, add RTree::new_from_root

rstar rstar copied to clipboard

Make node constructors public, add RTree::new_from_root

rstar
rstar copied to clipboard