RTree::bulk_load does not comply with MAX_SIZE in higher dimensions
This code, using points with a large number of dimensions,

```rust
let pts = (0..100_000).map(|i| [i as f32; 100]).collect::<Vec<_>>();
let tree = RTree::bulk_load(pts);
dbg!(tree.root().children().len());
```

prints 100_000, which is far above `MAX_SIZE = 6`.
In fact, even with 3 dimensions, some internal nodes can have 2^3 = 8 children:

```rust
let pts = (0..1000).map(|i| [i as f32; 3]).collect::<Vec<_>>();
let tree = RTree::bulk_load(pts);
dbg!(tree.root().children().len());
```
Thank you for reporting.
If I understand correctly, `MAX_SIZE` is intended to be the largest any node can ever be, so this looks like a bug in `bulk_load_recursive`. The node sizes it produces seem to grow exponentially with the number of dimensions in the input.
Note this is true even for 2-D points:

```rust
#[test]
fn test_2d_bulk_load() {
    let pts = (0..25).map(|i| [i as f32; 2]).collect::<Vec<_>>();
    let tree = RTree::bulk_load(pts);
    // > tree.root().children().len() = 9
    dbg!(tree.root().children().len());
    assert!(tree.root().children().len() <= DefaultParams::MAX_SIZE);
}
```
Based on observing behavior with printf, it seems like currently the root can have up to ceil(MAX_SIZE^(1/dimensions))^dimensions children:
- 2d: ceil(6^(1/2))^2 = 3^2 = 9
- 3d: ceil(6^(1/3))^3 = 2^3 = 8
- 4d: ceil(6^(1/4))^4 = 2^4 = 16
- ...
- 100d: ceil(6^(1/100))^100 = 2^100 ≈ 1.27e30 😬
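For what it's worth, that back-of-the-envelope formula can be checked directly. This is a standalone sketch of my reconstruction; `predicted_root_fanout` is just a name I made up, not an rstar API:

```rust
/// Predicted upper bound on root fanout under the current partitioning,
/// as I reconstruct it: ceil(MAX_SIZE^(1/d)) clusters per axis, raised
/// to the d-th power. This is a guess from observed behavior, not rstar code.
fn predicted_root_fanout(max_size: usize, dims: u32) -> f64 {
    let per_axis = (max_size as f64).powf(1.0 / dims as f64).ceil();
    per_axis.powi(dims as i32)
}

fn main() {
    assert_eq!(predicted_root_fanout(6, 2), 9.0);  // matches the 2-D test above
    assert_eq!(predicted_root_fanout(6, 3), 8.0);  // matches the 3-D repro
    assert_eq!(predicted_root_fanout(6, 4), 16.0);
    // 100 dimensions: 2^100 ≈ 1.27e30 potential children
    println!("{:e}", predicted_root_fanout(6, 100));
}
```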
Mostly just some notes at this point: bulk_load uses the OMT strategy for loading.
(Disclaimer: I'm not a very frequent contributor to this crate, and it's my first time looking at that paper)
The implementation in the paper is specific to 2-d points.
My understanding is that the OMT loading strategy should generalize to higher dimensions, and remain efficient for querying there too.
I think there are two issues:
- A simple transcription bug: https://github.com/georust/rstar/blob/4488c142c78e28ee7bc8163f06b26642db5d3958/rstar/src/algorithm/bulk_load/cluster_group_iterator.rs#L66
  I think the call to `ceil` on that line should be `floor`, which is a simple fix, and that change alone resolves the broken 2-D behavior I posted upthread. It doesn't, however, resolve the issue with larger dimensions.
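Concretely, under my reconstruction of the partitioning math, the rounding mode is exactly what pushes 2-D nodes over the limit. This is an illustrative sketch with invented names, not the actual rstar code:

```rust
/// Per-axis cluster count for a 2-D bulk load with the given node capacity,
/// parameterized on the rounding mode (illustrative sketch, not rstar's code).
fn per_axis_clusters(max_size: f64, use_ceil: bool) -> u32 {
    let raw = max_size.sqrt(); // 6^(1/2) ≈ 2.449 for the default params
    if use_ceil {
        raw.ceil() as u32
    } else {
        raw.floor() as u32
    }
}

fn main() {
    // ceil: 3 clusters per axis -> 3^2 = 9 children, violating MAX_SIZE = 6
    assert_eq!(per_axis_clusters(6.0, true).pow(2), 9);
    // floor: 2 clusters per axis -> 2^2 = 4 children, within MAX_SIZE
    assert_eq!(per_axis_clusters(6.0, false).pow(2), 4);
}
```

This would also explain the observed `tree.root().children().len() = 9` in the failing 2-D test.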
- The trickier issue, as I understand it (and I'm not entirely sure that I do): for OMT to work, we need to be able to partition along each dimensional axis, which implies the maximum node size must be at least 2^num_dimensions. That in turn implies OMT is not a suitable loading algorithm for very high dimensional data.
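To make that constraint concrete, here is a small sketch under my reading of OMT (unverified against the paper): splitting every one of `d` axes at least in two requires 2^d child slots, which outgrows the default `MAX_SIZE = 6` almost immediately.

```rust
/// Minimum node capacity needed to split each of `d` axes at least in two:
/// 2^d child slots (my reading of the OMT constraint, not verified).
fn min_capacity_for_axis_splits(d: u32) -> u128 {
    2u128.pow(d)
}

fn main() {
    let max_size: u128 = 6; // DefaultParams::MAX_SIZE
    for d in 1..=7 {
        let needed = min_capacity_for_axis_splits(d);
        println!("d = {d}: needs {needed} slots, fits in MAX_SIZE = 6: {}", needed <= max_size);
    }
    // Already at d = 3, 2^3 = 8 exceeds the default MAX_SIZE of 6.
    assert!(min_capacity_for_axis_splits(2) <= max_size);
    assert!(min_capacity_for_axis_splits(3) > max_size);
}
```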
Can anyone more knowledgeable weigh in?
And only somewhat relatedly, the default node size seems pretty small. Especially so if I'm understanding correctly that OMT effectively rounds it down to the nearest power of 2, which means nodes will initially have only 4 children.
If I'm correct about the transcription error, fixing it will effectively shrink the maximum node size, so we should consider raising the default max node size correspondingly.