rstar icon indicating copy to clipboard operation
rstar copied to clipboard

RTree::bulk_load does not comply with MAX_SIZE in higher dimensions

Open AlphaNow37 opened this issue 6 months ago • 4 comments

This code, using points with a lot of dimensions

let pts = (0..100_000).map(|i| [i as f32; 100]).collect::<Vec<_>>();
let tree = RTree::bulk_load(pts);
dbg!(tree.root().children().len());

prints 100_000 which is far above MAX_SIZE=6

in fact, even with 3 dimensions, there can be 2^3 = 8 children in some internal nodes:

let pts = (0..1000).map(|i| [i as f32; 3]).collect::<Vec<_>>();
let tree = RTree::bulk_load(pts);
dbg!(tree.root().children().len());

AlphaNow37 avatar Jul 13 '25 21:07 AlphaNow37

Thank you for reporting.

If I understand correctly, MAX_SIZE is intended to be the largest any node can ever be, it seems like a bug in bulk_load_recursive. The node sizes output from that seem to grow exponentially as the number of dimensions in the input increases.

Note this is true even for 2-D points

   #[test]
    fn test_2d_bulk_load() {
        let pts = (0..25).map(|i| [i as f32; 2]).collect::<Vec<_>>();
        let tree = RTree::bulk_load(pts);
        // > tree.root().children().len() = 9
        dbg!(tree.root().children().len());
        assert!(tree.root().children().len() <= DefaultParams::MAX_SIZE);
    }

michaelkirk avatar Aug 12 '25 22:08 michaelkirk

Based on observing behavior with printf, it seems like currently the root can have up to: (MAX_SIZE / dimensions) ^ dimensions children:

  • 2d: ceil(6 / 2)^2 = 9
  • 3d: ceil(6 / 3)^3 = 8
  • 4d: ceil(6 / 4)^4 = 16
  • ...
  • 100d: ceil(6 / 100)^100 = 1.27e30 😬

michaelkirk avatar Aug 12 '25 22:08 michaelkirk

Mostly just some notes at this point: bulk_load uses the OMT strategy for loading.

(Disclaimer: I'm not a very frequent contributor to this crate, and it's my first time looking at that paper)

The implementation in the paper is specific to 2-d points.

My understanding is that, the OMT loading strategy should generalize, and be efficient for querying in higher dimensions too.

I think there are two issues:

  1. A simple transcription bug: https://github.com/georust/rstar/blob/4488c142c78e28ee7bc8163f06b26642db5d3958/rstar/src/algorithm/bulk_load/cluster_group_iterator.rs#L66

I think that corresponds to this line: Image

I think the call to ceil should be floor, which is a simple fix, and that change alone resolves the broken 2D behavior I posted upthread. But it doesn't resolve the issue with larger dimensions.

  1. The trickier issue, as I understand it (and I'm actually not entirely sure that I do), I think we need to be able to partition along each dimensional axis for OMT to work, which implies that the minimum node size must be at least 2^num_dimensions. Which further implies that OMT is not a suitable loading algorithm for very high dimensional data.

Can anyone more knowledgable weigh in?

michaelkirk avatar Aug 13 '25 00:08 michaelkirk

And only somewhat relatedly, the default node size seems pretty small - especially if I'm understanding correctly and OMT effectively rounds it to the nearest power of 2, that means that initially nodes will only have 4 children.

If I'm correct about the transcription error and we address it, it will result in effectively smaller "max node size", and so we should consider correspondingly updating the default max node size.

michaelkirk avatar Aug 13 '25 00:08 michaelkirk