Querying geographies (spatial index)
Similarly to shapely.STRTree, it would be nice if we could provide a way in spherely to create a spatial index of a (large) collection of geographies.
A few thoughts:
-
For a collection of single points, s2geometry provides a
S2PointIndex. Actually, I'd be happy to move https://github.com/benbovy/pys2index here, although I think it is useful to also expose it via a simple API like scipy.spatial.cKDTree that is not based on simple features. -
For a collection of arbitrary geometries, would it make sense to reuse
S2ShapeIndex? Which itself can be viewed as a quad-tree... ~~I don't thinkMutableS2ShapeIndexwill be very helpful since IIUC it takes ownership of the S2Shape objects added to the index~~ (EDIT: I misunderstoodS2Polyline::Shape,S2Polygon::Shape, etc. that are wrapper classes that do not take ownership of the wrapped S2 objects), but what aboutEncodedS2ShapeIndex? (although still not ideal to duplicate information in the index). Here we want to retrieve (query) s2geography geographies instead of s2geometry shapes, maybe s2geometry's tagging system would make it possible? If we don't allow adding or removing geographies after creating the index, maybe we could use the indices of the array of geographies as tags? @paleolimbot do you think https://github.com/paleolimbot/s2geography/pull/40 would help here?
It's probably not very efficient, but there is the GeographyIndex:
https://github.com/paleolimbot/s2geography/blob/21400cfc864203db7bb536f37bdbed3682e4140f/src/s2geography/index.h#L18-L23
You can see it in action in the R package:
https://github.com/r-spatial/s2/blob/b495b0df53bffc7dc1ad3780fe8ac208e78cd1bf/src/s2-matrix.cpp#L19
For points I think you do just want the point index (although I don't know if it will be faster to loop once and create the GeographyIndex or loop twice to check if everything is, in fact a point).
Ideally we'd improve on the GeographyIndex by only considering the cell covering of the input (rather than every edge of the input). Then you can build the index using multiple threads or workers (multi thread the calculation of the covering part and pickling/shipping the cell unions between processes is probably cheap). I am not sure if there's something built-in for this, whether we can use the MutableS2ShapeIndex for this, or whether we would have to roll our own.
It would be nice to be able to serialize the index, and if we can somehow use the MutableS2ShapeIndex to do that we get the ability to encode it for free. Serializing the index is a massively beneficial feature (you can build the index once and ship it to a bunch of worker processes or put it in a file), plus S2's encoded shape index is sufficiently lazy that you could probably memory map it and query it without using much memory.
Ah yes I forgot about s2geography::GeographyIndex, which is exactly what we need.
It would also be nice of s2geography::GeographyIndex could provide a high-level query API (e.g., by predicate or distance), using S2ContainsPointQuery, S2ClosestEdgeQuery, etc. under the hood. I.e., port things that you actually implemented in R s2.
Opened https://github.com/paleolimbot/s2geography/issues/45 as it probably makes more sense to continue the discussion about s2geography::GeographyIndex there.