Geo-search
- Regarding #44.
This pull request adds spatial indexing to Tantivy using block kd-trees. The implementation tessellates polygons into triangles and indexes the triangles in the block kd-tree.
Search supports three query types: "intersects" finds documents whose geometries overlap a query rectangle, "within" finds documents whose geometries fall entirely inside the query bounds, and "contains" finds documents whose geometries completely enclose the query region. I've attempted to follow Tantivy's established patterns. Spatial fields use a new FieldType::Spatial variant. I used PreTokenizedString as a go-by.
The user-provided f64 geographic coordinates are serialized using XOR compression that exploits spatial locality. When consecutive vertices are close together, their bit representations share high-order bits, and XOR exposes this redundancy as zeros that varint encoding can compress. The compression falls back to uncompressed storage when the data is incompressible. I looked at alternatives; bitshuffle+zstd might be worthwhile for large polygons, but small polygons would be smaller than a zstd block. XOR is simple and a good place to start.
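To make the scheme concrete, here is a minimal sketch of XOR-plus-varint coordinate compression. The function names and the exact byte layout are illustrative, not the PR's actual format; the PR also has an uncompressed fallback that this sketch omits.

```rust
// Each f64 is XORed with its predecessor. Nearby coordinates share
// high-order bits, so the XOR result is a small u64 that a LEB128-style
// varint encodes in few bytes.

fn compress_xor(values: &[f64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev: u64 = 0;
    for &v in values {
        let bits = v.to_bits();
        let mut delta = bits ^ prev; // zeros wherever bits are shared
        prev = bits;
        // 7 bits per byte, high bit set means "more bytes follow".
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

fn decompress_xor(mut bytes: &[u8], count: usize) -> Vec<f64> {
    let mut out = Vec::with_capacity(count);
    let mut prev: u64 = 0;
    for _ in 0..count {
        let mut delta: u64 = 0;
        let mut shift = 0;
        loop {
            let byte = bytes[0];
            bytes = &bytes[1..];
            delta |= ((byte & 0x7f) as u64) << shift;
            shift += 7;
            if byte & 0x80 == 0 {
                break;
            }
        }
        prev ^= delta; // undo the XOR chain
        out.push(f64::from_bits(prev));
    }
    out
}
```

The roundtrip is exact because XOR is its own inverse; no precision is lost on the f64 bit patterns.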
The block kd-tree implementation uses bulk-loaded immutable construction with recursive median partitioning, creating a somewhat balanced block kd-tree with 512 triangles per leaf. The tree stores triangles: polygons are tessellated into triangles prior to indexing, and the f64 coordinates are converted to i32 prior to tessellation, so the block kd-tree stores its dimensions as i32. Triangles are stored in an encoded format that places the triangle's bounding box in the first four words, followed by the spare x/y, the boundary flags, and finally the doc_id. The boundary flags indicate whether a side of the triangle shares a boundary with the tessellated polygon and are used for "within" and "contains" testing.
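A rough picture of that record layout, with field names of my own choosing (the PR's actual on-disk encoding may differ):

```rust
// Bounding box first so a leaf scan can reject a triangle without
// decoding the rest of the record.
#[derive(Debug, Clone, Copy, PartialEq)]
struct EncodedTriangle {
    min_x: i32,
    min_y: i32,
    max_x: i32,
    max_y: i32,
    // The "spare" coordinates: the corner not implied by the bbox extremes.
    spare_x: i32,
    spare_y: i32,
    // One bit per edge: does this edge lie on the original polygon boundary?
    boundary_flags: u8,
    doc_id: u32,
}

impl EncodedTriangle {
    /// Cheap rejection test used while scanning a leaf.
    fn bbox_intersects(&self, min_x: i32, min_y: i32, max_x: i32, max_y: i32) -> bool {
        self.min_x <= max_x && self.max_x >= min_x
            && self.min_y <= max_y && self.max_y >= min_y
    }
}
```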
The tree construction uses Rust's standard library select_nth_unstable_by_key for median partitioning, operating directly on &mut [Triangle] slices. Tree construction is a simple recursive descent based around this standard library function. The recursive descent receives slices of progressively smaller sub-ranges until reaching leaf size. This approach works identically whether the triangles come from a Vec during initial indexing or from a memory-mapped file during merge.
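The recursive descent can be sketched in a few lines, assuming a simplified Triangle with a precomputed center per axis (the real code works on the encoded records and picks the split axis by spread, not by depth):

```rust
#[derive(Debug, Clone, Copy)]
struct Triangle {
    center: [i32; 2],
}

const LEAF_SIZE: usize = 4; // the PR uses 512; small here for illustration

// `leaves` collects leaf sizes so the sketch has something observable.
fn build(triangles: &mut [Triangle], depth: usize, leaves: &mut Vec<usize>) {
    if triangles.len() <= LEAF_SIZE {
        leaves.push(triangles.len());
        return;
    }
    let axis = depth % 2; // stand-in for the widest-spread choice
    let mid = triangles.len() / 2;
    // Median partition in place: everything left of `mid` compares <= the
    // median on this axis. No allocation, works on any &mut [Triangle].
    triangles.select_nth_unstable_by_key(mid, |t| t.center[axis]);
    let (left, right) = triangles.split_at_mut(mid);
    build(left, depth + 1, leaves);
    build(right, depth + 1, leaves);
}
```

Because the function only ever sees a `&mut [Triangle]`, the same construction runs unchanged against a Vec or a memory-mapped file.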
Merge is implemented by writing these Triangle structures into a temporary file, memory-mapping the temporary file, and then performing an unsafe cast of the memory to &mut [Triangle]. Because the file is memory-mapped, the select_nth_unstable_by_key tree construction can rely on the operating system to manage the memory when merging large segments.
If you are zealously opposed to unsafe code, then the &mut [Triangle] cast and the select_nth_unstable_by_key construction would have to be rewritten, but I don't see how it is a problem. The Vec<Triangle> path is entirely safe Rust. The temporary file is a transient step in the block kd-tree merge only: it is not a persistent file, so it does not have to have the same endianness on different architectures. A compile switch for endianness could be added if ever needed.
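For reference, the cast looks roughly like the following. Here a plain byte buffer stands in for the memory-mapped file (the mmap case is the same cast); this is a hedged sketch of the soundness conditions, not the PR's exact code. It is only sound because the records are #[repr(C)], the buffer is properly aligned, and its length is a whole number of records, which is also why the temporary file need not be portable.

```rust
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Triangle {
    bbox: [i32; 4],
    doc_id: u32,
}

fn as_triangles(bytes: &mut [u8]) -> &mut [Triangle] {
    let size = std::mem::size_of::<Triangle>();
    assert_eq!(bytes.len() % size, 0, "length must be a whole number of records");
    assert_eq!(
        bytes.as_ptr() as usize % std::mem::align_of::<Triangle>(),
        0,
        "buffer must be aligned for Triangle"
    );
    // SAFETY: size, alignment, and exclusive access checked above;
    // every bit pattern is a valid Triangle (all-integer fields).
    unsafe {
        std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut Triangle, bytes.len() / size)
    }
}
```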
This implementation is certainly Lucene-inspired. The triangle encoding is a 1:1 port, with the exception of replacing orient with the robust crate's implementation of Shewchuk's 2d orientation predicate, orient2d. The rest is more inspiration than port. Porting from Lucene 1:1 is not sensible: Lucene needs to operate on mapped memory via Java arrays, and it goes to great lengths to avoid creating POJOs to keep the garbage collector out of the picture. The Rust implementation can read more like Rust.
The f64-to-i32 conversion is Lucene-inspired, giving roughly centimeter precision. The bulk-loaded partition construction is Lucene-inspired, but the in-place sorting and reuse of the standard library are features of Rust. Delta encoding for i32/u32 is Lucene-inspired. I am currently using a barycentric point-in-triangle test instead of Lucene's orientation tests.
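A point-in-triangle test on the i32 coordinates can be written as a sign-based formulation equivalent to the barycentric approach mentioned above; this sketch is illustrative, not the PR's actual predicate, and it counts points on an edge as inside:

```rust
// All arithmetic in i64 so the cross products cannot overflow i32 inputs.
fn point_in_triangle(p: (i32, i32), a: (i32, i32), b: (i32, i32), c: (i32, i32)) -> bool {
    let sign = |p1: (i32, i32), p2: (i32, i32), p3: (i32, i32)| -> i64 {
        (p1.0 as i64 - p3.0 as i64) * (p2.1 as i64 - p3.1 as i64)
            - (p2.0 as i64 - p3.0 as i64) * (p1.1 as i64 - p3.1 as i64)
    };
    let d1 = sign(p, a, b);
    let d2 = sign(p, b, c);
    let d3 = sign(p, c, a);
    let has_neg = d1 < 0 || d2 < 0 || d3 < 0;
    let has_pos = d1 > 0 || d2 > 0 || d3 > 0;
    // Inside (or on an edge) if all non-zero signs agree, regardless of
    // the triangle's winding order.
    !(has_neg && has_pos)
}
```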
Dependencies added are robust, for orient2d, and i_triangle, for its integer-based Delaunay triangulation that tracks boundary edges.
I have avoided making anything "pluggable" and have bonded tightly to Tantivy and the chosen third-party libraries for simplicity and easy reading.
Above you'll find a discussion of the use of select_nth_unstable_by_key on a &mut [Triangle] obtained by casting a memory-mapped file; please consider it.
Please consider the minimalist Geometry enum. Rather than create yet another object hierarchy of Point, Polygon, BoundingBox, et al., I used underlying representations that can be easily handed off to GeoRust or similar. I didn't want to make GeoRust or GeoJSON parsing a Tantivy dependency. The block kd-tree is intended for bounding box and point queries, not for the full range of queries you would find in PostGIS, like polygon-in-polygon. Such a test could be performed while iterating the search results.
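The shape of such a minimalist enum might look like the following; the variant names and representations here are illustrative, not the PR's exact types. Plain nested arrays of coordinate pairs convert cheaply to GeoRust-style types without pulling in a geometry crate:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Geometry {
    Point([f64; 2]),
    // Outer ring first, then holes.
    Polygon(Vec<Vec<[f64; 2]>>),
    MultiPolygon(Vec<Vec<Vec<[f64; 2]>>>),
    BoundingBox { min: [f64; 2], max: [f64; 2] },
}
```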
Note that the block kd-tree is not going to work with geometries that cross the antimeridian. This is a known issue, and data sets will usually already split a polygon that crosses the antimeridian into a multi-polygon with one polygon on each side. This is even in the GeoJSON spec.
Note that there is a Spatial field and a Geometry type stored in that field. Would it be better to have a Geographic or Geo field since the implementation is geo specific?
As I write this, "contains" and "within" are implemented in the block kd-tree but not exposed through the interfaces. I will implement "disjoint." There are todo!()s to fill in for the Field::Spatial(_) match arms. There are some tree-balancing improvements I'd like to add, and I'd like to put a lot of Geofabrik data through it to see how well it performs. However, it is ready for consideration by the Tantivy maintainers.
Run `cargo run --example geo_json` for a minimal round-trip through the index.
@flatheadmill This is a super interesting PR. A backlog of contributions has accumulated, so it might take some time to get it reviewed. To make things smoother, would you be ok with a quick call to walk me through the code?
Of course. Doesn't have to be quick for my sake. Check your email.
For all the talk of k-dimensions, the end result of writing the tree is a run-of-the-mill binary tree whose search functions are very easy to understand. In the end, we have a binary tree of inner nodes, each with left and right pointers to child nodes and a bounding box. During a search, the bounding box allows us to eliminate sub-trees that do not intersect. The binary tree terminates in leaf nodes that have a bounding box and a pointer to where triangles are stored. If the path you follow intersects your sought bounding box, you scan the triangles referenced by the leaf node.
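The descent described above can be sketched with simple in-memory node types (the PR reads a serialized layout instead, so these types are illustrative):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct BBox { min_x: i32, min_y: i32, max_x: i32, max_y: i32 }

impl BBox {
    fn intersects(&self, o: &BBox) -> bool {
        self.min_x <= o.max_x && self.max_x >= o.min_x
            && self.min_y <= o.max_y && self.max_y >= o.min_y
    }
}

enum Node {
    Inner { bbox: BBox, left: Box<Node>, right: Box<Node> },
    // A leaf references a range into the triangle storage.
    Leaf { bbox: BBox, triangles: std::ops::Range<usize> },
}

// Collect the triangle ranges of every leaf whose bounding box
// intersects the query; non-intersecting subtrees are pruned.
fn search(node: &Node, query: &BBox, hits: &mut Vec<std::ops::Range<usize>>) {
    match node {
        Node::Inner { bbox, left, right } => {
            if bbox.intersects(query) {
                search(left, query, hits);
                search(right, query, hits);
            }
        }
        Node::Leaf { bbox, triangles } => {
            if bbox.intersects(query) {
                hits.push(triangles.clone());
            }
        }
    }
}
```

The caller then scans the returned triangle ranges against the query box.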
Triangles? Yes, the polygons you add to the index are tessellated into triangles. The i_triangle crate used for tessellation in this pull request has an animation of triangulation in its GitHub README. The animation is all you need to understand what we're trying to accomplish. We are tokenizing a polygon, turning it into triangles. We know we've matched a document if we match a triangle in the polygon. Triangles are described with three points and are therefore easy to store. It was not necessary to study tessellation for this implementation; it was enough to choose an implementation that met the requirements: it works on integers and tracks boundary edges.
With the polygon broken down into triangles, we can store each triangle along with the DocId associated with the polygon. We also store which edges the triangle shares with the boundary of the tessellated polygon. We can test if a polygon is within a bounding box by testing whether a boundary edge crosses the bounding box: if it does, the polygon is marked as excluded; if it doesn't, it is included. Take the difference to get the set of documents within the bounding box. For intersection, all we need to know is whether the bounding box overlaps the triangle in any way.
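The include/exclude bookkeeping above amounts to a set difference. A minimal sketch, with an illustrative input shape rather than the PR's actual API:

```rust
use std::collections::HashSet;

// Each entry is one triangle hit: (doc_id, does a boundary-flagged edge
// of this triangle cross outside the query box?).
fn within_docs(matches: &[(u32, bool)]) -> Vec<u32> {
    let mut seen = HashSet::new();
    let mut excluded = HashSet::new();
    for &(doc, boundary_escapes) in matches {
        seen.insert(doc);
        if boundary_escapes {
            excluded.insert(doc);
        }
    }
    // "Within" = matched documents minus documents whose polygon
    // boundary escapes the query box.
    let mut out: Vec<u32> = seen.difference(&excluded).copied().collect();
    out.sort();
    out
}
```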
In this way, this pull request implements an orthogonal range query to search for 2-dimensional shapes within axis-aligned partitions, i.e. bounding-box search.
To grasp the kd-concept quickly, it is enough to see a simple tree and the space it partitions. Otherwise, you could read a Medium article. You will come away with an understanding of how to partition two-dimensional space. You will know how to search for a point in a two-dimensional kd-tree. It's a matter of descending the tree and choosing a path based on the depth of the tree. There are plenty of little examples of this 2-dimensional kd-tree of points in every programming language.
You will be left wondering how to search for a triangle in two-dimensional space. Nowhere will you find a description of how to do this. A search for answers is going to trip over the fact that the word "dimension" is overloaded. The minimal kd-tree stores points in a tree that it describes as 2-dimensional. When we use the term "dimension" to describe a kd-tree, we are describing the number of partitioning dimensions of the tree, not the spatial dimension of the indexed forms.
What I found, what finally clicked, was a single "teach a man to fish" answer to a StackOverflow question. A 4-dimensional point can represent a bounding box, where the partitioning dimensions are the mins and maxes of x and y. I wrote a program to create a kd-tree with 4 dimensions and was able to search for a box with a box "using the range (a0,+inf) x (-inf, a1) x (b0, +inf) x (-inf, b1)".
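Stated as a predicate, the insight is that box overlap is a half-open range test on each of the four coordinates. This sketch orders the dimensions as (min_x, min_y, max_x, max_y) purely for readability; the quoted answer orders them differently:

```rust
// A box is a 4-d point [min_x, min_y, max_x, max_y]. It overlaps the
// query box (a0, b0)-(a1, b1) exactly when its mins fall in (-inf, a1] x
// (-inf, b1] and its maxes fall in [a0, +inf) x [b0, +inf).
fn overlaps_as_4d_range(boxed: [i32; 4], query: [i32; 4]) -> bool {
    let [min_x, min_y, max_x, max_y] = boxed;
    let [a0, b0, a1, b1] = query;
    min_x <= a1 && min_y <= b1 && max_x >= a0 && max_y >= b0
}
```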
Meanwhile, I had been stepping through the Lucene search where I'd see the bounding boxes in the nodes during search and knew that Lucene was depending on the properties of a kd-tree to build the tree, but not to search the tree. It did not search the tree using a 4-dimensional point, it used bounding boxes instead. It would build the bounding boxes as it partitioned the tree.
Another muddle on the way to understanding is the concept of a "block" kd-tree. Seeing BKDReader.java and BKDWriter.java in Lucene implied that Lucene had implemented a BKD-tree as defined in "Bkd-Tree: A Dynamic Scalable kd-Tree," so I assumed that whatever was in the paper was in Lucene. The paper describes an optimization based on creating LSM-like sets of progressively larger kd-trees. Kd-trees are difficult to update, which is what the paper addresses, but Lucene and Tantivy already address this problem with segments: there is only one kd-tree per spatial field per segment, and that tree is rewritten when segments merge.
I never looked at the paper again once I began stepping through the Lucene code. There I saw the bounding box in the kd-tree node used for search, and affirmed my hyper-point-as-bounding-box understanding by seeing that Lucene was indeed indexing the 2-dimensional (spatial) triangle bounding box using a 4-dimensional (partitioning) tree. I saw, too, that a single tree was built by bulk partitioning. I never referenced the original paper myself, but here I am standing on the shoulders of giants, reading the theory in practice, so I'm not the one to ask how the paper influenced the design of Lucene; I can only talk about how Lucene influenced the design of this pull request.
However, since we are not creating an LSM-like structure nor using the serialization strategy presented in the paper, I'm going to rename bkd.rs to kd.rs to put an end to the red herring for future readers. The term "block" in the paper refers to an I/O optimization of the serialized tree structure that is employed by neither this pull request nor Lucene.
That's enough red herrings. The tree is a 4-dimensional kd-tree built with a recursive partitioning of an array of triangles. The 4-dimensions represent the four positions of the two points in a bounding box. We partition our array of triangles on the kd-tree dimension with the widest spread, the greatest distance between the min and max value. We calculate a bounding box to encapsulate the triangles in the array and then call our partition function on the left and right partitions. We create an inner node that has a bounding box and a left and right child reference. When an array is less than a certain size, that's a leaf.
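The widest-spread choice mentioned above can be sketched as follows, treating each triangle as its 4-d bounding-box point (illustrative, not the PR's exact code; it assumes a non-empty slice):

```rust
// Among the four kd dimensions [min_x, min_y, max_x, max_y], pick the
// one with the greatest spread (max value minus min value) across the
// current slice of triangles.
fn widest_dimension(boxes: &[[i32; 4]]) -> usize {
    assert!(!boxes.is_empty());
    let mut best = (0usize, 0i64);
    for dim in 0..4 {
        let min = boxes.iter().map(|b| b[dim]).min().unwrap();
        let max = boxes.iter().map(|b| b[dim]).max().unwrap();
        let spread = (max as i64) - (min as i64); // i64: no i32 overflow
        if spread > best.1 {
            best = (dim, spread);
        }
    }
    best.0
}
```

Splitting on the widest dimension keeps the partitions compact, which tightens the bounding boxes stored in the inner nodes and improves pruning during search.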
If you look at the write_leaf_pages function, you'll see the partitioning. As of this commit, that is all the k-dimensional code, the rest is bounding box based. In fact, kd-tree is probably a misnomer. We're building a binary bounding box tree (a name I just now made up) using k-dimensional partitioning.
P.S. — As you can see, the implementation in this pull request does not implement a generic kd-tree like Lucene. We do not keep the partition dimension in the node, we throw it away and use just the bounding box. Lucene uses its kd-tree for range queries, but I believe Tantivy has "fast fields" for that use case. It has some code to do kNN with kd-trees, but the primary index it uses for kNN is a Hierarchical Navigable Small World index. I'd speculate that the kd-tree kNN was an offering that was superseded by HNSW.
@flatheadmill Hello. I had time to have a deeper look at your PR.
For large PRs like this, I usually modify the code as I look around. You can check that work there: https://github.com/quickwit-oss/tantivy/pull/2755/files
I will put my comments inline, you can also just pick the commits from #2755 when it makes sense to you.
Was I not supposed to cherry-pick your changes into the PR @fulmicoton ? The @fulmicoton-dd user is unexpected.
@flatheadmill Yes if you can cherry-pick that works! Sorry both fulmicoton and fulmicoton-dd are myself. I had to create a separate work account for silly reasons.
@philippemnoel I see that you spotted that PR. One thing that might be missing here is support for GIS projections.
This PR treats geo search using long lat coordinates.
You can expect paranormal phenomena near the poles, etc.
What you have in Postgres uses GIS projections. Maybe we should integrate that in tantivy (behind a feature flag to avoid bringing too many dependencies) or by adding a way to provide the mapping (lat, lon) <-> (x, y) as a trait.
Interesting. Yeah, Geo search has been requested a few times. In Postgres, PostGIS is very popular and we already package it as part of our Docker image, but finding a way to integrate it with Tantivy (or replace its functionality with Tantivy's Geo search) would be desirable to our users: #3719. A feature flag sounds like a great idea.
I've created a demo program with which I've indexed the Geofabrik OSM data for the United States.