tantivy
tantivy copied to clipboard
Geo-search
This would require a longlat field type... (and coords if we want simple 2D as well
Then the user would want to perform queries such as
- return document at a distance of less than X km
- return document within this polygon
Two more practical requirements that I would love to have:
- Store multiple values
- Return which value was found
those two things have a range of usecases that are very practical for me
Curious about the status of this issue. Is this still a desired feature?
I have a need for geo queries in a future project, and would be willing to help implement, although I might need some guidance.
@hwchen Sorry about the late answer
This is a desired feature but this is also a lot of work. I'm thinking of following Lucene idea of using Bkd Tree as it fits well the search indexing scheme.
You can work on it if you are confident, but be careful... This is one of the hardest ticket around and I won't be able to give much guidance.
Thanks, I'll keep this in mind. Maybe in a year or so I'll pick this up if nobody else has :)
Plus one on this feature, very relevant to any local app - Uber Eats, Yelp, Maps, Delivery.
FYI: did a quick search, found one nice bkd tree impl https://github.com/peermaps/eyros
Nice, rucene also has a bkd implementation https://github.com/zhihu/rucene/blob/master/src/core/util/bkd/bkd_writer.rs
FYI: did a quick search, found one nice bkd tree impl https://github.com/peermaps/eyros
Hmm, seems the license of eyros is not compatible with MIT.
Another FYI.
From https://www.elastic.co/blog/lucene-points-6.0, the blog which announced lucene spatial support, there is a quote on future work:
Currently, dimensional values must be single points, but some geo-spatial use cases require indexing shapes as well, which can be done with a generalization of the k-d tree called an R-Tree, for example, and I expect that will be a future improvement to Lucene (patches welcome again!).
And there is a rtree crate called rstar, which provides a good in mem rtree. https://crates.io/crates/rstar
Point search specific methods are easier to implement since they don't require specialized trees and they also perform better. It would make sense to implement point search first as it is probably most commonly used, and implement indexing of shapes at a latter time.
Steps to index geo coordinates (points):
- Convert coordinates form double floating points to integers. If they are converted to 32bit (by multiplying with 10,000,000) we get resolution of about 1cm. OSM stores coordinates as 32 bit integers and Solr does too.
- Bit-interleave latitude and longitude into a single 64 bit integer. This value is the Morton code of a Z-order curve. (There are other space filling curves, like Hilbert curve, that give better spatial locality, but due to additional computational complexity it is questionable, if they would provide better performance overall. Morton code calculations are simpler and can also be accelerated with SIMD.)
- Index that 64-bit integer to any tree that can perform 1D range search (eg Btree)
Steps to query are a bit more complex:
- Get AABB (or any other shape) of the area we want to query
- Calculate where the Z-curve intersects with the AABB edges to get ranges we need to query. This is the complex part. Some names of these functions are BigMin/LitMax, NextJumpIn, GetNextZ-Address. In a more recent paper[1] I saw a method of using quadtrees to calculate these regions, and if I remember correctly, it also performed a bit better.
- Merge ranges into longer continuous ranges where possible.
- Query ranges to get the Morton codes of the points inside that area.
- Since, ranges intersecting with the query-shape can cover a bit larger area that we need (in order too keep the number of ranges to a sensible amount), we need to filter resulting points, to exclude points that falls outside of exact query-area. Morton codes can be converted back to latitude and longitude coordinates in order to perform that filtering.
[1] I can look-up the name of paper using quadtrees to calculate ranges, if anyone is interested.
+1
some resources other engines uses
- https://s2geometry.io/
- https://github.com/yjh0502/rust-s2
- https://github.com/georust/geo
See https://github.com/quickwit-oss/tantivy/pull/2130 for a version using in-memory rstar
indexes and warmers spelled out. If there is sufficient interested, I would be willing to start an add-on crate using that approach.
(I am also working on flat K-D trees and R trees which could be memory-mapped directly from their on-disk representation avoiding the in-memory requirement. They do come from a simulation background though, i.e. generic N dimensional geometry. I guess one would want to specialize on Hilbert-packed two-dimensional trees for geographic search.)