ardb
In-memory geohash-based search
I have downloaded your software and I have started to compute geohashes.
https://github.com/yinqiwen/ardb/wiki/Spatial-Index
Is it possible to do an in-memory geohash-based indexer?
I can create a HashMap of the 18 values: (geohash number, geohash number + 1) and their eight neighbors each. Supposing I want to search for coordinates within 5 km, what should my precision value be? Also, I do not understand why I must left-shift to 52 bits.
What data structure would be best to store this so I can do fast searches in memory (without an external database)? Quadtree, R-tree, or HashMap?
What I thought was to store all 18 members in a HashMap and do lookups.
- This spatial index has been ported to Redis since 3.2; it's all in memory.
- A 52-bit geohash value represents a box on the map whose width/height is less than 0.6 meters, so you can do range searches from 0.6 meters upwards (see the sketch after this list).
- Since all data encoded by geohash must be stored in a sorted data structure, a tree/skiplist seems better.
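To make those numbers concrete, here is a rough sketch (my own, not ardb's code) of how the step count relates to box size and to the 5 km question above. It assumes the interleaved encoding used by ardb/Redis, where `step` bits per coordinate give a 2*step-bit hash, and it approximates the cell width at the equator:

import math

# "step" = bits per coordinate; the interleaved hash has 2 * step bits,
# so step 26 corresponds to the 52-bit value mentioned above.
EARTH_CIRCUMFERENCE_M = 40075017.0

def cell_width_m(step):
    # Approximate east-west size of one cell at the equator.
    return EARTH_CIRCUMFERENCE_M / (1 << step)

def step_for_radius(radius_m):
    # Pick the largest step whose cell still covers the search diameter;
    # the search then inspects that cell plus its 8 neighbors.
    step = 26
    while step > 1 and cell_width_m(step) < 2 * radius_m:
        step -= 1
    return step

print(cell_width_m(26))       # ~0.60 m, the "< 0.6 meters" box above
print(step_for_radius(5000))  # ~11 bits per coordinate for a 5 km radius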
It looks like you are right on the choice of data structure. I extended your code with Cython so that I can call your encode and get_neighbors APIs from Python (I can give you the Cython extension). Then I created a lat/lon grid, populated geoHash, geoHash+1 and their 16 neighbors via encode and get_neighbors calls, and bit-shifted all of them left by 52.
Then I encoded a test value to check whether it is present in the HashMap, and I got a No. So I will have to use an ordered set to do range queries (a range-query sketch follows the code below).
from geo import GeoHash          # Cython wrapper around ardb's geohash code
from collections import OrderedDict
import numpy as np

# Grid bounds and 0.05-degree resolution for the sample points.
minLat = 27.401436
maxLat = 62.54858
minLon = -180.0
maxLon = 179.95000000000002

latGrid = np.arange(minLat, maxLat, 0.05)
lonGrid = np.arange(minLon, maxLon, 0.05)

geoHash = GeoHash()
geoHashTable = OrderedDict()

for lon, lat in zip(lonGrid, latGrid):
    # Encode the point at step 24, shift by 52 and use the result as the key.
    geohash = geoHash.encode(lon, lat, 24)
    bitsOriginal = geohash["bits"]
    geoHashTable[bitsOriginal << 52] = np.c_[lon, lat]

    # Store the 8 neighbors of geohash under the same point.
    neighbors = geoHash.get_neighbors(geohash)
    for k, v in neighbors.items():
        if isinstance(v, dict):
            geoHashTable[v["bits"] << 52] = np.c_[lon, lat]

    # Store geohash + 1 and its 8 neighbors as well.
    bitsOriginal1 = bitsOriginal + 1
    geohash1 = {"bits": bitsOriginal1, "step": 24}
    geoHashTable[bitsOriginal1 << 52] = np.c_[lon, lat]

    neighbors1 = geoHash.get_neighbors(geohash1)
    for k, v in neighbors1.items():
        if isinstance(v, dict):
            geoHashTable[v["bits"] << 52] = np.c_[lon, lat]

# Exact-match lookup of a test point: its hash is not one of the stored keys.
lonTest = 172.76843
latTest = 61.560745
geohashTest = geoHash.encode(lonTest, latTest, 24)
bitsTest = geohashTest["bits"]
finalBits = bitsTest << 52

if finalBits in geoHashTable:
    print("Yes")
else:
    print("No")
I think I have a way to do this using a HashTable. The idea is the same, i.e. approximate nearest neighbors.
What one can do is create a geohash using a very low precision (to be identified). Then points within a given radius will all probably yield the same geohash.
Then one can create a hash table, create a key (what determines a key is to be determined), and store the values as the approximate nearest neighbors.
Then, given a query point whose geohash is calculated at the same low precision, one can do an O(1) lookup.
This is the same idea that is expressed in Locality Sensitive Hashing. Does that make sense, or is it not possible?
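A minimal sketch of that bucketing idea, assuming the same GeoHash Cython wrapper and encode(lon, lat, step) signature as in the snippet above; the coarse step of 12 is an arbitrary placeholder for the "very low precision" still to be identified:

from collections import defaultdict
from geo import GeoHash  # the same Cython wrapper as above

COARSE_STEP = 12  # placeholder "low precision"; larger cells cover a larger radius

geoHash = GeoHash()
buckets = defaultdict(list)

def coarse_key(lon, lat):
    # All points falling in the same coarse cell share this key.
    return geoHash.encode(lon, lat, COARSE_STEP)["bits"]

def insert(lon, lat):
    buckets[coarse_key(lon, lat)].append((lon, lat))

def approx_neighbors(lon, lat):
    # O(1) lookup: every stored point in the query's coarse cell.
    return buckets.get(coarse_key(lon, lat), [])

The usual caveat is that two points just across a cell boundary hash to different buckets, so a single-probe lookup like this can miss near neighbors unless the eight neighboring buckets are probed as well.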
Storing all points in a skiplist/tree lets you search by any given radius value.
IMO, storing points in hash + vectors as you described can only be used to search by one fixed radius value.
I agree with what you are saying, but if you look at this article - https://arxiv.org/pdf/1408.2927.pdf - each one of those radius values can be stored as its own hash + vectors. That consumes more memory, but it comes with the benefit of O(1) lookup. My original example was for a single radius value, but you can replicate the hash + vectors for all radius values and still do O(1) lookups.
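To illustrate the "replicate the hash + vectors for all radius values" point (my own sketch, not taken from the paper): each radius maps to a precision level, and coarser levels simply use fewer bits per coordinate, so one bucket table can be kept per level:

from collections import defaultdict
from geo import GeoHash  # same Cython wrapper as above

STEPS = [10, 12, 14, 16]  # illustrative levels: roughly ~40 km down to ~600 m cells

geoHash = GeoHash()
tables = {step: defaultdict(list) for step in STEPS}

def insert(lon, lat):
    # Each point is stored once per level, so memory grows with len(STEPS).
    for step, table in tables.items():
        table[geoHash.encode(lon, lat, step)["bits"]].append((lon, lat))

def lookup(lon, lat, step):
    # O(1): choose the level whose cell size matches the desired radius.
    return tables[step].get(geoHash.encode(lon, lat, step)["bits"], [])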
Also, have you looked at http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/ ? S2 cells have the advantage that they are laid out along a true space-filling curve.
IMO, most CPU time is spent on calculating the distances between the given lat/lon and the points in the found areas, so using an O(1) lookup to find the areas does not help much with performance.
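For context, that per-candidate distance check is typically a haversine (great-circle) computation; a minimal sketch of the filtering step (mine, not ardb's internal code):

import math

EARTH_RADIUS_M = 6371000.0  # approximate mean Earth radius in meters

def haversine_m(lon1, lat1, lon2, lat2):
    # Great-circle distance in meters between two lon/lat points.
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    u = math.sin((lat2 - lat1) / 2.0)
    v = math.sin((lon2 - lon1) / 2.0)
    return 2.0 * EARTH_RADIUS_M * math.asin(
        math.sqrt(u * u + math.cos(lat1) * math.cos(lat2) * v * v))

def within_radius(query_lon, query_lat, candidates, radius_m):
    # The step that dominates CPU time: one distance per candidate point.
    return [p for p in candidates
            if haversine_m(query_lon, query_lat, p[0], p[1]) <= radius_m]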
A Hilbert space-filling curve is actually a better algorithm than the current geohash encoding; it would reduce the search range.