spatial4j
spatial4j copied to clipboard
Encode Geo-hashes with Precision
Encode Geo-hashes with Precision
By looking at the Earth shape, one recognizes the size of geohash cells shrinks by moving away from the equator along an arbitrary latitude. In other words the radius depends on the latitudinal value. In cases where a user should able to define a precision he needs, this fact can be used to reduce the length of generated hashes. Also a precision value is a reasonable alternative to the currently used maxtreelevel value, used in geohash implementations.
An Example
The geohash 7zzzzzz refers to a geohash cell latlon1~(-0.00137, -0.00137) to latlon2=(0, 0) near the equator. The distance between these point corresponds to ~216.19656m. A geohash cell at a pole gzzzzzz with the same hash length refers to a cell with a size up to ~152.87406m. If i.e. a precision of 200m is accurate enough to encode geo-positions one can use hashes of length 8 for points between the latitudes -32.4837° and +32.4837° and geohashes of length 7 otherwise.
Hi; looks interesting. I hope to give it a finer look this weekend. Have you seen SpatialPrefixTree in Lucene spatial? It already deduces the proper geohash length for the desired precision.
Hi @dsmiley, I've seen the SpatialPrefixTree in Lucene and I'm working on this already, but it seems to me that it reduces the geohashes by lat/lon errors only. My idea is to consider the curvature of spheres. So we're able to save another treelevel.
The grid cells do get smaller towards the poles but only longitudinally. In other words, they get skinnier, but they have the same height relative to anywhere else. So for a given desired real world distance precision you are looking to maintain, doesn't this mean you effectively need a particular geohash length no matter where the cell is?
No. By looking at the size of cells on different levels the influence of the curvature raises. This means a fixed length of geohashes compute cells that will be much smaller than a given distance. Namely those away from the equator. For these cells we can ignore the last character of the hash and still perform the given precision.
It would be nice to empirically test/evaluate what the precision is by developing a simple program or test, that way it can be better understood. At least your patch makes your feature addition optional. I'm concerned that your implementation will slow down the geohash generation substantially since at each character it computes a distance which is a non-trivial calculation to do this frequently.
By the way, on my mental roadmap of stuff to work on is creating a new SpatialPrefixTree encoding that is more optimal than geohash. I've used geohash to date simply because there was existing code that only needed minor additions. A more optimal encoding would have these properties:
- Hilbert Curve ordering
- Configurable # sub-cells at each level (e.g. top level use 256 but then use 16 for remaining levels, and 64 at the bottom).
- Use positive power-of-2 integer dimensions instead of direct double lat/lon. Allows faster computation using bit shifting. At the beginning and/or end the answer can be multiplied by a scale factor and shifted (-180.0 to 180.0)
- Consider mapping latitudes and longitudes to an equal-area map projection such that each grid cell has the same real world area.
- To avoid skinny cells, if a cell's width is less than half its height, then divide the cells into 4 vertically stacked instead of 2 by 2.
The ideas you are planning, will improve the project a lot, I think. Especially configurable sub-cells is a nice approach. But I also think that these changes will need some time to be implemented. In my opinion the precision enhancement should be included to keep the progress alive. I accept your concerns my implementation will slow down the the geohash generation. But as you also mentioned, the size calculation is optional. So I can add a note to the documentation which will describe this behaviour.
Please take a look at the feature-branch I added for your commit. I made some modifications. I'm have pretty basic GitHub skills but I believe you can comment at the source code level and I'll see those remarks in context.
I had a look at the changes you have made and I'm concerned about the parameter name precisionDeg only. The whole idea is based on a measurement different from the angular measures. The circumference of a sphere is always 360°, no matter which latitude is set. Measuring the distance in fractions of the major circumference leads to values that depend on the latitudinal value. The precision used is not measured in a angular measure and should not carry the suffix Deg.
The suffix "deg" shows it's degrees based instead of radians. This convention is consistent with other parts of Spatial4j. Degrees in the context of a distance is an angular measure, not dependent on latitude.
I realize that this choice that I made with Spatial4j is perhaps controversial; I don't know of other spatial APIs that work this way. We wanted to standardize the distance measure for all of Spatial4j. Picking KM or M assumes a particular earth radius, and there are multiple to pick from. Picking radians solves that. Though there's something unsatisfying about these choices because they differ substantially with the X & Y units (lat & lon degrees if geospatial). I had the realization that all the X's and Y's (longitudes and latitudes) had units -- degrees, and so why not use the same? By picking degrees, this has the nice benefit that the definition of the radius of a circle doesn't substantially change depending on wether you have a cartesian/euclidean context or a geodetic one. The radius and distances are in the same units as X & Y. I am aware of course the real world distance of a longitude depends on the latitude, but nonetheless the unit is "degrees". The effective range of any of these numbers are thus also constrained 0-180 or 0-90.