h3 Suggestion: add an 'exhaustive' option to 'compact'

Suggestion: add an 'exhaustive' option to 'compact'

Open azmeuk opened this issue 3 years ago • 7 comments

I understand from the documentation, or this comment, that the compact function does not represent exactly an area, but is more a compact approximation of that area that can be used for storage or network transfer, and that can be uncompacted to represent, again, the whole area.

I have a use case where compacted data could be used in another purpose than storage or network transfer: My company use ZODB to store data for a web application, and geographic coordinates are stored as H3 cells. Those H3 cells can be indexed. Generally for one coordinates pair, we indexed its corresponding cell and a few parent cells.

Sometimes we have to query items based on an arbitrary geographic zone, so I use polyfill to get the list of cells to look for in the database. Every lookup in the database is resources costly, so I want to minimize the database reads. I thought I could use compact to reduce the number of cells to lookup in the database, but the imprecision of compact makes it not reliable enough to do this.

I suggest adding a exhaustive boolean parameter to the compact function. When True, it would ensure the returned cells cover the entire area.

What do you think?

Apr 25 '21 17:04 azmeuk

I don't think that's necessary, especially for a database application, you can do matches on compacted sets pretty simply.

I'm not familiar with ZODB, so I'll speak more generally, but for any compacted set you have a fixed range of resolutions to deal with. For example, the range could be resolutions 6 to 9. If you want to test if a point is contained within the compacted set, you simply generate H3 indexes at the finest resolution (9) and then generate the other three via h3ToParent (super cheap and following the exact same containment rules as compact). Then do a set intersection, with containment having an intersection of 1 H3 index and not being contained having 0 indexes coming out.

In a SQL-like database, this would be something like:

select other, columns
from mytable
where h3index in (res9index, h3ToParent(res9index, 8), h3ToParent(res9index, 7), h3ToParent(res9index, 6))

Since h3ToParent is just very simple bit manipulation of the index it's not only very fast, it is usually possible to define this transform function in your database of choice so you can actually have your query look like the above, and for more extensible databases you could define an index type that does this manipulation for you under the hood, so then the containment comparison automatically includes any h3Index that is a parent if the index you supplied.

An "exhaustive" parameter would both cause duplicate entries when uncompacting and would potentially return more than 1 index on the set intersection, which can make joining on that data do unexpected things.

Apr 26 '21 17:04 dfellis

Thank you for your answer.

I'm not familiar with ZODB

I use it really as a key-value database, the index is basically a map where keys are h3 cells, and values are sets of references to objects located in that cell. If you are familiar with python, this is very similar to use dict.

If you want to test if a point is contained within the compacted set [...]

The query I do is retrieving all the objects present in a given area. I cannot iterate on all the objects in the database to test them all because they are too many, this is why I made an index :)

Also, I cannot perform h3 specific operations server-side because there is no such thing implemented, but computing all the cells to search in the DB at the client-side seems OK to me.

An "exhaustive" parameter would both cause duplicate entries when uncompacting and would potentially return more than 1 index on the set intersection, which can make joining on that data do unexpected things.

I think this is not a problem as we can work the data afterward to make values unique. You just have to be aware that there may be double values when you use exhaustive compact.

Apr 26 '21 18:04 azmeuk

Then I think querying for "root" index and all parent indexes should work client side for you, though be a bit expensive for the database since it runs the scan n times.

Based on your description, though, I'm not sure how your "exhaustive" proposal would actually work? Wouldn't you still have to generate the h3index client side and test it's containment through presence in the index? You wouldn't be providing a lat, lng pair and testing containment of that point on each index, as that would be very slow. Could you explain how this "exhaustive" mode would help you out?

Apr 26 '21 18:04 dfellis

Based on your description, though, I'm not sure how your "exhaustive" proposal would actually work?

That would just imply that the generated set of cells wouldn't have any holes.

Ideally, compact + exhaustive=True would output the smallest set of cells that covers the entire area given in input.

Then I think querying for "root" index and all parent indexes should work client side for you, though be a bit expensive for the database since it runs the scan n times.

I don't think it fits my usecase. Sorry to be unclear, I will try to explain it better with this excessively simplified python snippet:

import h3
from collections import namedtuple
from random import uniform

# my real usecase is a web application.
# suppose objects are stored and indexed once for all, rarely edited, but often read

write_resolution = 10
read_resolution = 5

# definition of a simple FoobarObject with 'lat' and 'lng' parameters
FoobarObject = namedtuple("FoobarObject", ["lat", "lng"])

##############
# Indexation #
##############

# creation of a set of object, with random coordinates
nb = 100
objects = [
    FoobarObject(lat=uniform(-90., 90.), lng=uniform(-180., 180.)) for _ in range(nb)
]

# creation of an index, where a key is a cell and a value is a
# sets of FoobarObject located in that cell.
# suppose access to this index is done through a database adapter, goes sometimes
# through network, thus read and writes are costly
index = {}

# iterate over all the FoobarObjects...
for obj in objects:
    cell = h3.geo_to_h3(obj.lat, obj.lng, write_resolution)

    # iterate over all the parent cells.
    # (suppose we have good reasons to index all the parent
    # resolutions, and it is cheap anyway)
    for res in range(write_resolution + 1):
        c = h3.h3_to_parent(cell, res)

        # store the cell in the index
        index.setdefault(c, set()).add(obj)

##############
#   Query    #
##############

# define a random polygon
# in real life, it is defined by a end-user and can arbitrarily be great or small
lat1 = uniform(-90., 90.)
lat2 = uniform(-90., 90.)
lng1 = uniform(-180., 180.)
lng2 = uniform(-180., 180.)

lat_min = min(lat1, lat2)
lat_max = max(lat1, lat2)
lng_min = min(lng1, lng2)
lng_max = max(lng1, lng2)

sw = lat_min, lng_min
se = lat_min, lng_max
nw = lat_max, lng_min
ne = lat_max, lng_max

# get cells inside that polygon
query_cells = h3.polyfill_polygon(
    [sw, se, ne, nw],
    # suppose we compute a clever read resolution here,
    # based on the size/area of the polygon
    read_resolution,
)

# query the FoobarObjects whose cells match the polygon
# in real life situation, this kind of query is resource expensive
# so the minimum cells in the query, the better
#
# if 'compact' was trustworthy (i.e. without holes), we could use it here
# to reduce the number of cells to query
# query_cells = h3.compact(query_cells)

index_entries = [index.get(cell, set()) for cell in query_cells]

# deduplicate results
query_objects = set.union(*index_entries)

This code actually works and represents the issue I encounter in real life. Index are created and edited very rarely, but query happens very often, and in a web request so the delivery has to be fast.

Disk space is cheap, and indexes access is fast, and does scale. In my situation there is no problem with indexing large quantities of data (e.g. index all the parent cells for any given FoobarObject).

This code would be more efficient with the compact option I suggest, since it would reduce the number of cells to query in the index.

Does it help?

Then I think querying for "root" index and all parent indexes should work client side for you, though be a bit expensive for the database since it runs the scan n times.

I am in a web context, so I cannot spend time on iterating a lot of objects in the database. Ideally I must have a constant or logarithmic time access, like the snippet does.

Wouldn't you still have to generate the h3index client side and test it's containment through presence in the index? You wouldn't be providing a lat, lng pair and testing containment of that point on each index, as that would be very slow. Could you explain how this "exhaustive" mode would help you out?

I am afraid I do not understand your question. Is it still relevant a the sight of this code snippet?

Apr 26 '21 20:04 azmeuk

Getting back to this (end of the work day), I don't see how even your "exhaustive" compact proposal would work here. If there was a single compacted hexagon representing 7 children in the original polygon, then the fill in the "gaps" (I still don't agree with this reasoning, but continuing on) would require adding back in 6 of the 7 children, which means you don't actually compact anything at all, as the parent hexagon would only actually be representing its center child. You can get some minor compaction benefits if there's multiple contiguous compacted hexagons, but not that much.

Furthermore, that center child hexagon that is represented by the parent hexagon? That would still fail your query here because it's resolution is different and therefore any coordinates within the center child of the parent hexagon would also fail to show up.

The source of my confusion was because I assumed the compacted polygon was what was being stored in the database, not the inverse, which is point data in the database and the query being a polygon drawn around them. Since your query requires probing the database on a per-key basis (it's a hash index, not a b-tree index), I cannot see any way to avoid querying on each individual h3 index at the resolution you have indexed the data to, because any parent hex ID will fail because the value of that ID by definition cannot match any of its children, since it is a different hexagon.

If you can do range queries, treating the index as an integer, there are some possibilities for efficiency gains here, but if this is truly only a hash index, then any compact set will fail for you, "exhaustive" or not.

Now, if I could refactor your database schema? Your particular situation with mostly static point data queries often is also perfectly suited for an R-Tree, which PostGIS provides (along with a lot of other things). You could bolt this onto your ZODB database by using rtreelib to index your data with simple UUIDs as the payload data and put the actual payload in the ZODB keyed by that UUID, then take the resultant rtree index and shove it into a special key, such as 'rtree'. This only makes sense if the rtree index is much smaller than the actual payload data in question, but then your queries should be very efficient, you first get the r-tree index, ask it for the data contained by your polygon, then ask ZODB for those UUIDs.

But if you want to stick to H3, you could instead index your data at a few coarser resolutions, such that for the various sized geofences that you expect to query with, you keep the number of hexagons that polyfill it at around 10 or less for one of the resolutions in question. So for example you have your data indexed a few times at resolutions 4, 5, 6, and 7.

You run polyfill on these resolutions, starting with the smaller one and going to the largest, and then select the prior resolution when you cross the threshold of 10 (or just select the highest resolution otherwise). Then you take this set of hexagons, compute the first kRing around (and including) them, and then dedupe the results, this makes sure you have total coverage around your polygon. Now query the hexagons at the chosen resolution from the database and you will get all of your data plus some overage, but far fewer than doing a full index scan. Take this subset of points and do a point-in-poly pass with the query polygon to eliminate false positives and you have your result.

This is assuming that the number of points of interest inside of the polygon is much smaller than the number of hexagons at your indexing resolution, and that the total number of points indexed is much smaller than the entire index. Otherwise other options may be a better bet, but the scaling factor of this particular approach is: your data has a constant factor overhead (say 4x) in the database, but you also only have a constant factor number of queries to run (approx 10) and then you have an O(n) factor of eliminating false positives with point-in-poly.

With a more traditional database where you can have multiple indexes, the storage overhead would only be the indexes and not the payload itself. You could fake that with the same UUID technique I mentioned before, but that doubles the number of queries to run (first for the UUIDs, second for the actual payload). Depending on the size of each point of data that could still make sense.

Apr 27 '21 01:04 dfellis

any compacted set you have a fixed range of resolutions to deal with. For example, the range could be resolutions 6 to 9. If you want to test if a point is contained within the compacted set, you simply generate H3 indexes at the finest resolution (9) and then generate the other three via h3ToParent (super cheap and following the exact same containment rules as compact). Then do a set intersection, with containment having an intersection of 1 H3 index and not being contained having 0 indexes coming out.

@dfellis Some area might not be covered by compact cells as shown in here https://github.com/uber/h3-js/issues/99. So if we take a point in gap area from here (https://user-images.githubusercontent.com/20339513/96299901-bcc48580-0fc2-11eb-9c3f-efdb17588aef.png). Will any resolution h3 cell around the point be contained in compact cells? Given that visualisation of all the compact cells does not cover area in gaps.

Mar 25 '22 09:03 Aniket1mg

@Aniket1mg everything above was explaining why the compact visualization that has gaps still results in full coverage when you simply convert the point query into one that performs two set intersections (the compacted set with the set of geoToH3 hexagons for each resolution represented in the compacted set).

If the point is covered, exactly one h3 index will be in the intersection, otherwise zero will be.

Much of the discussion above was about how to do this set intersection efficiently within a database.

Mar 25 '22 13:03 dfellis

h3 h3 copied to clipboard

Suggestion: add an 'exhaustive' option to 'compact'

h3
h3 copied to clipboard