geos icon indicating copy to clipboard operation
geos copied to clipboard

Serialization of built STRTree

Open caspervdw opened this issue 3 years ago • 7 comments

Copied from https://trac.osgeo.org/geos/ticket/1130

For usage in cluster computing workloads, the possibility of dumping / loading a built STRTree to/from some (internal) binary format through the CAPI would be a major enhancement.

Now, a tree needs to be rebuilt in every worker, which has considerable overhead.

It is unclear to me how much effort this would require. But some users would be very happy with this enhancement.

Related:

​https://github.com/pygeos/pygeos/issues/274 ​https://github.com/Toblerity/Shapely/issues/1033

caspervdw avatar Oct 17 '22 09:10 caspervdw

Tree building time is dominated by sorting the items, so we could speed it up by creating a tree from a list of pre-sorted items. The existing C API gets us most of the way there using the following steps:

  • Create a tree
  • Query the tree so that it gets built
  • Retrieve the items in tree-sorted order using GEOSSTRtree_iterate
  • Build a tree from items already in tree-sorted order

This gives only a minor performance gain because std::sort is not much cheaper on a vector that is already sorted. To get better performance, we need a way to signal that no sorting is needed. Since we lack a GEOSSTRtree_build function, we could add one with an isSorted parameter.

I did a performance test (https://github.com/dbaston/libgeos/commit/f5ca63b268006aefbfa4d45eedc8ab87f3908707) to see the potential gains, which look to be about 70%. Is this worth it?

----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_BuildTree             1395533 ns      1395409 ns          470
BM_BuildTreePresorted     430600 ns       430564 ns         1660

dbaston avatar Oct 17 '22 21:10 dbaston

Thanks for the fast response and benchmark!

I must say I am no frequent user of distributed workloads of geometries. Maybe @jorisvandenbossche can judge if the 70% improvement is worth the effort?

Are the geometries actually necessary for the tree or only some simplified (bbox?) version of them?

caspervdw avatar Oct 18 '22 08:10 caspervdw

Are the geometries actually necessary for the tree or only some simplified (bbox?) version of them?

Yes, only the bounding boxes (envelopes) are needed. In our shapely code, we insert the actual geometry, but the C API directly gets the envelope and only inserts that in the tree:

https://github.com/libgeos/geos/blob/32348a68c5212c89cfefab46891d2b3aada4ab40/capi/geos_ts_c.cpp#L3541-L3549

But creating a tree with inserting the envelopes ourselves should be equivalent. So if we would "mimick" serializing a tree, we could store the sorted envelopes, and recreate a tree from that.

jorisvandenbossche avatar Oct 18 '22 08:10 jorisvandenbossche

In that case the serialised form of a tree could be a vector of envelope corner coordinates ( (x1, y1, x2, y2) * n). That would be a factor 2-3 smaller in size than serializing the envelope geometries to wkb.

In the proposal by @dbaston that is possible through the iterate CAPI, but then we get a superfluous conversion to Geometry. For deserializing we would also need to construct geometries, only to be converted back to envelopes by the tree constructor.

My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup.

caspervdw avatar Oct 19 '22 07:10 caspervdw

In that case the serialised form of a tree could be a vector of envelope corner coordinates ( (x1, y1, x2, y2) * n). That would be a factor 2-3 smaller in size than serializing the envelope geometries to wkb.

You could serialize the envelopes however you like. If you did use WKB, I would probably represent the entire tree using a single MultiLineString with two-point LineStrings representing the envelopes. Then build the tree with

GEOSGeometry* g = GEOSGeomFromWKB(buf, size);
int ngeoms = GEOSGetNumGeometries(g);
for (int i = 0; i < ngeoms; i++) {
  GEOSSTRtree_insert(tree, GEOSGetGeometryN(g, i), items[i]);
}

My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup.

I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time.

dbaston avatar Oct 19 '22 13:10 dbaston

I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time.

I am not sure how to judge this. I am probably not the right person as I am not personally needing this feature. Maybe a good metric would be the ratio between tree deserialization of N geometries and a common operation on those geometries?

Regarding the serialization, I tried to make an implementation to get the sorted envelopes out of a built tree. Sadly I only got the items back from the iterate and not the geometry or geometry envelopes. So there might be still some work on the GEOS side here? (https://github.com/caspervdw/Shapely/commit/fd06e9bd62d16e30963c6d8431b4a9a112d40fdd)

caspervdw avatar Oct 27 '22 20:10 caspervdw

Close and move on?

pramsey avatar Jun 02 '23 20:06 pramsey