Serialization of built STRTree
Copied from https://trac.osgeo.org/geos/ticket/1130
For usage in cluster computing workloads, the possibility of dumping / loading a built STRTree to/from some (internal) binary format through the CAPI would be a major enhancement.
Now, a tree needs to be rebuilt in every worker, which has considerable overhead.
It is unclear to me how much effort this would require. But some users would be very happy with this enhancement.
Related:
https://github.com/pygeos/pygeos/issues/274 https://github.com/Toblerity/Shapely/issues/1033
Tree building time is dominated by sorting the items, so we could speed it up by creating a tree from a list of pre-sorted items. The existing C API gets us most of the way there using the following steps:
- Create a tree
- Query the tree so that it gets built
- Retrieve the items in tree-sorted order using
GEOSSTRtree_iterate - Build a tree from items already in tree-sorted order
This gives only a minor performance gain because std::sort is not much cheaper on a vector that is already sorted. To get better performance, we need a way to signal that no sorting is needed. Since we lack a GEOSSTRtree_build function, we could add one with an isSorted parameter.
I did a performance test (https://github.com/dbaston/libgeos/commit/f5ca63b268006aefbfa4d45eedc8ab87f3908707) to see the potential gains, which look to be about 70%. Is this worth it?
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
BM_BuildTree 1395533 ns 1395409 ns 470
BM_BuildTreePresorted 430600 ns 430564 ns 1660
Thanks for the fast response and benchmark!
I must say I am no frequent user of distributed workloads of geometries. Maybe @jorisvandenbossche can judge if the 70% improvement is worth the effort?
Are the geometries actually necessary for the tree or only some simplified (bbox?) version of them?
Are the geometries actually necessary for the tree or only some simplified (bbox?) version of them?
Yes, only the bounding boxes (envelopes) are needed. In our shapely code, we insert the actual geometry, but the C API directly gets the envelope and only inserts that in the tree:
https://github.com/libgeos/geos/blob/32348a68c5212c89cfefab46891d2b3aada4ab40/capi/geos_ts_c.cpp#L3541-L3549
But creating a tree with inserting the envelopes ourselves should be equivalent. So if we would "mimick" serializing a tree, we could store the sorted envelopes, and recreate a tree from that.
In that case the serialised form of a tree could be a vector of envelope corner coordinates ( (x1, y1, x2, y2) * n). That would be a factor 2-3 smaller in size than serializing the envelope geometries to wkb.
In the proposal by @dbaston that is possible through the iterate CAPI, but then we get a superfluous conversion to Geometry. For deserializing we would also need to construct geometries, only to be converted back to envelopes by the tree constructor.
My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup.
In that case the serialised form of a tree could be a vector of envelope corner coordinates ( (x1, y1, x2, y2) * n). That would be a factor 2-3 smaller in size than serializing the envelope geometries to wkb.
You could serialize the envelopes however you like. If you did use WKB, I would probably represent the entire tree using a single MultiLineString with two-point LineStrings representing the envelopes. Then build the tree with
GEOSGeometry* g = GEOSGeomFromWKB(buf, size);
int ngeoms = GEOSGetNumGeometries(g);
for (int i = 0; i < ngeoms; i++) {
GEOSSTRtree_insert(tree, GEOSGetGeometryN(g, i), items[i]);
}
My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup.
I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time.
I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time.
I am not sure how to judge this. I am probably not the right person as I am not personally needing this feature. Maybe a good metric would be the ratio between tree deserialization of N geometries and a common operation on those geometries?
Regarding the serialization, I tried to make an implementation to get the sorted envelopes out of a built tree. Sadly I only got the items back from the iterate and not the geometry or geometry envelopes. So there might be still some work on the GEOS side here? (https://github.com/caspervdw/Shapely/commit/fd06e9bd62d16e30963c6d8431b4a9a112d40fdd)
Close and move on?