geos icon indicating copy to clipboard operation
geos copied to clipboard

Improve Geometry:::Union() performance using clustering

Open dbaston opened this issue 3 years ago • 5 comments

This PR implements a pre-clustering step for Geometry::Union() to divide inputs into disjoint sets before performing the union operation. Performance testing shows a large benefit in datasets with many disjoint polygons and no significant penalty in datasets with no disjoint polygons.

I tried using envelope intersection, geometry intersection, and geometry distance as the basis for the clustering. The testing results show that geometry distance is preferred, since its performance is essentially equivalent to geometry intersection and is safe in the case where snap-rounding causes connected outputs from disjoint inputs. Timings are as follows, using geosop union with 5 repeats (-r 5) for all except the most complex dataset where only one iteration was performed.

Dataset Iterations Geometry Clusters main [s] Geom isect [s] Speedup Geom dist [s] Speedup Env isect [s] Speedup Env Clusters
world 5 8226 12.496 1.62 87.0% 1.86 85.1% 7.42 40.6% 2549
watersheds 5 1 11.273 11.1 1.5% 10.94 3.0% 11.33 -0.5% 1
watersheds_buf 5 1 15.414 15.87 -3.0% 15.83 -2.7% 15.18 1.5% 1
counties 5 98 107.776 105.7 1.9% 107.4 0.3% 102.065 5.3% 66
subdiv 1 175694 217 44.6 79.4% 53 75.6% 196 9.7% 7278
voronoi 50 1 2.5 2.8 -10.8%
vt parcels 1 1 90 110 -23.6% 94 -5.6% 1

The datasets are available in https://drive.google.com/drive/folders/1YNDsce_YiewgOiafPeJJOF4_AXlenX1C?usp=sharing

dbaston avatar Sep 27 '22 03:09 dbaston

The thing I don't understand here is, isn't CascadedUnion already effectively doing this? The tree puts near-to-each-other things early in the union order. Is this on top of, or in replacement of, cascadedunion?

pramsey avatar Sep 27 '22 15:09 pramsey

This is on top of cascaded union. This prevents the pieces that don't intersect from going into the overlay engine together.

dbaston avatar Sep 27 '22 16:09 dbaston

Counter-intuitive. I'd have thought, working bottom-up, the disjoint parts would mostly come together near the end and get quickly no-opped with an envelope intersection test.

pramsey avatar Sep 27 '22 16:09 pramsey

The first and last columns ("geometry clusters" and "envelope clusters") give some idea of this. Canadian subdivisions (last row) form 175k geometry clusters but only 7k envelope clusters.

dbaston avatar Sep 27 '22 16:09 dbaston

Maybe it's possible to do a reasonable algorithm selection based on average number of vertices/polygon. Above a certain number of vertices/polygon the union operation can be expected to be expensive enough that the pre-clustering isn't going to slow things down too much even if it accomplishes nothing. Here are the vertex count distributions for the current test datasets. The datasets that see meaningful slowdown (voronoi, parcels) have ~2 orders of magnitude fewer vertices than the others.

image

dbaston avatar Sep 28 '22 20:09 dbaston

Updated to revert changes to the default union strategy. Instead, GEOSDisjointSubsetUnion is exposed to the C API so clients can use it if they like.

dbaston avatar Feb 21 '23 00:02 dbaston