Improve Geometry:::Union() performance using clustering
This PR implements a pre-clustering step for Geometry::Union() to divide inputs into disjoint sets before performing the union operation. Performance testing shows a large benefit in datasets with many disjoint polygons and no significant penalty in datasets with no disjoint polygons.
I tried using envelope intersection, geometry intersection, and geometry distance as the basis for the clustering. The testing results show that geometry distance is preferred, since its performance is essentially equivalent to geometry intersection and is safe in the case where snap-rounding causes connected outputs from disjoint inputs. Timings are as follows, using geosop union with 5 repeats (-r 5) for all except the most complex dataset where only one iteration was performed.
| Dataset | Iterations | Geometry Clusters | main [s] |
Geom isect [s] | Speedup | Geom dist [s] | Speedup | Env isect [s] | Speedup | Env Clusters |
|---|---|---|---|---|---|---|---|---|---|---|
| world | 5 | 8226 | 12.496 | 1.62 | 87.0% | 1.86 | 85.1% | 7.42 | 40.6% | 2549 |
| watersheds | 5 | 1 | 11.273 | 11.1 | 1.5% | 10.94 | 3.0% | 11.33 | -0.5% | 1 |
| watersheds_buf | 5 | 1 | 15.414 | 15.87 | -3.0% | 15.83 | -2.7% | 15.18 | 1.5% | 1 |
| counties | 5 | 98 | 107.776 | 105.7 | 1.9% | 107.4 | 0.3% | 102.065 | 5.3% | 66 |
| subdiv | 1 | 175694 | 217 | 44.6 | 79.4% | 53 | 75.6% | 196 | 9.7% | 7278 |
| voronoi | 50 | 1 | 2.5 | 2.8 | -10.8% | |||||
| vt parcels | 1 | 1 | 90 | 110 | -23.6% | 94 | -5.6% | 1 |
The datasets are available in https://drive.google.com/drive/folders/1YNDsce_YiewgOiafPeJJOF4_AXlenX1C?usp=sharing
The thing I don't understand here is, isn't CascadedUnion already effectively doing this? The tree puts near-to-each-other things early in the union order. Is this on top of, or in replacement of, cascadedunion?
This is on top of cascaded union. This prevents the pieces that don't intersect from going into the overlay engine together.
Counter-intuitive. I'd have thought, working bottom-up, the disjoint parts would mostly come together near the end and get quickly no-opped with an envelope intersection test.
The first and last columns ("geometry clusters" and "envelope clusters") give some idea of this. Canadian subdivisions (last row) form 175k geometry clusters but only 7k envelope clusters.
Maybe it's possible to do a reasonable algorithm selection based on average number of vertices/polygon. Above a certain number of vertices/polygon the union operation can be expected to be expensive enough that the pre-clustering isn't going to slow things down too much even if it accomplishes nothing. Here are the vertex count distributions for the current test datasets. The datasets that see meaningful slowdown (voronoi, parcels) have ~2 orders of magnitude fewer vertices than the others.

Updated to revert changes to the default union strategy. Instead, GEOSDisjointSubsetUnion is exposed to the C API so clients can use it if they like.