duckdb_spatial
duckdb_spatial copied to clipboard
`ST_UnionAgg` discrepancy with geopandas (GEOS)
While working on the Ibis support for duckdb geospatial, I ran into a small discrepancy with ST_UnionAgg vs the Geopandas equivalent unary_union. Geopandas relies on the GEOS implementation, and while DuckDB docs also say they rely on the GEOS implementation the generated geometries where slightly different.
See red point slightly off in the plot in https://gist.github.com/ncclementi/fdc83522f605ffda7173420cb98f08cc
For testing purposes we found out the difference was minimal given that after enforcing a precision with shapely.set_precision we got the test to pass but we still want to share what we found.
cc: @jorisvandenbossche for visibility
Hi! Thanks for the nice test-case. We do use GEOS, however st_union_agg, is implemented by iteratively union:ing all the geometries into an accumulator one by one, while it seems to me from a quick glance that GeoPandas (actually Shapely) uses a single call to UnaryUnionOp over a collection of geometries. My guess is that the geos binary union is not completely commutative in all cases and that the discrepancy will change depending on the order of the input set.
We could implement an overload for the basic st_union that works similar to shapely, e.g. taking a list of geometries and using the unary union operation. We could then use it as an aggregate using DuckDB's list_agg, which should give the same result as geopandas/shapely, but with the similar limitation that the input set needs to fit into memory.
We should also probably optimize st_union_agg to use a single unary union operation for each vector of input instead of unioning all geometries one by one. That may also help in making the result more in line with geopandas.
Thanks for the explanation @Maxxen .
As far as I have experimented with st_union there is no discrepancy, it only happens in the agg case. We tested against geopandas here and thing match https://github.com/ibis-project/ibis/pull/7454/files#diff-2553b7a98d93b9e3e71a8ef733bd538d2b3f2f2f3b1ece1576195f52a16b1305R139-R153
Idk how much you care to match exactly geopandas/shapely, if it is at the expense of performance, but when there are some more docs, it might be worth mentioning these caveats.
We should also probably optimize
st_union_aggto use a single unary union operation for each vector of input instead of unioning all geometries one by one. That may also help in making the result more in line with geopandas.
When I was looking at it with @ncclementi, this is also what I was going to suggest. It's logical that duckdb can give tiny differences in the result because indeed unions are not perfectly associative or commutative, and duckdb cannot necessarily do the operation in one step. But I think there is no reason (apart from the effort to implement it, of course ;)) to not use GEOS' UnaryUnion for each batch of geometries, compared to calling Union for every single geometry. That should provide a nice optimization I think (UnaryUnion should have a bunch of tricks to optimize this)
We could implement an overload for the basic
st_unionthat works similar to shapely, e.g. taking a list of geometries and using the unary union operation.
Just a small correction: shapely's union is also only a binary function, we have a specific unary_union/union_all that is the aggregation variant (and essentially the equivalent of st_union_agg, and the one that uses GEOS' UnaryUnion under the hood)
Sounds good! I agree the current implementation is a bit lacking due to laziness on my part - I've been meaning to revisit union ops in general to support other variants provided by GEOS and PostGIS like ST_MemUnion.
I'll report back in this thread once I've implemented the unary union optimization in the aggregate and we can rerun the shapely comparison.
I figured it's probably worth adding my experience, which is that I have a groupby-dissolve in geopandas that runs in ~5s which I hoped to recreate in duckdb, but it takes minutes to run. I'm happy to see you're already aware of the performance issue, and I'm more than happy to try out the new method when the time comes!