duckdb_spatial
OOM aggregate / `group by` query on Geometry type
Problem
Good morning! I want this query to run and perform an aggregate over the GEOMETRY-type column. Any help would be greatly appreciated.
Traceback
terminate called recursively
terminate called after throwing an instance of 'duckdb::OutOfMemoryException'
Out of Memory Error: could not allocate block of size 256.0 KiB (49.9 GiB/49.9 GiB used)
Alternatively, I will get:
zsh killed
Steps to re-create
Running this query against a table stored in a `.duckdb` file, locally:
```sql
SELECT
    row_id,
    ST_MakeValid(ST_Union_agg(geom_column)) AS geometry_column,
    array_agg(property_id) AS property_ids
FROM my_geometry_table
GROUP BY row_id;
```
- `my_geometry_table` is 35 GiB in the `.duckdb` file and comprises 153,180,272 rows.
- Hits OOM within a few seconds. I've set a `LIMIT 10` clause just to see, and it hits OOM in the same timeframe.
- I've set the DuckDB `memory_limit` to varying values from `512MB` up to `60GB` (the maximum memory on the machine), as sketched below.
- I've set `threads=1` as well.
- When I remove the `ST_MakeValid(ST_Union_agg())`, it runs.
- I then attempted this query in two parts, where I first just call `array_agg()` on the `GEOMETRY` column and, in a subsequent query, perform the `ST_Union_agg()`. Even the `array_agg()` gets OOM killed in moments.
- The memory/thread settings seem to have no effect.
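For completeness, this is roughly how I applied those settings (a sketch only; the actual values varied across the ranges described above):

```sql
-- Settings I varied (values here are illustrative)
SET memory_limit = '60GB';
SET threads = 1;

-- Still OOMs within seconds, even with a LIMIT applied
SELECT
    row_id,
    ST_MakeValid(ST_Union_agg(geom_column)) AS geometry_column,
    array_agg(property_id) AS property_ids
FROM my_geometry_table
GROUP BY row_id
LIMIT 10;
```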
I can't share the data underneath the query, unfortunately, but I'm wondering if the information provided suffices to diagnose a memory leak or an optimization I'm missing.
Environment:
% uname -a
Linux john-XPS-15-9520 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
% duckdb --version
v1.1.2 f680b7d08f
My Machine:
- 64GiB Memory
- 20 CPU
- Dell XPS
Hi! Thanks for opening this issue!
There's a couple of things at play here. Neither ST_Union_Agg nor array_agg is able to spill intermediate state to disk, as they are "variable size" aggregates - they basically keep pointers into an unpredictable amount of memory, so when they are offloaded to disk, only the pointer gets serialized. This is currently a fundamental limitation of how DuckDB works (actually, I'm not aware of any system that can spill variable-size aggregates to disk). However, in the case of array_agg, DuckDB will at least track the amount of memory used, which is why you sometimes get the OOM exception instead of the process just getting killed. For ST_Union_Agg, the actual union operation is handled by the GEOS library, which allocates memory outside of DuckDB's control, so DuckDB can't track the memory used by the final result geometry until the aggregate actually finishes.
Depending on how many groups row_id creates, this means that you will have to keep all the unioned geometries in memory at the same time until the full operation finishes, so it's unlikely that adjusting the number of threads DuckDB uses will make any difference. We do have some work in progress in DuckDB core that can make use of partitioning (and eventually sorting) information to avoid keeping aggregate states around once the system knows no more entries will be part of a group, but that's probably a long way out.
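If it's useful, a plain cardinality check like the sketch below (using the table and column names from your query; nothing spatial-specific) would give a rough idea of how many aggregate states the GROUP BY has to keep alive at once:

```sql
-- How many groups (and therefore simultaneous in-memory union states) the GROUP BY creates
SELECT count(DISTINCT row_id) AS group_count,
       count(*)               AS row_count
FROM my_geometry_table;
```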
In general, the spatial union operation is one of the most memory-intensive things you can do in geospatial processing. I haven't looked at our implementation in a while, so there might be some optimizations I missed the first time around, but as long as we rely on GEOS I think there will always be some amount of uncontrollable memory usage. Like most things related to spatial, I'd like to one day provide our own implementation, but building an overlay engine in particular would require a ton of work and testing to catch all the edge cases. Even then, I doubt st_union_agg would ever be immune to memory constraints.
So to summarize: I'll try to find some time to revisit this part of the code and see if it can be optimized, but it's probably always going to cause a lot of memory pressure, and I don't think there are any workarounds currently.
Thank you for such a prompt and thoughtful reply.
I'm currently using DBT to orchestrate model creation, so I'm going to have to employ some sort of dynamic batching.
With no memory limit set and the underlying table this CTAS pulls from reduced to only 10,000,000 records, the build completed in 15 seconds with no OOM kill. Great.
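For anyone who lands here later, the kind of batching I have in mind looks roughly like the sketch below. It's only a sketch: the modulo-based partitioning on `row_id`, the batch count of 16, and the output table name are placeholders I haven't settled on, and it assumes `row_id` is an integer.

```sql
-- Hypothetical batching: materialise one slice of groups at a time, so only
-- that slice's aggregate states are held in memory. Batch count (16) is a placeholder.
CREATE TABLE my_geometry_unioned AS
SELECT row_id,
       ST_MakeValid(ST_Union_agg(geom_column)) AS geometry_column,
       array_agg(property_id)                  AS property_ids
FROM my_geometry_table
WHERE row_id % 16 = 0              -- batch 0 of 16
GROUP BY row_id;

-- Remaining batches would be appended one at a time, e.g.:
-- INSERT INTO my_geometry_unioned
-- SELECT row_id, ST_MakeValid(ST_Union_agg(geom_column)), array_agg(property_id)
-- FROM my_geometry_table WHERE row_id % 16 = 1 GROUP BY row_id;
```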
Thank you for clearing up my path of approach!
Feel free to mark this as closed if you'd like, as I believe it's a design issue on my end at this point.