Spatial join very slow?
Hi,
I'm executing the following query on a table of 147K world city points (loaded from the Parquet file at https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/export/?disjunctive.cou_name_en&sort=name) to find pairs of cities that are within 5 degrees (about 555 km) of each other,
but it seemingly never finishes (v1.1.1) :-( PostGIS is done within a second...
SELECT
a.name AS city_a,
b.name AS city_b,
ST_Distance(
a.coordinates,
b.coordinates
) AS distance_degrees
FROM cities a
JOIN cities b ON a.geoname_id <> b.geoname_id
WHERE ST_Distance(
a.coordinates,
b.coordinates
) <= 5; -- Distance in degrees
I'm having similar issues when running st_within.
Basically it never finishes.
Did you find a solution for this?
Hey folks, I worked with Opendatasoft's support to give them feedback on how they were generating parquet files since querying them was unexpectedly slow. They have since made updates on how they create their parquet files--they even support exporting files with zstd compression.
Here's an example with the dataset that you provided featuring a zstd compressed parquet file.
curl -X 'GET' \
'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/parquet?parquet_compression=zstd' \
-H 'accept: */*'
Tried with the new dataset above, but the query is still very slow.
Hello!
Just had a quick look at this. It seems that the <> join condition is the culprit, as it turns the query plan from a HASH JOIN (fast) into a NESTED LOOP JOIN, which basically has to compare everything with everything (O(M*N) complexity). Now, why can't we do a hash join for inequality predicates? I'm not entirely sure, but I'll ask around.
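As a quick sketch (against the same cities table), you can see which join operator the planner picks with EXPLAIN:

```sql
-- An equality predicate lets DuckDB build a hash table on one side:
EXPLAIN SELECT count(*) FROM cities a JOIN cities b ON a.geoname_id = b.geoname_id;
-- plans as a HASH_JOIN

-- An inequality predicate can't be hashed, so the planner falls back
-- to comparing every row against every row:
EXPLAIN SELECT count(*) FROM cities a JOIN cities b ON a.geoname_id <> b.geoname_id;
-- plans as a nested-loop-style join
```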
In the meantime, here's an (ugly) workaround:
PRAGMA disabled_optimizers='join_order,filter_pushdown';
SELECT
a.name AS city_a,
b.name AS city_b,
ST_Distance(
a.coordinates,
b.coordinates
) AS distance_degrees
FROM cities a
JOIN cities b ON st_intersects(st_buffer(a.coordinates, 5), b.coordinates)
WHERE a.geoname_id != b.geoname_id;
By rewriting the distance predicate into an intersects test on the buffer, you trigger spatial's IE-join optimizer rule, which performs a range join on the bounding boxes. (This rewrite should probably be done automatically in the future.) Additionally, you have to disable the join_order and filter_pushdown optimizers, or DuckDB will push the WHERE predicate back into the join condition and turn it into a nested loop join before spatial has a chance to rewrite the query.
I know this isn't ideal, so I'll investigate more how we can make this work better.
With the above changes, the final query plan becomes:
Physical Plan

PROJECTION (city_a, city_b, distance_degrees) [~147043 rows]
└─ FILTER (geoname_id != geoname_id) [~147043 rows]
   └─ FILTER (ST_Intersects(ST_Buffer(coordinates, 5.0), coordinates)) [~147043 rows]
      └─ IE_JOIN (INNER) [~147043 rows]
            Conditions:
              ST_XMin(ST_Extent(ST_Buffer(coordinates, 5.0))) <= ST_XMax(ST_Extent(coordinates))
              ST_XMax(ST_Extent(ST_Buffer(coordinates, 5.0))) >= ST_XMin(ST_Extent(coordinates))
              ST_YMin(ST_Extent(ST_Buffer(coordinates, 5.0))) <= ST_YMax(ST_Extent(coordinates))
              ST_YMax(ST_Extent(ST_Buffer(coordinates, 5.0))) >= ST_YMin(ST_Extent(coordinates))
         ├─ SEQ_SCAN cities (Projections: coordinates, geoname_id, name) [~147043 rows]
         └─ SEQ_SCAN cities (Projections: coordinates, geoname_id, name) [~147043 rows]
Still takes a long time (v1.2.0), but I'm seeing the progress bar now, at 83%.
Yeah, the buffering probably adds quite a bit of overhead, but I mean, in essence you are comparing everything to everything; you're going to end up doing around ~21621643849 distance comparisons one way or another.
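For reference, that number is just 147043 squared, i.e. every row compared against every row; a quick sanity check in SQL:

```sql
SELECT 147043::BIGINT * 147043 AS comparisons;
-- 21621643849
```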
Running the following seems slightly faster:
SELECT
a.name AS city_a,
b.name AS city_b,
ST_Distance(
a.coordinates,
b.coordinates
) AS distance_degrees
FROM cities a
JOIN cities b ON st_distance(a.coordinates, b.coordinates) <= 5
WHERE a.geoname_id != b.geoname_id;
But even if I run it with a limit on the LHS (FROM cities LIMIT 500) with EXPLAIN ANALYZE, the majority of the time is spent in the filter:
FILTER (ST_Distance(coordinates, coordinates) <= 5.0) [1493466 rows, 14.23s]
└─ NESTED_LOOP_JOIN (INNER, Conditions: geoname_id != geoname_id) [73521000 rows, 0.13s]
I don't know exactly what PostGIS does to make it that much faster... is the table indexed? I guess it would be possible to create a temporary R-tree index to prune out distance checks in a custom operator, but I don't know if that's what the Postgres optimizer does. We could definitely do that with some work, though.
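For comparison, a sketch of the PostGIS side of this (the index name is just an example): ST_DWithin is index-aware in PostGIS, so a GiST index on the geometry column can prune candidate pairs, whereas a bare ST_Distance(...) <= 5 filter can't use the index.

```sql
-- PostGIS sketch: build a spatial (GiST) index on the geometry column
CREATE INDEX cities_coordinates_gist ON cities USING GIST (coordinates);

-- ST_DWithin can use the index to prune candidate pairs
-- before doing exact distance checks:
SELECT a.name, b.name
FROM cities a
JOIN cities b ON ST_DWithin(a.coordinates, b.coordinates, 5)
WHERE a.geoname_id <> b.geoname_id;
```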
@bertt You may want to take a look at some of these recommendations for GeoParquet--you may have better luck recreating those GeoParquet files around these suggestions.
https://github.com/opengeospatial/geoparquet/pull/254/files
Ok, the Postgres query plan seems to be basically equivalent to ours. Here I've created a table with 100k rows of ids and points:
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 100000) x(i);
EXPLAIN SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
Nested Loop (cost=0.00..166919453918.00 rows=3333300000 width=18)
Join Filter: (((a.id)::text <> (b.id)::text) AND (st_distance(a.pos, b.pos) <= '5'::double precision))
-> Seq Scan on t2 a (cost=0.00..1834.00 rows=100000 width=37)
-> Materialize (cost=0.00..3116.00 rows=100000 width=37)
-> Seq Scan on t2 b (cost=0.00..1834.00 rows=100000 width=37)
And it doesn't seem to finish either. Applying an index on the pos column doesn't change the plan. Is there anything else I'm missing to make this work on PostgreSQL?
If we instead generate a table with 10000 rows, it eventually finishes in about 14s, but I don't know how you got the full ~140k dataset to finish in a couple of seconds while I can't even get 10k under 10s.
(explain analyze output)
Nested Loop (cost=0.00..1668375393.00 rows=33330000 width=16) (actual time=0.043..14398.525 rows=59988 loops=1)
Join Filter: (((a.id)::text <> (b.id)::text) AND (st_distance(a.pos, b.pos) <= '5'::double precision))
Rows Removed by Join Filter: 99940012
-> Seq Scan on t2 a (cost=0.00..184.00 rows=10000 width=36) (actual time=0.016..0.681 rows=10000 loops=1)
-> Materialize (cost=0.00..234.00 rows=10000 width=36) (actual time=0.000..0.157 rows=10000 loops=10000)
-> Seq Scan on t2 b (cost=0.00..184.00 rows=10000 width=36) (actual time=0.003..1.001 rows=10000 loops=1)
Planning Time: 0.483 ms
Execution Time: 14399.476 ms
While the equivalent query takes duckdb about 19s on my machine.
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 10000) x(i);
SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
----
Query Profiling Information

EXPLAIN ANALYZE SELECT a.id AS city_a, b.id AS city_b, ST_Distance( a.pos, b.pos ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance( a.pos, b.pos ) <= 5;

Total Time: 19.70s

EXPLAIN_ANALYZE [0 rows, 0.00s]
└─ PROJECTION (city_a, city_b, distance_degrees) [59988 rows, 0.01s]
   └─ FILTER (ST_Distance(pos, pos) <= 5.0) [59988 rows, 19.38s]
      └─ NESTED_LOOP_JOIN (INNER, Conditions: id != id) [99990000 rows, 0.18s]
         ├─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
         └─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
Now, PostGIS being about 5s faster (for 10000 rows) is still significant, and my guess as to why is that PostGIS has implemented ST_Distance natively, while we still have to call out into GEOS's C API, allocating and copying a bunch of memory back and forth. There are a lot of optimizations we can do here to reduce the overhead, in this case in particular because all geometries are just points.
In fact, if we change the table slightly and create DuckDB POINT_2D values instead of GEOMETRY, using
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point2d(i, i) as pos FROM generate_series(1, 10000) x(i);
the query finishes in less than a second:
EXPLAIN ANALYZE SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
----
Query Profiling Information

EXPLAIN ANALYZE SELECT a.id AS city_a, b.id AS city_b, ST_Distance( a.pos, b.pos ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance( a.pos, b.pos ) <= 5;

Total Time: 0.949s

EXPLAIN_ANALYZE [0 rows, 0.00s]
└─ PROJECTION (city_a, city_b, distance_degrees) [59988 rows, 0.00s]
   └─ FILTER (ST_Distance(pos, pos) <= 5.0) [59988 rows, 0.61s]
      └─ NESTED_LOOP_JOIN (INNER, Conditions: id != id) [99990000 rows, 0.19s]
         ├─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
         └─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
So, conclusion: we should optimize the ST_Distance variant for GEOMETRY in the same way as we have done for POINT_2D.
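In the meantime, if your data is all points, one possible user-side workaround is to store the column as POINT_2D up front. A sketch against the cities table from earlier, assuming the GEOMETRY-to-POINT_2D cast applies (it should for plain point geometries):

```sql
-- Re-create the table with a POINT_2D column instead of GEOMETRY:
CREATE TABLE cities_2d AS
SELECT geoname_id, name, coordinates::POINT_2D AS coordinates
FROM cities;
```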
Alright, so I've been working quite a lot on optimizing ST_Distance and ST_DWithin.
As of DuckDB v1.3.1 (soon to be released), the following query:
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 10000) x(i);
SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
finishes in ~2s (down from ~20s) on my machine. If you can live without the a.id <> b.id check and rewrite it to use ST_DWithin, e.g.
SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON ST_DWithin(
a.pos,
b.pos,
5)
The spatial join operator is triggered and the query finishes instantly (e.g. 0.032s on my machine).
Now, the original query and dataset are still going to be planned as a BLOCKWISE_NL_JOIN, requiring 21621643849 distance computations, due to the comparison of the id columns taking precedence during join planning (basically, DuckDB always assumes it's better to join on a simple comparison than on an arbitrary function call). Improving the optimizer so that it is more spatial-aware is something we need to work on more in the future.
But similarly to the above query, if you can live without the id comparison and execute the original query with ST_DWithin, you get a SPATIAL_JOIN operator instead, which on my machine manages to finish the query in ~45s, producing 684269509 rows, which seems kind of reasonable?
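For completeness, a sketch of that rewrite for the original cities query (dropping the geoname_id inequality, so self-pairs at distance 0 come back as well):

```sql
SELECT
  a.name AS city_a,
  b.name AS city_b,
  ST_Distance(a.coordinates, b.coordinates) AS distance_degrees
FROM cities a
JOIN cities b ON ST_DWithin(a.coordinates, b.coordinates, 5);
```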
In theory we should be able to make use of more parallelism, but again it comes down to assumptions that DuckDB core makes, as the number of tasks spawned is currently decided only by the number of input rows, not by how costly they are to process. Hopefully that's something I can influence at a later point.