Spatial join very slow?
Hi,
I'm executing the following query on a table of 147K world city points (loaded from the Parquet file at https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/export/?disjunctive.cou_name_en&sort=name) to find pairs of cities that are within 5 degrees (about 555 km) of each other,
but it seemingly never finishes (v1.1.1) :-( PostGIS is done within a second...
SELECT
a.name AS city_a,
b.name AS city_b,
ST_Distance(
a.coordinates,
b.coordinates
) AS distance_degrees
FROM cities a
JOIN cities b ON a.geoname_id <> b.geoname_id
WHERE ST_Distance(
a.coordinates,
b.coordinates
) <= 5; -- Distance in degrees
I'm having similar issues when running st_within.
Basically it never finishes.
Did you find a solution for this?
Hey folks, I worked with Opendatasoft's support to give them feedback on how they were generating parquet files since querying them was unexpectedly slow. They have since made updates on how they create their parquet files--they even support exporting files with zstd compression.
Here's an example with the dataset that you provided featuring a zstd compressed parquet file.
curl -X 'GET' \
'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/parquet?parquet_compression=zstd' \
-H 'accept: */*'
Tried with the new dataset above, but the query is still very slow.
Hello!
Just had a quick look at this. It seems that the <> join condition is the culprit, as it turns the query plan from a HASH JOIN (fast) into a NESTED LOOP JOIN, which basically has to compare everything with everything (O(M*N) complexity). Now, why can't we do a hash join for inequality predicates? I'm not entirely sure, but I'll ask around.
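As a quick sketch (against the same cities table), you can see which join operator the planner picks with EXPLAIN:

```sql
-- An equality predicate lets DuckDB build a hash table on one side:
EXPLAIN SELECT count(*) FROM cities a JOIN cities b ON a.geoname_id = b.geoname_id;
-- plans as a HASH_JOIN

-- An inequality predicate can't be hashed, so the planner falls back
-- to comparing every row against every row:
EXPLAIN SELECT count(*) FROM cities a JOIN cities b ON a.geoname_id <> b.geoname_id;
-- plans as a nested-loop-style join
```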
In the meantime, here's an (ugly) workaround:
PRAGMA disabled_optimizers='join_order,filter_pushdown';
SELECT
a.name AS city_a,
b.name AS city_b,
ST_Distance(
a.coordinates,
b.coordinates
) AS distance_degrees
FROM cities a
JOIN cities b ON st_intersects(st_buffer(a.coordinates, 5), b.coordinates)
WHERE a.geoname_id != b.geoname_id;
By rewriting the distance predicate into an intersects test on the buffer, you trigger spatial's IE-join optimizer rule, which performs a range join on the bounding boxes. (This rewrite should probably be done automatically in the future.) Additionally, you have to disable the join_order and filter_pushdown optimizers, or DuckDB will push the WHERE predicate back into the join condition and turn it into a nested loop join before spatial has a chance to rewrite the query.
I know this isn't ideal, so I'll investigate more how we can make this work better.
With the above changes, the final query plan becomes:
Physical Plan

PROJECTION (city_a, city_b, distance_degrees) [~147043 rows]
└─ FILTER (geoname_id != geoname_id) [~147043 rows]
   └─ FILTER (ST_Intersects(ST_Buffer(coordinates, 5.0), coordinates)) [~147043 rows]
      └─ IE_JOIN (INNER) [~147043 rows]
            Conditions:
              ST_XMin(ST_Extent(ST_Buffer(coordinates, 5.0))) <= ST_XMax(ST_Extent(coordinates))
              ST_XMax(ST_Extent(ST_Buffer(coordinates, 5.0))) >= ST_XMin(ST_Extent(coordinates))
              ST_YMin(ST_Extent(ST_Buffer(coordinates, 5.0))) <= ST_YMax(ST_Extent(coordinates))
              ST_YMax(ST_Extent(ST_Buffer(coordinates, 5.0))) >= ST_YMin(ST_Extent(coordinates))
         ├─ SEQ_SCAN cities (Projections: coordinates, geoname_id, name) [~147043 rows]
         └─ SEQ_SCAN cities (Projections: coordinates, geoname_id, name) [~147043 rows]
Still takes a long time (v1.2.0), but I'm seeing the progress bar now, at 83%.
Yeah, the buffering probably adds quite a bit of overhead, but I mean, in essence you are comparing everything to everything; you're going to end up doing around ~21621643849 distance comparisons one way or another.
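For reference, that number is just 147043 squared, i.e. every row compared against every row; a quick sanity check in SQL:

```sql
SELECT 147043::BIGINT * 147043 AS comparisons;
-- 21621643849
```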
Running the following seems slightly faster:
SELECT
a.name AS city_a,
b.name AS city_b,
ST_Distance(
a.coordinates,
b.coordinates
) AS distance_degrees
FROM cities a
JOIN cities b ON st_distance(a.coordinates, b.coordinates) <= 5
WHERE a.geoname_id != b.geoname_id;
But even if I run it with a limit on the LHS (FROM cities LIMIT 500) with EXPLAIN ANALYZE, the majority of the time is spent in the filter:
FILTER (ST_Distance(coordinates, coordinates) <= 5.0) [1493466 rows, 14.23s]
└─ NESTED_LOOP_JOIN (INNER, Conditions: geoname_id != geoname_id) [73521000 rows, 0.13s]
I don't know exactly what PostGIS does to make it that much faster... is the table indexed? I guess it would be possible to create a temporary R-tree index to prune out distance checks in a custom operator, but I don't know if that's what the Postgres optimizer does. We could definitely do that with some work, though.
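For comparison, a sketch of the PostGIS side of this (the index name is just an example): ST_DWithin is index-aware in PostGIS, so a GiST index on the geometry column can prune candidate pairs, whereas a bare ST_Distance(...) <= 5 filter can't use the index.

```sql
-- PostGIS sketch: build a spatial (GiST) index on the geometry column
CREATE INDEX cities_coordinates_gist ON cities USING GIST (coordinates);

-- ST_DWithin can use the index to prune candidate pairs
-- before doing exact distance checks:
SELECT a.name, b.name
FROM cities a
JOIN cities b ON ST_DWithin(a.coordinates, b.coordinates, 5)
WHERE a.geoname_id <> b.geoname_id;
```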
@bertt You may want to take a look at some of these recommendations for GeoParquet--you may have better luck recreating those GeoParquet files around these suggestions.
https://github.com/opengeospatial/geoparquet/pull/254/files
Ok, the Postgres query plan seems to be basically equivalent to ours. Here I've created a table with 100k rows of ids and points:
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 100000) x(i);
EXPLAIN SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
Nested Loop (cost=0.00..166919453918.00 rows=3333300000 width=18)
Join Filter: (((a.id)::text <> (b.id)::text) AND (st_distance(a.pos, b.pos) <= '5'::double precision))
-> Seq Scan on t2 a (cost=0.00..1834.00 rows=100000 width=37)
-> Materialize (cost=0.00..3116.00 rows=100000 width=37)
-> Seq Scan on t2 b (cost=0.00..1834.00 rows=100000 width=37)
And it doesn't seem to finish either. Applying an index on the pos column doesn't change the plan. Is there anything else I'm missing to make this work on PostgreSQL?
If we instead generate a table with 10000 rows, it eventually finishes in about 14s, but I don't know how you got the full ~140k dataset to finish in a couple of seconds while I can't even get 10k under 10s.
(explain analyze output)
Nested Loop (cost=0.00..1668375393.00 rows=33330000 width=16) (actual time=0.043..14398.525 rows=59988 loops=1)
Join Filter: (((a.id)::text <> (b.id)::text) AND (st_distance(a.pos, b.pos) <= '5'::double precision))
Rows Removed by Join Filter: 99940012
-> Seq Scan on t2 a (cost=0.00..184.00 rows=10000 width=36) (actual time=0.016..0.681 rows=10000 loops=1)
-> Materialize (cost=0.00..234.00 rows=10000 width=36) (actual time=0.000..0.157 rows=10000 loops=10000)
-> Seq Scan on t2 b (cost=0.00..184.00 rows=10000 width=36) (actual time=0.003..1.001 rows=10000 loops=1)
Planning Time: 0.483 ms
Execution Time: 14399.476 ms
While the equivalent query takes duckdb about 19s on my machine.
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 10000) x(i);
SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
----
Query Profiling Information

EXPLAIN ANALYZE SELECT a.id AS city_a, b.id AS city_b, ST_Distance( a.pos, b.pos ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance( a.pos, b.pos ) <= 5;

Total Time: 19.70s

EXPLAIN_ANALYZE [0 rows, 0.00s]
└─ PROJECTION (city_a, city_b, distance_degrees) [59988 rows, 0.01s]
   └─ FILTER (ST_Distance(pos, pos) <= 5.0) [59988 rows, 19.38s]
      └─ NESTED_LOOP_JOIN (INNER, Conditions: id != id) [99990000 rows, 0.18s]
         ├─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
         └─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
Now, PostGIS being about 5s faster (for 10000 rows) is still significant, and my guess as to why is that PostGIS has implemented ST_Distance natively, while we still have to call out into GEOS's C API, allocating and copying a bunch of memory back and forth. There are a lot of optimizations we can do here to reduce the overhead, in this case in particular because all geometries are just points.
In fact, if we change the table slightly and create DuckDB POINT_2D values instead of GEOMETRY, using
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point2d(i, i) as pos FROM generate_series(1, 10000) x(i);
the query finishes in less than a second:
EXPLAIN ANALYZE SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
----
Query Profiling Information

EXPLAIN ANALYZE SELECT a.id AS city_a, b.id AS city_b, ST_Distance( a.pos, b.pos ) AS distance_degrees FROM t2 a JOIN t2 b ON a.id <> b.id WHERE ST_Distance( a.pos, b.pos ) <= 5;

Total Time: 0.949s

EXPLAIN_ANALYZE [0 rows, 0.00s]
└─ PROJECTION (city_a, city_b, distance_degrees) [59988 rows, 0.00s]
   └─ FILTER (ST_Distance(pos, pos) <= 5.0) [59988 rows, 0.61s]
      └─ NESTED_LOOP_JOIN (INNER, Conditions: id != id) [99990000 rows, 0.19s]
         ├─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
         └─ TABLE_SCAN t2 (Sequential Scan; Projections: id, pos) [10000 rows, 0.00s]
So, conclusion: we should optimize the ST_Distance variant for GEOMETRY in the same way as we have done for POINT_2D.
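In the meantime, if your data is all points, one possible user-side workaround is to store the column as POINT_2D up front. A sketch against the cities table from earlier, assuming the GEOMETRY-to-POINT_2D cast applies (it should for plain point geometries):

```sql
-- Re-create the table with a POINT_2D column instead of GEOMETRY:
CREATE TABLE cities_2d AS
SELECT geoname_id, name, coordinates::POINT_2D AS coordinates
FROM cities;
```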
Alright, so I've been working quite a lot on optimizing ST_Distance and ST_DWithin.
As of DuckDB v1.3.1 (soon to be released), the following query:
CREATE TABLE t2 AS SELECT i::VARCHAR as id, st_point(i, i) as pos FROM generate_series(1, 10000) x(i);
SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON a.id <> b.id
WHERE ST_Distance(
a.pos,
b.pos
) <= 5;
finishes in ~2s (down from ~20s) on my machine. If you can live without the a.id <> b.id check and rewrite it to use ST_DWithin, e.g.
SELECT
a.id AS city_a,
b.id AS city_b,
ST_Distance(
a.pos,
b.pos
) AS distance_degrees
FROM t2 a JOIN t2 b ON ST_DWithin(
a.pos,
b.pos,
5)
The spatial join operator is triggered and the query finishes instantly (e.g. 0.032s on my machine).
Now, the original query and dataset are still going to be planned as a BLOCKWISE_NL_JOIN, requiring 21621643849 distance computations, due to the comparison of the id columns taking precedence during join planning (basically, DuckDB always assumes it's better to join on a simple comparison than on an arbitrary function call). Improving the optimizer so that it is more spatial-aware is something we need to work on more in the future.
But similarly to the above query, if you can live without the id comparison and execute the original query with ST_DWithin, you get a SPATIAL_JOIN operator instead, which on my machine manages to finish the query in ~45s, producing 684269509 rows, which seems kind of reasonable?
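For completeness, a sketch of that rewrite for the original cities query (dropping the geoname_id inequality, so self-pairs at distance 0 come back as well):

```sql
SELECT
  a.name AS city_a,
  b.name AS city_b,
  ST_Distance(a.coordinates, b.coordinates) AS distance_degrees
FROM cities a
JOIN cities b ON ST_DWithin(a.coordinates, b.coordinates, 5);
```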
In theory we should be able to make use of more parallelism, but again it comes down to assumptions that DuckDB core makes, as the number of tasks spawned is currently decided only by the number of input rows, not by how costly they are to process. Hopefully that's something I can influence at a later point.