Query performance is slower after using SPATIAL_JOIN
With the release of DuckDB v1.3.0, performance improvements for the spatial extension were announced, so I ran some tests to evaluate them.
However, in my test scenario, queries are actually faster when SPATIAL_JOIN is not used than when it is.
❯ duckdb --version
v1.3.0 71c5c07cdd
❯ uname -a
Darwin mini.local 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:41 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8103 arm64
Here is my test case:
pre.shp is an isoline dataset for precipitation values ranging from 0.1 to 10.
grid.shp is a point dataset.
Test files are attached. data.zip
load spatial;
create table pre(
id integer primary key,
geom geometry
);
create sequence seq_pre_id start 1;
insert into pre(id,geom) select nextval('seq_pre_id'),geom from st_read('pre.shp');
create table grid(
id integer primary key,
geom geometry
);
create sequence seq_grid_id start 1;
insert into grid(id,geom) select nextval('seq_grid_id'),geom from st_read('grid.shp');
.timer on
select count(*) from
pre a,grid b where ST_Intersects(b.geom,a.geom);
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 228909 │
└──────────────┘
Run Time (s): real 14.055 user 45.078702 sys 0.101822
explain
select count(*) from
pre a,grid b where ST_Intersects(b.geom,a.geom);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ count_star() │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SPATIAL_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ │
│ Conditions: ├──────────────┐
│ ST_Intersects(geom, geom) │ │
│ │ │
│ ~512260 Rows │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: grid ││ Table: pre │
│ Type: Sequential Scan ││ Type: Sequential Scan │
│ Projections: geom ││ Projections: geom │
│ ││ │
│ ~512260 Rows ││ ~241 Rows │
└───────────────────────────┘└───────────────────────────┘
Run Time (s): real 0.005 user 0.001407 sys 0.001659
pragma disabled_optimizers = 'extension';
select count(*) from
pre a,grid b where ST_Intersects(b.geom,a.geom);
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 228909 │
└──────────────┘
Run Time (s): real 10.304 user 55.053951 sys 0.247437
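For scale, a blockwise NL join has to evaluate the predicate once per pair in the cross product. A quick back-of-the-envelope check using the row estimates from the plan above (the exact counts may differ slightly):

```python
# Row estimates taken from the EXPLAIN output; without an index, the join
# must check ST_Intersects for every (pre, grid) candidate pair.
pre_rows = 241
grid_rows = 512_260

candidate_pairs = pre_rows * grid_rows
print(candidate_pairs)  # 123454660 predicate evaluations
```

So both plans are doing on the order of 10^8 ST_Intersects calls here, which is why per-call overhead dominates the runtime.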
explain
select count(*) from
pre a,grid b where ST_Intersects(b.geom,a.geom);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ count_star() │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BLOCKWISE_NL_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ ├──────────────┐
│ Condition: │ │
│ ST_Intersects(geom, geom) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: grid ││ Table: pre │
│ Type: Sequential Scan ││ Type: Sequential Scan │
│ Projections: geom ││ Projections: geom │
│ ││ │
│ ~512260 Rows ││ ~241 Rows │
└───────────────────────────┘└───────────────────────────┘
Run Time (s): real 0.003 user 0.001087 sys 0.001004
Hello! Thanks for reporting! My first thought was that the build side (pre) was not large enough to benefit from indexing, but changing the capacity of the R-tree nodes does not make a difference. I see the same behavior as you with the released DuckDB version, but when I build and link duckdb-spatial locally the spatial join is faster, although it also seems like the blockwise-nl-join uses more parallelism. I'm not sure why this is but will dig in deeper.
Ok, I think I figured it out. In the blockwise-nl-join, we basically execute a cross product and then evaluate the predicate on each pair. Because DuckDB is a vectorized engine, we're being clever and emit pairs of constant and flat vectors. Spatial has a special path for evaluating GEOS-based predicates when one of the inputs is a constant vector, which lets us avoid deserializing the same geometry multiple times.

However, in the spatial join we try to be even more clever and evaluate predicates in batches: instead of emitting constants (one for each entry in the probe side), we emit dictionary vectors, because we probe multiple left-hand-side values in one go. But there is no special handling of dictionary vectors in our spatial function path, so we end up deserializing the same few entries (as few as 1 or 2) a total of 2048 times anyway. That puts a lot of pressure on the memory allocator (GEOS is not very optimized for vectorized execution) and thus hurts both performance and parallelism.
The solution here is to implement a general deserialization cache for all our GEOS functions so that we don't deserialize in the vectorized hot loop. I've experimented a bit with this before, and it should benefit other use cases beyond joins, so I'll see what I can do.
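As an illustration only (a toy Python model, not spatial's actual C++ code), the effect described above can be sketched like this: a dictionary vector references a handful of distinct serialized geometries 2048 times, and without special handling each reference triggers a fresh deserialization, while a memoizing cache keyed on the serialized blob would deserialize each distinct geometry only once:

```python
from functools import lru_cache

# Hypothetical stand-ins for spatial's internals: "blobs" are serialized
# geometries, and a dictionary vector is a selection over a few shared blobs.
blobs = ["POLY_A", "POLY_B"]          # only 2 distinct geometries...
sel = [i % 2 for i in range(2048)]    # ...referenced 2048 times in one vector

deserialize_calls = 0

def deserialize(blob):
    # In spatial this would build a costly GEOS geometry from the blob.
    global deserialize_calls
    deserialize_calls += 1
    return ("geom", blob)

# Naive path: the function sees 2048 entries and deserializes every one.
naive = [deserialize(blobs[i]) for i in sel]
print(deserialize_calls)  # 2048

# Cached path: memoize on the blob, as a deserialization cache would,
# so each distinct geometry is built once per vector.
deserialize_calls = 0
cached_deserialize = lru_cache(maxsize=None)(deserialize)
cached = [cached_deserialize(blobs[i]) for i in sel]
print(deserialize_calls)  # 2
```

The real fix would of course live inside the GEOS function path in C++, but the ratio (2048 deserializations down to the number of distinct entries) is the point.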