duckdb_spatial
ST_READ very slow over network
TL;DR
- ST_Read took 216 s (184 MB downloaded) to create a table from a 100 MB GPKG file
- wget downloads the same file in 1.4 s
- R's sf::st_read took 4.5 s to download and read it
- I'm on a 1 Gb/s internet link
duckdb spatial time
I use the DuckDB CLI 1.2.1 on Linux, and using ST_Read over the network is really slow:
D EXPLAIN ANALYZE CREATE OR REPLACE TABLE iris2 AS FROM ST_READ('https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE CREATE OR REPLACE TABLE iris2 AS FROM ST_READ('https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg');
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 184.1 MiB ││
││ out: 0 bytes ││
││ #HEAD: 107 ││
││ #GET: 6317 ││
││ #PUT: 0 ││
││ #POST: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 216.90s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
wget time
Downloading directly with wget is really fast:
time wget -q https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg
wget -q 0,04s user 0,15s system 17% cpu 1,140 total
sf::st_read time
I tried the sf package for comparison and it's quite fast too:
tictoc::tic()
sf::st_read('https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg')
#> Reading layer `iris_franceentiere_2024' from data source
#> `https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg'
#> using driver `GPKG'
#> Simple feature collection with 16383 features and 3 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -63.15332 ymin: -21.38963 xmax: 55.83665 ymax: 51.0668
#> Geodetic CRS: WGS 84
tictoc::toc()
#> 4.921 sec elapsed
Created on 2025-04-19 with reprex v2.1.1
is httpfs the culprit?
I tried httpfs on a Parquet file and it's fast there, as expected:
EXPLAIN ANALYZE CREATE TABLE test2 AS FROM read_parquet('https://static.data.gouv.fr/resources/recensement-de-la-population-fichiers-detail-logements-ordinaires-en-2020-1/20231023-123618/fd-logemt-2020.parquet');
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 481.0 MiB ││
││ out: 0 bytes ││
││ #HEAD: 1 ││
││ #GET: 27 ││
││ #PUT: 0 ││
││ #POST: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 8.69s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
I got the same behaviour with a different GPKG file hosted on a private S3 bucket, as sketched below.
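For illustration only (the real bucket, credentials and file name are not in this issue, so the ones below are placeholders): the private case goes through DuckDB's secrets manager and httpfs, with ST_Read pointed at an s3:// URL.

-- Hypothetical secret; KEY_ID, SECRET, REGION and the bucket/path are placeholders.
CREATE SECRET private_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'eu-west-3'
);
-- Same ST_Read pattern as above, but reading from a private bucket via httpfs.
CREATE OR REPLACE TABLE iris_private AS
FROM ST_Read('s3://my-private-bucket/some_layer.gpkg');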
Hello!
GPKG isn't really a cloud-optimized format, which means a reader has to make a lot of small random range requests and "jump around" when scanning a GPKG file remotely. You can see this in the ~6,000 GET requests shown in the EXPLAIN ANALYZE output above. In GDAL this is alleviated by using a caching filesystem for remote reads, which reads larger ranges than strictly necessary but keeps previously read ranges cached in memory, so subsequent reads don't always have to go over HTTP again.
However, because st_read plugs into DuckDB's file system by default, there is no range caching. You can work around this by forcing the use of GDAL's remote filesystem, prefixing the URL with /vsicurl/, but then you won't be able to use DuckDB's secrets / S3 extension to manage authentication, since everything goes through GDAL instead.
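A minimal sketch of that workaround, using the same public URL from the issue (not timed here): GDAL's /vsicurl/ handler performs the HTTP reads and its own range caching instead of httpfs.

-- Force GDAL's remote filesystem (with range caching) by prefixing /vsicurl/.
-- DuckDB secrets are bypassed; GDAL handles the HTTP access itself.
CREATE OR REPLACE TABLE iris2 AS
FROM ST_Read('/vsicurl/https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg');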
With that said, there is work in progress on adding remote file caching to DuckDB itself as well (see e.g. https://github.com/duckdb/duckdb/pull/16463), but it hasn't made its way into spatial yet, and there's still a lot of work to do before that happens.