duckdb_spatial
ST_READ very slow over network
TL;DR
- ST_Read took 216 s (184 MB downloaded) to create a table from a 100 MB GPKG file
- wget downloads the same file in 1.4 s
- R's sf::st_read took 4.5 s to download and read it
- I'm on a 1 Gb/s internet link
duckdb spatial time
I use the DuckDB CLI 1.2.1 on Linux, and using ST_Read over the network is really slow:
D EXPLAIN ANALYZE CREATE OR REPLACE TABLE iris2 AS FROM ST_READ('https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE CREATE OR REPLACE TABLE iris2 AS FROM ST_READ('https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg');
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 184.1 MiB ││
││ out: 0 bytes ││
││ #HEAD: 107 ││
││ #GET: 6317 ││
││ #PUT: 0 ││
││ #POST: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 216.90s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
wget time
Downloading directly with wget is really fast:
time wget -q https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg
wget -q 0,04s user 0,15s system 17% cpu 1,140 total
sf::st_read time
I tried the sf package for comparison and it's quite fast too:
tictoc::tic()
sf::st_read('https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg')
#> Reading layer `iris_franceentiere_2024' from data source
#> `https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg'
#> using driver `GPKG'
#> Simple feature collection with 16383 features and 3 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -63.15332 ymin: -21.38963 xmax: 55.83665 ymax: 51.0668
#> Geodetic CRS: WGS 84
tictoc::toc()
#> 4.921 sec elapsed
Created on 2025-04-19 with reprex v2.1.1
is httpfs the culprit?
I tried httpfs on a Parquet file and it's fast there, as expected:
EXPLAIN ANALYZE CREATE TABLE test2 AS FROM read_parquet('https://static.data.gouv.fr/resources/recensement-de-la-population-fichiers-detail-logements-ordinaires-en-2020-1/20231023-123618/fd-logemt-2020.parquet');
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 481.0 MiB ││
││ out: 0 bytes ││
││ #HEAD: 1 ││
││ #GET: 27 ││
││ #PUT: 0 ││
││ #POST: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 8.69s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
I got the same behaviour with a different GPKG file hosted on a private S3 bucket, as sketched below.
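For illustration only (the real bucket, credentials and file name are not in this issue, so the ones below are placeholders): the private case goes through DuckDB's secrets manager and httpfs, with ST_Read pointed at an s3:// URL.

-- Hypothetical secret; KEY_ID, SECRET, REGION and the bucket/path are placeholders.
CREATE SECRET private_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'eu-west-3'
);
-- Same ST_Read pattern as above, but reading from a private bucket via httpfs.
CREATE OR REPLACE TABLE iris_private AS
FROM ST_Read('s3://my-private-bucket/some_layer.gpkg');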
Hello!
GPKG isn't really a cloud-optimized format, which means a reader has to make a lot of small random range requests and "jump around" when scanning a GPKG file remotely. You can see this in the ~6,000 GET requests shown in the EXPLAIN ANALYZE output above. In GDAL this is alleviated by using a caching filesystem for remote reads, which reads larger ranges than strictly necessary but keeps previously read ranges cached in memory, so subsequent reads don't always have to go over HTTP again.
However, because st_read plugs into DuckDB's file system by default, there is no range caching. You can work around this by forcing the use of GDAL's remote filesystem, prefixing the URL with /vsicurl/, but then you won't be able to use DuckDB's secrets / S3 extension to manage authentication, since everything goes through GDAL instead.
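A minimal sketch of that workaround, using the same public URL from the issue (not timed here): GDAL's /vsicurl/ handler performs the HTTP reads and its own range caching instead of httpfs.

-- Force GDAL's remote filesystem (with range caching) by prefixing /vsicurl/.
-- DuckDB secrets are bypassed; GDAL handles the HTTP access itself.
CREATE OR REPLACE TABLE iris2 AS
FROM ST_Read('/vsicurl/https://creacartes.s3.eu-west-3.amazonaws.com/2024/franceentiere/iris_franceentiere_2024.gpkg');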
With that said, there is work in progress on adding remote file caching to DuckDB itself as well (see e.g. https://github.com/duckdb/duckdb/pull/16463), but it hasn't made its way into spatial yet, and there's still a lot of work to do before that happens.