duckdb_spatial
Performance improvements for Geodatabase imports?
I used the latest master branch to convert a 21 GB Geodatabase fileset into Parquet with some light enrichment. It took almost exactly 7 hours on a system with 64 cores and 64 GB of RAM, which works out to a read speed of only ~871 KB/s. Is there much that could be done to optimise for this format? Most of the datasets I process for clients are Geodatabases.
-- spatial provides st_read and the ST_* functions; the h3 community
-- extension provides h3_latlng_to_cell (both assumed to be installed)
LOAD spatial;
LOAD h3;
LOAD parquet;

COPY (
    SELECT * EXCLUDE (GEOMETRY_BIN),
           printf('%x',
                  h3_latlng_to_cell(
                      ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                      ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                      7)::bigint) AS h3_7,
           printf('%x',
                  h3_latlng_to_cell(
                      ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                      ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                      8)::bigint) AS h3_8,
           printf('%x',
                  h3_latlng_to_cell(
                      ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                      ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                      9)::bigint) AS h3_9,
           ST_AsHEXWKB(GEOMETRY_BIN::GEOMETRY)::TEXT AS geom
    FROM st_read('test.gdb/a00000011.gdbtable'))
TO 'test.gdb/a00000011.pq' (FORMAT 'PARQUET',
                            CODEC 'Snappy');
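As an aside, the enrichment could presumably be written to compute the centroid only once per row instead of six times. A rough, untested sketch of that variant (same output columns) is below, though I doubt the redundant centroid work is the main bottleneck:

-- untested sketch: compute the centroid coordinates once per row in a
-- subquery and reuse them for all three H3 resolutions
COPY (
    SELECT * EXCLUDE (GEOMETRY_BIN, lat, lng),
           printf('%x', h3_latlng_to_cell(lat, lng, 7)::bigint) AS h3_7,
           printf('%x', h3_latlng_to_cell(lat, lng, 8)::bigint) AS h3_8,
           printf('%x', h3_latlng_to_cell(lat, lng, 9)::bigint) AS h3_9,
           ST_AsHEXWKB(GEOMETRY_BIN::GEOMETRY)::TEXT AS geom
    FROM (SELECT *,
                 ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)) AS lat,
                 ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)) AS lng
          FROM st_read('test.gdb/a00000011.gdbtable')) t)
TO 'test.gdb/a00000011.pq' (FORMAT 'PARQUET', CODEC 'Snappy');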
Is it equally slow if you were to use ogr2ogr to perform a similar conversion directly through gdal without duckdb? e.g. any conversion from Geodatabase to any other format, without any real data transformation. Does the performance scale linearly with file size?
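For example, something along these lines (the GeoPackage target and paths are just placeholders; any output format without a real transformation would do):

# hypothetical baseline: plain Geodatabase -> GeoPackage copy, no enrichment
$ time ogr2ogr -f GPKG /tmp/test_baseline.gpkg test.gdb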
Again, I'm not too familiar with the Geodatabase format, but is there an attribute index present (an ".atx" file)?
Since we have no insight into how the underlying GDAL driver works, it's really hard to diagnose what's going on. From the docs there don't seem to be many knobs we can turn when opening a Geodatabase file for reading. The only thing I can see making a difference is changing up the access pattern through GDAL, like not using the Arrow API or always reading through the file sequentially, but that would be a lot of work to change, and again, there's no telling when one or the other would be more efficient.
Also, how many columns/rows does the gdb contain? Have you tried setting st_read(max_batch_size = <INTEGER>) to something lower than the default, like 1024 or 512?
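For example, something like this (512 is just an arbitrary value to try):

-- hypothetical test: time a plain scan with a smaller batch size
SELECT count(*)
FROM st_read('test.gdb/a00000011.gdbtable', max_batch_size = 512);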
Give me a day or so to put together an easy-to-repeat example with a PostGIS script for a performance comparison.
Download gNATSGO_CA.zip from https://nrcs.app.box.com/v/soils/folder/191785692827 (ignore the login / signup box, the download will start anyway). The ZIP is 694 MB. a00000020.gdbtable within that ZIP is 2.2 GB.
$ unzip -ojq gNATSGO_CA.zip
$ createdb usda
$ echo 'DROP SCHEMA IF EXISTS "usda" CASCADE;' | psql usda
The following completed in 2m52.212s.
$ time ogr2ogr \
--config PG_USE_COPY YES \
-f PGDump \
/vsistdout/ \
a00000020.gdbtable \
-lco SCHEMA=usda \
-skipfailures \
-append \
-update \
| psql usda
It was run on:
- Ubuntu 20.04 LTS
- PG 15.1 (Ubuntu 15.1-1.pgdg20.04+1)
$ ogr2ogr --version
GDAL 3.7.0dev-13d9e0bfa8, released 2023/01/19 (debug build)
$ psql usda
set search_path='usda';
\dts+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+----------------------+----------+-------+-------------+---------------+------------+-------------
usda | cointerp | table | mark | permanent | heap | 2727 MB |
usda | cointerp_ogc_fid_seq | sequence | mark | permanent | | 8192 bytes |
(2 rows)
The following completed in 45.315s. That's 3.8x faster than ogr2ogr, but 2.2 GB in ~45 seconds still only works out to ~50 MB/s on a drive that can do many GB/s. There are also 64 cores available on that machine, but DuckDB never reports more than 200% CPU usage in htop.
$ ~/duckdb_spatial/build/release/duckdb -unsigned usda.duckdb
.timer on
CREATE OR REPLACE TABLE a00000011 AS
SELECT * FROM st_read('a00000020.gdbtable');
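As a sanity check that a low thread setting isn't the limiter (this wasn't part of the timing above), DuckDB's configured thread count can be inspected and raised explicitly:

-- check the configured thread count and raise it if needed
SELECT current_setting('threads');
SET threads = 64;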
This is the table produced; it contains 14,819,625 rows, and the DuckDB file is 444 MB in size.
D .schema --indent
CREATE TABLE a00000011(
cokey VARCHAR,
mrulekey VARCHAR,
mrulename VARCHAR,
seqnum SMALLINT,
rulekey VARCHAR,
rulename VARCHAR,
ruledepth SMALLINT,
interphr DOUBLE,
interphrc VARCHAR,
nullpropdatabool VARCHAR,
defpropdatabool VARCHAR,
incpropdatabool VARCHAR,
cointerpkey VARCHAR
);