
Performance improvements for Geodatabase imports?


I used the latest master branch to convert a 21 GB Geodatabase fileset into Parquet with some light enrichment. It took almost exactly 7 hours on a system with 64 cores and 64 GB of RAM, which works out to a read speed of ~871 KB/s. Is there much that could be done to optimise for this format? Most of the datasets I process for clients are in this format.

LOAD parquet;
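-- (assumption, not shown in the original: the spatial and h3 extensions are also
--  loaded in this session, since ST_* and h3_latlng_to_cell are used below)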

COPY (SELECT * EXCLUDE(GEOMETRY_BIN),

             printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  7)::bigint) as h3_7,

             printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  8)::bigint) as h3_8,

             printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  9)::bigint) as h3_9,

             ST_AsHEXWKB(GEOMETRY_BIN::GEOMETRY)::TEXT AS geom

      FROM st_read('test.gdb/a00000011.gdbtable'))
TO 'test.gdb/a00000011.pq' (FORMAT 'PARQUET',
                            CODEC  'Snappy');

marklit avatar May 08 '23 10:05 marklit

Is it equally slow if you were to use ogr2ogr to perform a similar conversion directly through GDAL, without DuckDB? E.g. any conversion from Geodatabase to any other format, without any real data transformation. Does the performance scale linearly with the file size?
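
As a quick baseline, something like this would do a plain conversion with no data transformation (just a sketch; the GeoPackage output and the paths are placeholders, not from this thread):

$ time ogr2ogr \
         -f GPKG \
         baseline.gpkg \
         test.gdb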

Again, I'm not too familiar with the Geodatabase format, but is there an attribute index present (an ".atx" file)?

Since we have no idea how the underlying GDAL driver works, it's really hard to diagnose what's going on. From the docs there don't really seem to be many knobs we can turn when opening a Geodatabase file for reading. The only thing I can see making a difference is changing up the access pattern through GDAL, like not using the Arrow API or always reading through the file sequentially, but that would be a lot of work to change, and again, there's no telling when one or the other would be more efficient.

Maxxen avatar May 08 '23 10:05 Maxxen

Also, how many columns/rows does the gdb contain? Have you tried setting st_read(max_batch_size = <INTEGER>) to something lower than the default, like 1024/512?
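
For example (a sketch; 512 is just an illustrative value, and the path is the one from your original query):

SELECT count(*)
FROM st_read('test.gdb/a00000011.gdbtable', max_batch_size = 512);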

Maxxen avatar May 08 '23 11:05 Maxxen

Give me a day or so to put together an easy-to-repeat example with a PostGIS script for a performance comparison.

marklit avatar May 08 '23 11:05 marklit

Download gNATSGO_CA.zip from https://nrcs.app.box.com/v/soils/folder/191785692827 (ignore the login / signup box, the download will start anyway). The ZIP is 694 MB. a00000020.gdbtable within that ZIP is 2.2 GB.

$ unzip -ojq gNATSGO_CA.zip

$ createdb usda
$ echo 'DROP SCHEMA IF EXISTS "usda" CASCADE;' | psql usda

The following completed in 2m52.212s.

$ time ogr2ogr \
        --config PG_USE_COPY YES \
        -f PGDump \
        /vsistdout/ \
        a00000020.gdbtable \
        -lco SCHEMA=usda \
        -skipfailures \
        -append \
        -update \
    | psql usda

It was run on:

  • Ubuntu 20 LTS
  • PG 15.1 (Ubuntu 15.1-1.pgdg20.04+1)
  • GDAL 3.7.0dev-13d9e0bfa8, released 2023/01/19 (debug build) (from ogr2ogr --version)

$ psql usda
set search_path='usda';
\dts+
                                             List of relations
 Schema |         Name         |   Type   | Owner | Persistence | Access method |    Size    | Description 
--------+----------------------+----------+-------+-------------+---------------+------------+-------------
 usda   | cointerp             | table    | mark  | permanent   | heap          | 2727 MB    | 
 usda   | cointerp_ogc_fid_seq | sequence | mark  | permanent   |               | 8192 bytes | 
(2 rows)

The following completed in 45.315s. That's 3.8x faster, but it's reading at ~50 MB/s on a drive that can do many GB/s. There are also 64 cores available on that machine, but DuckDB never reports more than 200% CPU usage in htop.

$ ~/duckdb_spatial/build/release/duckdb -unsigned usda.duckdb
.timer on

CREATE OR REPLACE TABLE a00000011 AS 
    SELECT * FROM st_read('a00000020.gdbtable');
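
For reference, the thread count can be checked and raised from inside DuckDB (a sketch, not part of the timed run above; SET threads is generic DuckDB, not something specific to the spatial extension):

SELECT current_setting('threads');
SET threads TO 64;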

This is the table produced; it contains 14,819,625 rows, and the DuckDB file is 444 MB in size.

D .schema --indent
CREATE TABLE a00000011(
  cokey VARCHAR,
  mrulekey VARCHAR,
  mrulename VARCHAR,
  seqnum SMALLINT,
  rulekey VARCHAR,
  rulename VARCHAR,
  ruledepth SMALLINT,
  interphr DOUBLE,
  interphrc VARCHAR,
  nullpropdatabool VARCHAR,
  defpropdatabool VARCHAR,
  incpropdatabool VARCHAR,
  cointerpkey VARCHAR
);

marklit avatar May 08 '23 12:05 marklit