duckdb_spatial

Performance improvements for Geodatabase imports?

Open marklit opened this issue 1 year ago • 4 comments

I used the latest master branch to convert a 21 GB Geodatabase fileset into Parquet with some light enrichment. It took almost exactly 7 hours on a system with 64 cores and 64 GB of RAM, which works out to a read speed of ~871 KB/s. Is there much that could be done to optimise for this format? Most of the datasets I process for clients are in this format.
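
For reference, the throughput figure can be reproduced with a quick calculation. This is only a rough sketch: it assumes binary units (1 GB = 1024 * 1024 KB) and a runtime of exactly 7 hours, neither of which is stated precisely in the post, so it approximates the quoted ~871 KB/s rather than matching it exactly. The function name is made up for illustration.

```python
# Rough throughput check; size and runtime are the rounded figures from the post.
DATASET_GB = 21      # size of the Geodatabase fileset
RUNTIME_HOURS = 7    # assumption: treated as exactly 7 hours


def throughput_kb_per_s(size_gb: float, hours: float) -> float:
    """Average read speed in KB/s, assuming 1 GB = 1024 * 1024 KB."""
    size_kb = size_gb * 1024 * 1024
    seconds = hours * 3600
    return size_kb / seconds


rate = throughput_kb_per_s(DATASET_GB, RUNTIME_HOURS)
print(f"{rate:.0f} KB/s")  # prints "874 KB/s"
```

The small gap from ~871 KB/s is consistent with the dataset size and runtime both being rounded.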

LOAD parquet;

COPY (SELECT * EXCLUDE(GEOMETRY_BIN),

             printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  7)::bigint) as h3_7,

             printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  8)::bigint) as h3_8,

             printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  ST_X(ST_CENTROID(GEOMETRY_BIN::GEOMETRY)),
                  9)::bigint) as h3_9,

             ST_AsHEXWKB(GEOMETRY_BIN::GEOMETRY)::TEXT AS geom

      FROM st_read('test.gdb/a00000011.gdbtable'))
TO 'test.gdb/a00000011.pq' (FORMAT 'PARQUET',
                            CODEC  'Snappy');

marklit May 08 '23 10:05