Avoid scanning entire Geodatabase file
It appears a full scan of the Geodatabase file is needed before a single record can be returned. The following query pulls one record from an 870 MB file and takes ~36 seconds to do so.
select FIPS_CODE FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable') LIMIT 1;
┌───────────┐
│ FIPS_CODE │
│ varchar │
├───────────┤
│ 11111 │
└───────────┘
Run Time (s): real 36.234 user 33.853635 sys 2.386451
Does DESCRIBE FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable') LIMIT 1; take the same amount of time?
We use GDAL's Arrow API to read datasets column by column, but I don't know how it functions or materializes the columns internally. I don't know much about how the GeoDatabase format works either, but from what I can tell from a link in the docs, the schema seems to be fixed, which should make the binding phase/DESCRIBE fast. I would try toying around with some of the driver-specific open/layer options to see if performance could be improved, for example as sketched below.
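Something along these lines (a sketch: the allowed_drivers and open_options named parameters are the ones I believe ST_Read accepts, and LIST_ALL_TABLES=NO is an OpenFileGDB open option, but double-check both against the docs):
-- Check whether binding/schema discovery alone is fast:
DESCRIBE SELECT * FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable');
-- Hint the driver and pass driver-specific open options:
SELECT FIPS_CODE
FROM st_read(
    'SF_Professional_AK.gdb/a00000011.gdbtable',
    allowed_drivers = ['OpenFileGDB'],
    open_options = ['LIST_ALL_TABLES=NO']
)
LIMIT 1;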
We could provide an alternative ST_Read implementation that uses GDAL's normal imperative API to read the data incrementally (although it all depends on the driver), but when I tested this way back it was much slower than the Arrow API, which is why we settled on it.
The describe statement is near-instant.
It would be nice to have default settings that avoid full-file scans, so that ST_Read on its own is more or less all the end user would ever need. I suspect this also relates to #67, which produces an OOM when it should really be streaming data out as it works its way through the file.
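For context, the kind of workload I have in mind is a straight export of the whole table, sketched below (the output path and format are just placeholders); ideally this would stream batches through rather than materializing the entire file.
COPY (
    SELECT * FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable')
) TO 'SF_Professional_AK.parquet' (FORMAT PARQUET);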
So after some digging I think I've found the problem in the Arrow API:
The default implementation uses GetNextFeature() internally to retrieve batches of up to 65,536 features (configurable with the MAX_FEATURES_IN_BATCH=num option).
We should absolutely expose the MAX_FEATURES_IN_BATCH option as an argument to ST_Read.
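Roughly what I have in mind on the SQL side (max_batch_size is a placeholder name here, not a committed API):
-- Hypothetical named argument forwarded to GDAL as MAX_FEATURES_IN_BATCH:
SELECT FIPS_CODE
FROM st_read(
    'SF_Professional_AK.gdb/a00000011.gdbtable',
    max_batch_size = 100
)
LIMIT 1;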
Okay, let's do that. I'll then have another go with this file and report back. I'm surprised 65K "features", whatever that is measuring, took 30+ seconds to go through.
Should be alleviated alongside #67 now that #69 is merged.
I rebuilt the master branch and I'm seeing a 3x speed-up, but it's still ~10 seconds.
$ GEN=ninja make release
mkdir -p build/release && \
cmake -G "Ninja" -DFORCE_COLORED_OUTPUT=1 -DENABLE_SANITIZER=OFF -DDUCKDB_OOT_EXTENSION_NAMES="spatial" -DDUCKDB_OOT_EXTENSION_SPATIAL_PATH="/home/mark/duckdb_spatial/spatial" -DDUCKDB_OOT_EXTENSION_SPATIAL_SHOULD_LINK="TRUE" -DDUCKDB_OOT_EXTENSION_SPATIAL_INCLUDE_PATH="/home/mark/duckdb_spatial/spatial/include" -DEXTENSION_STATIC_BUILD=1 -DCMAKE_BUILD_TYPE=Release -DEXTENSION_STATIC_BUILD=1 -S ./duckdb/ -B build/release && \
cmake --build build/release --config Release
-- git hash b00b93f0b1, version v0.7.1
-- Configuring done
-- Generating done
-- Build files have been written to: /home/mark/duckdb_spatial/build/release/dependencies
ninja: no work to do.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/mark/duckdb_spatial/build/release
[13/13] Linking CXX executable test/unittest
D select FIPS_CODE FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable') LIMIT 1;
100% ▕████████████████████████████████████████████████████████████▏
┌───────────┐
│ FIPS_CODE │
│ varchar │
├───────────┤
│ 02185 │
└───────────┘
Run Time (s): real 10.411 user 10.749468 sys 0.302476
I'm not very familiar with the geodatabase format, but I can try to test it out myself later; there may be an option or something we can pass to make it faster for this driver in particular.
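One quick sanity check is to confirm which GDB-capable driver the build ships and picks up; something like the query below should list the candidates (assuming st_drivers() and its short_name column behave as in the spatial docs, so verify against your build):
SELECT * FROM st_drivers() WHERE short_name ILIKE '%gdb%';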