Avoid scanning entire Geodatabase file
It appears a full scan of the Geodatabase file is needed before a single record can be returned. The following query pulls one record from an 870 MB file and takes ~36 seconds to do so.
select FIPS_CODE FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable') LIMIT 1;
┌───────────┐
│ FIPS_CODE │
│ varchar │
├───────────┤
│ 11111 │
└───────────┘
Run Time (s): real 36.234 user 33.853635 sys 2.386451
Does DESCRIBE FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable') LIMIT 1; take the same amount of time?
We use GDAL's Arrow API to read datasets column by column, but I don't know how it functions or materializes the columns internally. I don't know much about how the GeoDatabase format works either, but from what I can tell from a link in the docs, the schema seems to be fixed, which should make the binding phase/DESCRIBE fast. I would try toying around with some of the driver-specific open/layer options to see if performance could be improved, for example as sketched below.
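Something along these lines (a sketch: the allowed_drivers and open_options named parameters are the ones I believe ST_Read accepts, and LIST_ALL_TABLES=NO is an OpenFileGDB open option, but double-check both against the docs):
-- Check whether binding/schema discovery alone is fast:
DESCRIBE SELECT * FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable');
-- Hint the driver and pass driver-specific open options:
SELECT FIPS_CODE
FROM st_read(
    'SF_Professional_AK.gdb/a00000011.gdbtable',
    allowed_drivers = ['OpenFileGDB'],
    open_options = ['LIST_ALL_TABLES=NO']
)
LIMIT 1;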
We could provide an alternative ST_Read implementation that uses GDAL's normal imperative API to read the data incrementally (although it all depends on the driver), but when I tested this way back it was much slower than the Arrow API, which is why we settled on it.
The describe statement is near-instant.
It would be nice to have default settings that avoid full-file scans, so that ST_Read on its own is more or less all the end user would ever need. I suspect this also relates to #67, which produces an OOM when it should really be streaming data out as it works its way through the file.
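For context, the kind of workload I have in mind is a straight export of the whole table, sketched below (the output path and format are just placeholders); ideally this would stream batches through rather than materializing the entire file.
COPY (
    SELECT * FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable')
) TO 'SF_Professional_AK.parquet' (FORMAT PARQUET);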
So after some digging I think I've found the problem in the Arrow API:
The default implementation uses GetNextFeature() internally to retrieve batches of up to 65,536 features (configurable with the MAX_FEATURES_IN_BATCH=num option).
We should absolutely expose the MAX_FEATURES_IN_BATCH option as an argument to ST_Read.
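Roughly what I have in mind on the SQL side (max_batch_size is a placeholder name here, not a committed API):
-- Hypothetical named argument forwarded to GDAL as MAX_FEATURES_IN_BATCH:
SELECT FIPS_CODE
FROM st_read(
    'SF_Professional_AK.gdb/a00000011.gdbtable',
    max_batch_size = 100
)
LIMIT 1;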
Okay, let's do that. I'll then have another go with this file and report back. I'm surprised 65K "features", whatever that is measuring, took 30+ seconds to go through.
Should be alleviated alongside #67 now that #69 is merged.
I rebuilt the master branch and I'm seeing a 3x speed-up, but it's still ~10 seconds.
$ GEN=ninja make release
mkdir -p build/release && \
cmake -G "Ninja" -DFORCE_COLORED_OUTPUT=1 -DENABLE_SANITIZER=OFF -DDUCKDB_OOT_EXTENSION_NAMES="spatial" -DDUCKDB_OOT_EXTENSION_SPATIAL_PATH="/home/mark/duckdb_spatial/spatial" -DDUCKDB_OOT_EXTENSION_SPATIAL_SHOULD_LINK="TRUE" -DDUCKDB_OOT_EXTENSION_SPATIAL_INCLUDE_PATH="/home/mark/duckdb_spatial/spatial/include" -DEXTENSION_STATIC_BUILD=1 -DCMAKE_BUILD_TYPE=Release -DEXTENSION_STATIC_BUILD=1 -S ./duckdb/ -B build/release && \
cmake --build build/release --config Release
-- git hash b00b93f0b1, version v0.7.1
-- Configuring done
-- Generating done
-- Build files have been written to: /home/mark/duckdb_spatial/build/release/dependencies
ninja: no work to do.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/mark/duckdb_spatial/build/release
[13/13] Linking CXX executable test/unittest
D select FIPS_CODE FROM st_read('SF_Professional_AK.gdb/a00000011.gdbtable') LIMIT 1;
100% ▕████████████████████████████████████████████████████████████▏
┌───────────┐
│ FIPS_CODE │
│ varchar │
├───────────┤
│ 02185 │
└───────────┘
Run Time (s): real 10.411 user 10.749468 sys 0.302476
I'm not very familiar with the geodatabase format, but I can try to test it out myself later; there may be an option or something we can pass to make it faster for this driver in particular.
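One quick sanity check is to confirm which GDB-capable driver the build ships and picks up; something like the query below should list the candidates (assuming st_drivers() and its short_name column behave as in the spatial docs, so verify against your build):
SELECT * FROM st_drivers() WHERE short_name ILIKE '%gdb%';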