pyogrio icon indicating copy to clipboard operation
pyogrio copied to clipboard

ENH: increase write performance of geopackage

Open theroggy opened this issue 2 years ago • 9 comments

I noticed some weird changes between write performance in geofileops vs. pyogrio. Looking into this I found out that having more cache memory available to SQLite while writing a geopackage speeds up writing by 25%.

Looking a bit deeper it isn't really the writing that gets faster, but it is the creation of the spatial index that speeds up a lot.

This logging shows this behaviour:

  • if the file is written without index, there is no difference when more memory is made available
  • if the index is written, the time of writing/creating the index halves when 50MB memory is available (instead of 2 MB default):
INFO:__main__:initial read took 7.421876
INFO:__main__:write (no pragma) took 39.741871
INFO:__main__:write (no config option, no index) took 15.013576
INFO:__main__:create spatial index (no config option) took 25.307128
INFO:__main__:write with config option {'OGR_SQLITE_CACHE': '50'} took 29.846694
INFO:__main__:write (with config option {'OGR_SQLITE_CACHE': '50'}, no index) took 14.785579
INFO:__main__:create spatial index (with config option {'OGR_SQLITE_CACHE': '50'}) took 15.918411
INFO:__main__:write (no config option) took 39.343089

So, I wonder... maybe it would be a good idea to:

  1. increase the memory available by default
    • add some guidance about this in the documentation?

To make it easier to use any gdal options I just added an "options" dict parameter in the geofileops I/O functions to be able to specify any of the 5 different types of options that exist in GDAL (including config options) to be applied. More info here: https://geofileops.readthedocs.io/en/latest/api/geofileops.convert.html#geofileops.convert Maybe that could be an interesting approach in pyogrio as well?

theroggy avatar Apr 14 '22 15:04 theroggy

Interesting...

increase the memory available by default

I don't know if there are some negative implications (@brendan-ward any idea?). If not, it sounds like a good idea.

add some guidance about this in the documentation?

Yes, we should do that for sure. Not just the memory but also document the ability to skip index creation. If you open your gpgks in geopandas only, there's no point in creating one.

I just added an "options" dict parameter ... Maybe that could be an interesting approach in pyogrio as well?

You should be able to pass GDAL options directly to write_dataframe as kwargs (as you fixed in #67), so there is probably no need for options dict?

martinfleis avatar Apr 20 '22 12:04 martinfleis

This seems like a good idea to add to both documentation and also the default memory (but maybe a smaller default, e.g., 24MB ?).

I'm not vary familiar with using layer creation options here. Are you able to pass those directly as kwargs, or is extra control over those required to split out layer creation options from dataset creation options, etc. ?

brendan-ward avatar Apr 20 '22 19:04 brendan-ward

For creating datasets there are 3 types of options relevant that need to be seperate:

  • dataset creation options: to be passed to GDALCreate()
  • layer creation options: to be passed to GDALDatasetCreateLayer()
  • gdal configuration options: they can be set/reset using the CPLSetConfigOption. These are "session based" settings, so they don't need to be strictly linked to the API to create a file/layer, but I think it is quite convenient to be able to use them that way so cleanup/... is handled for you.

When updating datasets the dataset creation options are not relevant anymore, but another type of options, the dataset open options become relevant and should be passed to GDALOpenEx.

All those options are different kinds and you need to know up-front which options should be passed where...

theroggy avatar Apr 21 '22 17:04 theroggy

gdal configuration options

We have set_gdal_config_options for this, but it wasn't intended to wrap a single call to write.

So at minimum it sounds like we need parameters to allow passing in dataset creation options (maybe dataset_creation_options=dict|None) vs layer creation options, and that the existing kwargs passed now are ambiguous in that they only apply to layer creation options, so we should make those explicit (maybe layer_creation_options=dict|None).

brendan-ward avatar Apr 21 '22 17:04 brendan-ward

If you look at eg https://gdal.org/drivers/vector/gpkg.html or https://gdal.org/drivers/vector/shapefile.html, I think in general the most relevant options are "layer creation options". So personally I would keep the convenience of automatically passing **kwargs to GDALDatasetCreateLayer (but more explicitly document it)

jorisvandenbossche avatar Apr 21 '22 19:04 jorisvandenbossche

It seems that Fiona passes the kwargs to both the dataset and layer creation. And I suppose that GDAL will then ignore the ones that are not relevant?

jorisvandenbossche avatar Apr 21 '22 19:04 jorisvandenbossche

Fiona passes the kwargs to both the dataset and layer creation

This is definitely worth a shot and would allow us to keep using kwargs as-is.

brendan-ward avatar Apr 21 '22 20:04 brendan-ward

It seems that Fiona passes the kwargs to both the dataset and layer creation. And I suppose that GDAL will then ignore the ones that are not relevant?

GDAL writes a warning if an option is passed to the "wrong" function WARNING:gdal:NotSupported: dataset C:\temp\output_options.gpkg does not support layer creation option DATETIME_FORMAT

theroggy avatar Apr 21 '22 23:04 theroggy

If you look at eg https://gdal.org/drivers/vector/gpkg.html or https://gdal.org/drivers/vector/shapefile.html, I think in general the most relevant options are "layer creation options". So personally I would keep the convenience of automatically passing **kwargs to GDALDatasetCreateLayer (but more explicitly document it)

It depends from GDAL driver to GDAL driver. Eg. GeoRSS and GML have mainly dataset creation options. The cases are rare, but I also remember encountering options somewhere that exist both as layer creation option and dataset creation option.

theroggy avatar Apr 21 '22 23:04 theroggy

Hi, I'm not sure if this is the best issue to raise this in but yesterday I tried to use pyogrio to create a GeoPackage with two dataset (rather than layer) creation options set but couldn't see a way to do so, in that all **kwargs were interpreted as layer creation options and I got warnings that they couldn't be found.

My reading of this issue is that it's currently not possible to officially set dataset options. If so is there anyway to unofficially workaround this at the moment?

felnne avatar Nov 14 '22 15:11 felnne

@felnne can you open a new issue for this and include a bit more details on what you've tried and expected to work (including code)?

martinfleis avatar Nov 14 '22 15:11 martinfleis

The original reason this issue was opened has been "solved" in GDAL, by an alternative implementation of index creation that doesn't need extra parameters for increased memory usage: https://github.com/OSGeo/gdal/issues/7614

theroggy avatar Jan 27 '24 19:01 theroggy