
Document GDAL_CACHEMAX

Open jennan opened this issue 3 years ago • 1 comments

Hello,

First, thank you so much for this very cool project :). This does not seem to be documented elsewhere, but I realized that the default GDAL_CACHEMAX can have a detrimental effect when using rioxarray in an HPC environment where nodes have hundreds of GB of RAM.

More precisely, by default GDAL will use 5% of the total RAM as cache (see https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_CACHEMAX), which adds up quickly when using multiple workers in a job with limited memory (i.e. a job that requests much less memory than a whole node). It took me a while to figure out where the wild memory consumption of my workers was coming from (checking the Dask Dashboard), so I thought it could be relevant information to expose to users of rioxarray, even though the cause is the underlying GDAL library.
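To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The node size, worker count, and job memory request are hypothetical numbers chosen for illustration, and the sketch assumes each worker is a separate process holding its own GDAL cache sized from the node's total RAM.

```python
# Hypothetical figures, for illustration only.
GIB = 1024**3
node_ram_bytes = 256 * GIB      # total RAM of the node (assumed)
job_request_bytes = 32 * GIB    # memory requested by the job (assumed)
n_worker_processes = 8          # one GDAL cache per process (assumed)

# Default GDAL_CACHEMAX: 5% of total RAM, computed with integer arithmetic.
cache_per_worker = node_ram_bytes * 5 // 100
cache_total = cache_per_worker * n_worker_processes

print(f"cache per worker: {cache_per_worker / GIB:.1f} GiB")
print(f"cache across workers: {cache_total / GIB:.1f} GiB")
print(f"exceeds job request: {cache_total > job_request_bytes}")
```

Under these assumptions the caches alone could grow well beyond what the job requested, since the 5% is taken from the node's RAM and not from the job's allocation.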

I would be happy to make a pull request for this; don't hesitate to tell me which part of the documentation you feel this could fit in (if you think it should go in the rioxarray documentation). One idea could be to add it as an example notebook (it only takes a %env GDAL_CACHEMAX=64 at the beginning to make it use less memory, even in distant workers)?
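Outside a notebook, the same setting can be sketched in plain Python. This is a minimal example, not an official rioxarray recipe: the helper name `limit_gdal_cache` is made up for illustration, and it assumes the variable is set before GDAL first reads its cache configuration (i.e. before the first raster is opened in that process).

```python
import os


def limit_gdal_cache(megabytes: int = 64) -> str:
    """Set GDAL_CACHEMAX for the current process (hypothetical helper).

    GDAL interprets small values of GDAL_CACHEMAX as megabytes.
    """
    os.environ["GDAL_CACHEMAX"] = str(megabytes)
    return os.environ["GDAL_CACHEMAX"]


# Call this early, before opening any raster with rioxarray/rasterio.
limit_gdal_cache(64)
```

If the workers do not inherit the environment of the launching process, the same function could presumably be executed on each of them, for instance with `Client.run(limit_gdal_cache)` from dask.distributed.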

And again, thank you for this toolbox, it's very neat :).

P.S.: I am using the KEA format backend in GDAL.

jennan • Mar 13 '22 23:03

I would be happy to make a pull request for this

Sure, I think that would be helpful. I am wondering if we should add a FAQ/Gotchas page for rioxarray, similar to the one in pyproj (ref), where we can capture common tips that would be useful for rioxarray users. What are your thoughts?

snowman2 • Mar 14 '22 13:03