python-rasterstats
Question: caching intermediate operations.
Hey hey. Great stuff.
Question: When using python-rasterstats with one polygon and many rasters (or vice versa), do you see a clear spot where intermediate steps can be cached? Examples: the rasterization of the polygon, or the reading of the value raster?
Hey @brendancol,
I'd say rasterstats is designed for the many polygons, one raster scenario so it's already fairly optimal - caching won't help much under that use case since you can already preload the raster into memory and each polygon needs to be rasterized independently.
Caching could potentially help with the one polygon, many rasters scenario. We could cache the rasterized geometry to avoid re-rasterizing. Since rasterizing is a significant chunk of the work (roughly 20%?), that would likely be worth the memory footprint of keeping the rasterized geometries around for reuse across raster bands.
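To make the caching idea concrete, here is a minimal, dependency-free sketch. It is not rasterstats code: `rasterize_box`, `MaskCache`, and `masked_mean` are hypothetical names, nested lists stand in for numpy arrays, and a simple box fill stands in for real polygon rasterization. It only illustrates rasterizing a zone once and reusing the mask for every band.

```python
from statistics import mean


def rasterize_box(bounds, shape):
    """Hypothetical stand-in for polygon rasterization.

    Marks the cells covered by an axis-aligned box given as
    (row_min, col_min, row_max, col_max), with exclusive maxima.
    """
    r0, c0, r1, c1 = bounds
    rows, cols = shape
    return [[r0 <= r < r1 and c0 <= c < c1 for c in range(cols)]
            for r in range(rows)]


class MaskCache:
    """Keeps one rasterized mask per (zone, raster shape) pair."""

    def __init__(self):
        self._masks = {}
        self.rasterize_calls = 0  # counts how often we actually rasterize

    def get(self, bounds, shape):
        key = (bounds, shape)
        if key not in self._masks:
            self.rasterize_calls += 1
            self._masks[key] = rasterize_box(bounds, shape)
        return self._masks[key]


def masked_mean(band, mask):
    """Mean of the band values that fall inside the mask."""
    values = [v
              for band_row, mask_row in zip(band, mask)
              for v, inside in zip(band_row, mask_row)
              if inside]
    return mean(values)


# 28 toy bands of 4x4 cells; band b is filled with the value b.
bands = [[[float(b)] * 4 for _ in range(4)] for b in range(28)]

cache = MaskCache()
zone = (1, 1, 3, 3)  # a 2x2 block in the middle of each band
means = [masked_mean(band, cache.get(zone, (4, 4))) for band in bands]
```

The trade-off is the one mentioned above: each cached mask stays in memory for as long as the cache lives, in exchange for skipping 27 of the 28 rasterizations.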
My work on multiband support has really stalled out: https://github.com/perrygeo/python-rasterstats/issues/73 - there are internal design barriers and numpy behaviors that make it difficult to implement cleanly. But caching rasterized geometries would make a good addition should it ever come to fruition.
We run into the same issue: we have 28 bands, which means the same geometries get rasterized 28 times. Looking at the code, I wonder whether we could add an option to pass the mini_raster (which we can optionally get as an output) as an input to gen_zonal_stats.
Note I'm willing to create a PR, but I'd like to get feedback on the idea before diving into the details.
@johanvdw @brendancol so if the optional mini_raster was supplied, we would skip the rasterization step? At a high level, that seems like a reasonable approach.
You'd still have to call gen_zonal_stats once to get the rasterized geoms, then 27 more times with those geoms passed back in - the caller would be in charge of managing the rasterized geometries in memory (or elsewhere). And of course the caller would need to ensure that all 28 raster bands are aligned exactly the same. I think it could work quite nicely without disrupting the current API.
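A rough sketch of what that caller-side flow might look like, in plain Python so it runs without any dependencies. Everything here is an assumption for illustration: `zonal_sum` and `check_aligned` are hypothetical helpers, not part of the rasterstats API, the `mask` argument only mimics the proposed "pass the mini_raster back in" option, and bands are toy dicts with a shape, an affine transform tuple, and nested-list data.

```python
def check_aligned(bands):
    """Raise if any band differs from the first in shape or affine transform."""
    ref = bands[0]
    for i, band in enumerate(bands[1:], start=1):
        if band["shape"] != ref["shape"] or band["transform"] != ref["transform"]:
            raise ValueError(f"band {i} is not aligned with band 0")


def zonal_sum(band, zone, mask=None):
    """Hypothetical zonal statistic that returns (sum, mask).

    Rasterizes the zone only when no mask is supplied. The zone is a box
    (row_min, col_min, row_max, col_max) with exclusive maxima, standing in
    for a real polygon.
    """
    if mask is None:
        r0, c0, r1, c1 = zone
        rows, cols = band["shape"]
        mask = [[r0 <= r < r1 and c0 <= c < c1 for c in range(cols)]
                for r in range(rows)]
    total = sum(v
                for data_row, mask_row in zip(band["data"], mask)
                for v, inside in zip(data_row, mask_row)
                if inside)
    return total, mask


# 28 toy bands on the same 3x3 grid; band b is filled with the value b.
transform = (1.0, 0.0, 0.0, 0.0, -1.0, 3.0)
bands = [{"shape": (3, 3),
          "transform": transform,
          "data": [[b] * 3 for _ in range(3)]}
         for b in range(1, 29)]

# The caller is responsible for verifying alignment before reusing a mask.
check_aligned(bands)

zone = (0, 0, 2, 2)
# First call rasterizes and hands the mask back to the caller...
first_sum, mask = zonal_sum(bands[0], zone)
# ...and the remaining 27 calls reuse it instead of rasterizing again.
sums = [first_sum] + [zonal_sum(band, zone, mask=mask)[0] for band in bands[1:]]
```

The alignment check matters because a mask is only valid for rasters that share the grid it was rasterized on; reusing it on a band with a different shape or transform would silently select the wrong cells.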