python-rasterstats
python-rasterstats copied to clipboard
nodata count includes non-overlapping cells
For each feature, there are three types of cells: data, nodata, and non-overlapping. Nodata and non-overlapping get lumped together in the nodata statistic.
Was this always the case? Was the bug introduced in an attempt to optimize?
- [ ] write a failing test case for the partial overlap case
- [ ] fix it
Also, let's consider nans.
Here's an example, crudely drawn. The raster is 3 pixels tall with a nodata value of 0, the geometry covers 6 pixels.

So the breakdown of cell counts should be
- 1 valid data cell (
count) - 1 nodata cell (
nodata) - 1 nan cell
- 3 non-overlapping cells
The problem currently is that a) we don't track nan cell counts at all and b) the non-overlapping ones get lumped into nodata
Still debating if it's better to do
{'count': 1, 'nodata': 5}
# or
{'count': 1, 'nodata': 1, 'nan': 1, 'no_overlap': 3}
IOW is there really a need to distinguish between the three types of invalid data cells?
i would track each type and then let the user decide how they are returned. group them by default with an option to return them separately
Implementation is a little trickier than expected, deferring this to the 0.12 milestone
I do not think it's a problem and we need not count non-overlapping pixels. From my understanding, the non-overlapping is not a part of the raster. From above picture, the raster is inside the red box and black is nodata. The blue box is the geometry which covers 8 pixels. So:
{'count': 2, 'nodata': 6}
The white pixels do not belong to the raster.
In your example, I prefer this result:
{'count': 1, 'nodata': 1, 'nan': 1}
We can not get the result like this:
{'count': 1, 'nodata': 5}
looks like the non overlapping areas get lumped into nodata due to the the boundless options (boundless_array func in rasterstats for arrays, and equivalent steps in rasterio for rasters) which either initialize the boundless output with the nodata value (boundless_array) or fill it with the nodata value before it gets returned (rasterio)
this seems to be when it was introduced for arrays, rasters were probably earlier https://github.com/perrygeo/python-rasterstats/commit/e50d89b1f8d1769e7e5b026cf8d5f4e7662cd929
this might work: if the user wants no_overlap count returned, they need to provide a second, unused "fake" nodata value. this value is passed to rasterio as the raster's nodata value which results in the true nodata value being treated as just another value. we then have distinct values for no_overlap vs nodata and just need to update our mask accordingly
in the case when the user does not request the no_overlap field, it just behaves exactly as it currently does
should be able to tweak the boundless_array func to do the same thing for np arrays
PR #146 adds 'nan' stat option
rasterio issue on shapes extending beyond raster extent https://github.com/mapbox/rasterio/issues/995
@perrygeo almost done with a no overlap stat proof of concept that works with array rasters. but i think there may be some limitations to getting it working with rasters opened via rasterio
Mine shapefile polygon is not snapped with raster when i run zonal_stats results are not exactly matching with Arc GIS result. I am using Python 3.7 and pygeoprocessing 1.5. Is there any way by which we can match Arc GIS result?