python-rasterstats
python-rasterstats copied to clipboard
Make int64 casting optional
With 'large' raster datasets of integer datatype, rasterstats wastefully casts the ndarray read from the raster up to int64. In the extreme case, this uses 8 times more memory than necessary if the underlying datatype of the raster dataset is Byte.
At least for categorical rasters and when using categorical=True, it seems desirable to disable this behaviour to save memory.
In a not particularly extreme case, this saved us on the order of 20G of memory:
The raster datasets covers all of Scotland and is of the Byte type. The individual features in the vector dataset are the Scottish country boundaries and therefore cover a decent portion of the raster area (and their bounding box an even larger portion).
The raster dataset is 46478 * 69365 pixels, 1 byte per pixel = 3.00GiB uncompressed in memory if stored in an ndarray of dtype int8. It would be 24GiB in int64.
@sebastianclarke for visibility.
Our current hack to reduce memory usage is monkey patching rasterstats.main.sys.maxsize = 2**32, which is not ideal.
These numpy stats functions accept a dtype argument that determines the dtype of the accumulator for functions like mean, sum, etc (see: https://numpy.org/doc/stable/reference/generated/numpy.sum.html).
When calculating the mean, for example, we can force that calculation to happen with int64 without having to cast the input to int64 by doing masked.mean(dtype='int64') instead of masked.mean()
Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).
Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).
I would tend to agree that casting seems unnecessary, but I'm insufficiently familiar with the operations happening here to be sure. Certainly for categorical rasters, it seems unlikely that any value in the array should overflow the datatype in the raster dataset. With non categorical rasters, I don't know. Maybe some mathematical operations happen on the data that would cause an overflow?
I can only assume there was a good reason for introducing the cast in the first place.
@groutr Thanks for the tip - the dtype accumulator might be exactly what we need here - the array stays in its native dtype but the result can't overflow.
@maxfenv if you'd like, I can implement the above as an alternative to this PR. It should be the best of both worlds rather than having a boolean flag.
@perrygeo Yeah go ahead with that if you prefer and you have the time! It is certainly cleaner.