python-rasterstats icon indicating copy to clipboard operation
python-rasterstats copied to clipboard

Make int64 casting optional

Open maxfenv opened this issue 3 years ago • 3 comments
trafficstars

With 'large' raster datasets of integer datatype, rasterstats wastefully casts the ndarray read from the raster up to int64. In the extreme case, this uses 8 times more memory than necessary if the underlying datatype of the raster dataset is Byte.

At least for categorical rasters and when using categorical=True, it seems desirable to disable this behaviour to save memory.

In a not particularly extreme case, this saved us on the order of 20G of memory:

The raster datasets covers all of Scotland and is of the Byte type. The individual features in the vector dataset are the Scottish country boundaries and therefore cover a decent portion of the raster area (and their bounding box an even larger portion).

The raster dataset is 46478 * 69365 pixels, 1 byte per pixel = 3.00GiB uncompressed in memory if stored in an ndarray of dtype int8. It would be 24GiB in int64.

maxfenv avatar Oct 25 '22 10:10 maxfenv

@sebastianclarke for visibility.

Our current hack to reduce memory usage is monkey patching rasterstats.main.sys.maxsize = 2**32, which is not ideal.

maxfenv avatar Oct 25 '22 10:10 maxfenv

These numpy stats functions accept a dtype argument that determines the dtype of the accumulator for functions like mean, sum, etc (see: https://numpy.org/doc/stable/reference/generated/numpy.sum.html). When calculating the mean, for example, we can force that calculation to happen with int64 without having to cast the input to int64 by doing masked.mean(dtype='int64') instead of masked.mean()

Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).

groutr avatar Oct 25 '22 19:10 groutr

Adding a keyword argument should not be necessary (and casting the input to int64 shouldn't be necessary either).

I would tend to agree that casting seems unnecessary, but I'm insufficiently familiar with the operations happening here to be sure. Certainly for categorical rasters, it seems unlikely that any value in the array should overflow the datatype in the raster dataset. With non categorical rasters, I don't know. Maybe some mathematical operations happen on the data that would cause an overflow?

I can only assume there was a good reason for introducing the cast in the first place.

maxfenv avatar Oct 26 '22 08:10 maxfenv

@groutr Thanks for the tip - the dtype accumulator might be exactly what we need here - the array stays in its native dtype but the result can't overflow.

@maxfenv if you'd like, I can implement the above as an alternative to this PR. It should be the best of both worlds rather than having a boolean flag.

perrygeo avatar Jan 15 '23 04:01 perrygeo

@perrygeo Yeah go ahead with that if you prefer and you have the time! It is certainly cleaner.

maxfenv avatar Jan 16 '23 09:01 maxfenv