Use smaller internal data format when possible
From looking at some examples, it appears that data is always loaded into float64 arrays. For example, in https://github.com/gjoseph92/stackstac/blob/5f984b211993380955b5d3f9eba3f3e285f6952c/examples/show.ipynb, loading the RGB bands of a Sentinel-2 asset (rgb = stack.sel(band=["B04", "B03", "B02"]).persist()) creates an xarray DataArray of type float64. It seems to me that you could improve performance (or at least memory usage) by using a smaller data type when possible.
You could look at the raster:bands object if it exists to optimize the xarray data type. If the extension doesn't exist, or if the bands have mixed dtypes, then fall back to float64?
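To make the suggestion concrete, here is a rough sketch of the fallback logic in plain Python. It is not how stackstac works today; infer_dtype and the asset dictionaries are hypothetical, and I'm assuming each asset is a plain dict that may carry a raster:bands list whose entries have a data_type string.

```python
from typing import Any, Iterable, Mapping

FALLBACK_DTYPE = "float64"


def infer_dtype(
    assets: Iterable[Mapping[str, Any]], fallback: str = FALLBACK_DTYPE
) -> str:
    """Return the data_type shared by every band of every asset, else `fallback`.

    Hypothetical helper: falls back when raster:bands is missing, when a band
    has no data_type, or when the bands disagree.
    """
    seen = set()
    for asset in assets:
        bands = asset.get("raster:bands")
        if not bands:
            return fallback  # extension not present on this asset
        for band in bands:
            data_type = band.get("data_type")
            if data_type is None:
                return fallback  # band does not declare its dtype
            seen.add(data_type)
    # Exactly one distinct dtype means it is safe to use it for everything.
    return seen.pop() if len(seen) == 1 else fallback


# Made-up asset metadata, for illustration only.
b04 = {"href": "B04.tif", "raster:bands": [{"data_type": "uint16"}]}
b03 = {"href": "B03.tif", "raster:bands": [{"data_type": "uint16"}]}
dem = {"href": "dem.tif", "raster:bands": [{"data_type": "float32"}]}
no_ext = {"href": "other.tif"}

print(infer_dtype([b04, b03]))     # all bands agree
print(infer_dtype([b04, dem]))     # mixed dtypes
print(infer_dtype([b04, no_ext]))  # extension missing
```

The returned name could then be turned into a NumPy dtype (for example with np.dtype(...)) before building the array.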
Using float64 by default was an intentional choice because:
- raster:bands didn't exist when I wrote everything a few months ago, so there was no way to know without actually fetching data what the native dtype of the asset would be. But we have to know that ahead of time to correctly construct the dask array. So float64 seemed like the safest default, since anything else could lose precision.
- rescale=True by default, which uses the scale_offset metadata defined within each GeoTIFF (not known within the STAC metadata) to apply rescaling. So even if the asset were uint16 to begin with, it could become float64 after applying rescaling, which is yet another reason why that default made sense.
However, from what I've seen, nobody really sets the scale_offset metadata at the GeoTIFF level, so I think this might be reasonable to remove. It would make thinking about dtypes a lot easier.
Note that you can control the dtype using the dtype= parameter to stackstac.stack. You'll also want to set rescale=False if doing this, as noted in the docs.
I'd really like to make this automatic though. I think raster:bands is the missing link to allow us to do that. Having data_type, scale, offset, and nodata in metadata really changes the game!
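As a rough illustration of why those four fields matter together, here's a sketch of how they could decide the output dtype and fill value up front. This is not stackstac's implementation; plan_output is a hypothetical helper, and I'm assuming a missing scale or offset means "no rescaling" (scale 1, offset 0).

```python
from typing import Any, Mapping, Optional, Sequence, Tuple


def plan_output(
    bands: Sequence[Mapping[str, Any]], rescale: bool = True
) -> Tuple[str, Optional[Any]]:
    """Pick (dtype, fill_value) from raster:bands entries, before reading data.

    Hypothetical helper. Assumes absent scale/offset mean identity rescaling.
    """
    dtypes = {band.get("data_type") for band in bands}
    nodatas = {band.get("nodata") for band in bands}
    # Rescaling only changes values if some band has a non-identity transform.
    changes_values = any(
        band.get("scale", 1) != 1 or band.get("offset", 0) != 0 for band in bands
    )

    if not bands or None in dtypes or len(dtypes) != 1:
        dtype = "float64"  # unknown or mixed dtypes: keep the safe default
    elif rescale and changes_values:
        dtype = "float64"  # integer data would become floating point
    else:
        dtype = next(iter(dtypes))

    # Only reuse nodata as the fill value when every band agrees on it.
    fill_value = next(iter(nodatas)) if len(nodatas) == 1 else None
    return dtype, fill_value


# Made-up band metadata, for illustration only.
plain = [{"data_type": "uint16", "nodata": 0}]
scaled = [{"data_type": "uint16", "nodata": 0, "scale": 0.0001, "offset": 0}]

print(plan_output(plain))
print(plan_output(scaled))
print(plan_output(scaled, rescale=False))
```

With scale and offset in the STAC metadata, the "does rescaling turn this into floats?" question can be answered without opening the GeoTIFF, which is the part that wasn't possible before.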