Thoughts on adding a 'mask_and_scale' option to the 'load()' call?

Open snowman2 opened this issue 5 years ago • 6 comments

https://github.com/opendatacube/datacube-core/blob/bdf459949ce18689a3be87350919f406e3376157/datacube/api/core.py#L137-L140

In rioxarray.open_rasterio we added a mask_and_scale option for loading data. xarray.open_dataset() supports it as well, and it is useful to have. Thoughts on adding the option to the load() method?
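For readers unfamiliar with the option, here is a minimal sketch of what mask-and-scale decoding usually means: the fill value becomes NaN and the packed integers are converted with `raw * scale_factor + add_offset`. The function name, the numbers, and the fill value are made up for illustration; this is not the rioxarray or xarray implementation.

```python
import numpy as np


def mask_and_scale(raw, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Decode packed values: mask the fill value as NaN, then scale and offset.

    Hypothetical helper for illustration only.
    """
    # Convert to float first so that NaN can be represented.
    data = raw.astype("float64")
    if fill_value is not None:
        data[raw == fill_value] = np.nan
    return data * scale_factor + add_offset


# Illustrative packed values; 32767 plays the role of the fill value.
raw = np.array([0, 10, 32767], dtype="int16")
decoded = mask_and_scale(raw, scale_factor=0.5, add_offset=100.0, fill_value=32767)
print(raw.dtype, "->", decoded.dtype)
print(decoded)
```

Note that the output dtype is floating point even though the stored values are int16, which matters for the discussion further down.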

snowman2 avatar Dec 16 '19 18:12 snowman2

An example of a dataset we use that includes a scale factor is gridMET, downloads/docs here.

alfredoahds avatar Dec 16 '19 18:12 alfredoahds

Here is a small test file we use in rioxarray with scale/offset based on a file from gridMET: tmmx_20190121.nc.

snowman2 avatar Dec 16 '19 18:12 snowman2

We talked about this internally a few times. There are some complications for datacube that might make it a bit trickier in our case.

  • Datacube deals with multiple, possibly heterogeneous datasets; the scale/offset for one might not be what you want for another.
  • Datacube loads multiple bands, so scale/offset should be configurable per band.
  • Datacube load can change projection on the fly: should scaling apply before or after the projection change?
  • Datacube fuses multiple datasets into one raster plane: should the conversion happen before or after fusing?
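To make the first and last points concrete, here is a small sketch with two made-up datasets that cover the same raster plane but were packed with different scale factors. The `fuse` and `decode` helpers, the nodata value, and the scale factors are all assumptions for illustration, not datacube code.

```python
import numpy as np

NODATA = -1  # hypothetical nodata marker shared by both datasets


def fuse(dst, src, nodata=NODATA):
    """Fill nodata pixels of dst with pixels from src (first valid value wins)."""
    return np.where(dst == nodata, src, dst)


def decode(raw, scale, nodata=NODATA):
    """Turn packed integers into floats, with nodata mapped to NaN."""
    return np.where(raw == nodata, np.nan, raw * scale)


# Two datasets with the same raw value but different packing.
a_raw, a_scale = np.array([100, NODATA], dtype="int16"), 0.5
b_raw, b_scale = np.array([NODATA, 100], dtype="int16"), 0.25

# Option 1: decode each dataset with its own factor, then fuse.
a_dec, b_dec = decode(a_raw, a_scale), decode(b_raw, b_scale)
scaled_then_fused = np.where(np.isnan(a_dec), b_dec, a_dec)

# Option 2: fuse the raw integers, then decode with a single factor.
fused_then_scaled = decode(fuse(a_raw, b_raw), a_scale)

print(scaled_then_fused)   # pixel from b decoded with b's factor
print(fused_then_scaled)   # pixel from b decoded with a's factor
```

The two orders disagree on the pixel that came from the second dataset, so with heterogeneous sources the conversion would have to happen per dataset, before fusing.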

One option is essentially a netcdf-driver-specific implementation that obtains the scaling factors from the file and applies scale/offset on the fly. This would happen before anything else, as if the files contained scaled and offset values to begin with. The problem is that scale/offset changes the data type from, say, int16 to float32|float64, so if you want native values sometimes and scaled values other times it becomes tricky: you now need to model the dtype of a band as a dynamic property that depends on user preferences.

I'll write more on this a bit later in the day.

Kirill888 avatar Dec 16 '19 21:12 Kirill888

Sorry, I didn't get back to this as promised, @snowman2.

I see this as driver-specific functionality. We can implement "always on scaling" quite easily. One would need to declare the measurement with dtype=float32, and we would need to add some extra checks in the loading code to optionally perform scaling/nodata conversion for netcdf sources that have the correct attributes set.

One could also expose the same "storage-level" netcdf values as both scaled and native: expose the same variable twice, with two different names and dtypes. Only float32|float64 measurements would be scaled; integer dtypes like int16 would not be.
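A rough sketch of the "declared dtype decides" idea, assuming the driver already knows the scale, offset, and nodata value of the source. `read_measurement` and its parameters are hypothetical names, not part of the datacube API.

```python
import numpy as np


def read_measurement(raw, declared_dtype, scale=1.0, offset=0.0, nodata=None):
    """Return raw values as-is for integer measurements, scaled for float ones.

    Hypothetical helper: the declared dtype of the measurement is the only switch.
    """
    declared = np.dtype(declared_dtype)
    if declared.kind != "f":
        # Integer measurement: expose native storage values untouched.
        return raw.astype(declared, copy=False)
    # Float measurement: convert, scale, and turn nodata into NaN.
    out = raw.astype(declared)
    out = out * declared.type(scale) + declared.type(offset)
    if nodata is not None:
        out[raw == nodata] = np.nan
    return out


raw = np.array([10, 20, -9999], dtype="int16")

# Same storage-level values exposed twice, under two declared dtypes.
native = read_measurement(raw, "int16")
scaled = read_measurement(raw, "float32", scale=0.5, offset=1.0, nodata=-9999)
print(native.dtype, native)
print(scaled.dtype, scaled)
```

With this shape each measurement has one fixed dtype, so nothing about the band needs to vary with user options at load time.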

As for "scaling by default/native when requested" functionality, this is much harder. This is because:

  1. We currently have no mechanism to pass "load-time context" to the IO driver, so even passing an option into the driver is not possible right now and would require a breaking change to implement.
  2. This would mean supporting a "dynamic" dtype: a band is sometimes float32 and other times int16.

So this will probably have to wait until the ODC 2.0 work. I do think that a "load-time" pixel transform is a useful thing to have, and it needs the same infrastructure: load-time context plus context-dependent dtype.

For right now have a look here:

https://github.com/opendatacube/odc-tools/blob/master/libs/algo/odc/algo/_masking.py#L117

It provides scaling/masking and supports dask inputs/outputs. You still need to figure out the scaling parameters from the file or some side channel, though...

Kirill888 avatar Dec 18 '19 22:12 Kirill888

@Kirill888 I like your ideas for the functionality for always on scaling. Thoughts on implementing that now and then in ODC 2.0 adding the option to disable it? I am not sure I like the idea of double loading data though.

For reference, the scales and offsets on a per-band basis can be obtained from rasterio (code). We then use them on load here. There are also bits of code handling the unsigned case here, based on the _Unsigned attribute.
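As a rough illustration of the unsigned case: netCDF data that is stored as a signed integer but carries `_Unsigned = "true"` is meant to be reinterpreted as the unsigned type of the same width before any scaling. The helper below is a hypothetical sketch of that step, not the rioxarray code.

```python
import numpy as np


def apply_unsigned(raw, unsigned_attr):
    """Reinterpret signed integers as unsigned when _Unsigned is "true".

    Hypothetical helper; unsigned_attr is the attribute value read from the file.
    """
    if str(unsigned_attr).lower() == "true" and raw.dtype.kind == "i":
        # Same bytes, unsigned interpretation (e.g. int8 -> uint8).
        return raw.view(f"u{raw.dtype.itemsize}")
    return raw


stored = np.array([-1, -128, 5], dtype="int8")
as_unsigned = apply_unsigned(stored, "true")
untouched = apply_unsigned(stored, "false")
print(as_unsigned.dtype, as_unsigned)
```

Scale and offset would then be applied to the reinterpreted values rather than to the signed ones.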

snowman2 avatar Dec 19 '19 14:12 snowman2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 08 '20 06:08 stale[bot]