datacube-core icon indicating copy to clipboard operation
datacube-core copied to clipboard

nodata for product definition doesnt match dtype

Open pindge opened this issue 1 year ago • 15 comments

related to: https://github.com/GeoscienceAustralia/dea-config/issues/1110

247 | ga_srtm_dem1sv1_0 | {"platform": {"code": "Space Shuttle Endeavour"}, "instrument": {"name": "SIR"}, "product_type": "DEM"} |                 1 | {"name": "ga_srtm_dem1sv1_0", "storage": {"crs": "EPSG:4326", "resolution": {"latitude": -0.00027777777778, "longitude": 0.00027777777778}}, "metadata": {"platform": {"code": "Space Shuttle Endeavour"}, "instrument": {"name": "SIR"}, "product_type": "DEM"}, "description": "DEM 1sec Version 1.0", "measurements": [{"name": "dem", "dtype": "float32", "units": "metre", "nodata": -340282350000000000000000000000000000000}, {"name": "dem_s", "dtype": "float32", "units": "metre", "nodata": -340282350000000000000000000000000000000}, {"name": "dem_h", "dtype": "float32", "units": "metre", "nodata": -340282350000000000000000000000000000000}], "metadata_type": "eo"} | 2020-09-25 04:29:16.760552+00 | ows      | 2021-07-01 01:07:29.571853+00
(1 row)

eo3-validate type handling needs update

pindge avatar Nov 07 '22 23:11 pindge

Our validator is correctly failing the product --- should this issue go to ODC core?

jeremyh avatar Nov 07 '22 23:11 jeremyh

@SpacemanPaul suggested this is not core's issue, it was just transferred here from core

pindge avatar Nov 07 '22 23:11 pindge

I'm not that strongly opinionated on the matter.

What would the metadata fix be though? -340282350000000000000000000000000000000 -> -3.4028235e+38 ?

What does a nodata value on a floating point variable even mean?

SpacemanPaul avatar Nov 07 '22 23:11 SpacemanPaul

nodata supported type can be anything as per odc schema https://github.com/opendatacube/datacube-core/blob/develop/datacube/model/schema/dataset-type-schema.yaml#L97-L100

nodata:
              oneOf:
                - type: number
                - enum: [NaN, Inf, -Inf]
                -

dtype restriction https://github.com/opendatacube/datacube-core/blob/develop/datacube/model/schema/dataset-type-schema.yaml#L54-L55

pindge avatar Nov 09 '22 00:11 pindge

https://github.com/opendatacube/datacube-core/pull/1347

pindge avatar Nov 09 '22 03:11 pindge

What does a nodata value on a floating point variable even mean?

same as for an int: bit pattern reserved to mean absence of valid observation, just cause float32 already has standard bit-patterns that mean exactly that, doesn't stop some from using some finite values instead and sometimes along with proper float-point NaNs. All masking code is significantly complicated due to float nodata that could be nan or some finite float value, or absent, which is the same as nan.

Kirill888 avatar Nov 09 '22 04:11 Kirill888

The real value is probably: -340282346638528859811704183484516925440 which is numpy.finfo(numpy.float32).min, but it lost precision when converting to string via scientific notation.

Our validator is correctly failing the product --- should this issue go to ODC core?

@jeremyh

test in numpy_value_fits_dtype

        return np.all(np.array([value], dtype=dtype) == [value])

doesn't check what it claims, it checks that value can round-trip to numpy and back, and that's only appropriate for integer types. Value as supplied, is probably incorrect because of precision loss, but it is surely not outside of valid range for float32 number.

Kirill888 avatar Nov 09 '22 05:11 Kirill888

checking the dataset in middle of the ocean

import datacube
dc = datacube.Datacube()
x = 136.43551
y = -37.25564
buffer_x = 0.03
buffer_y = 0.03
query = dict(
    product="ga_srtm_dem1sv1_0",
    x=(x - buffer_x, x + buffer_x),
    y=(y - buffer_y, y + buffer_y),
)
ds = dc.load(**query)
ds

the output of stored nodata

dem
(time, latitude, longitude)
float32
-3.403e+38 ... -3.403e+38
array([[[-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        ...,
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38]]], dtype=float32)
dem_s
(time, latitude, longitude)
float32
-3.403e+38 ... -3.403e+38
array([[[-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        ...,
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38]]], dtype=float32)
dem_h
(time, latitude, longitude)
float32
-3.403e+38 ... -3.403e+38
array([[[-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        ...,
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38],
        [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, ...,
         -3.4028235e+38, -3.4028235e+38, -3.4028235e+38]]], dtype=float32)

pindge avatar Nov 09 '22 21:11 pindge

ds.dem[0][1][1]
xarray.DataArray'dem'
array(-3.4028235e+38, dtype=float32)
Coordinates:
time
()
datetime64[ns]
2014-12-15T14:58:44
latitude
()
float64
-37.23
longitude
()
float64
136.4
spatial_ref
()
int32
4326
Attributes:
units :
metre
nodata :
-340282350000000000000000000000000000000
crs :
EPSG:4326
grid_mapping :
spatial_ref

pindge avatar Nov 09 '22 21:11 pindge

Example of datacube's mask_invalid_data function failing to mask nodata for this product. If this worked as intended, I would expect to see all -3.4028235e+38 values in the dataset masked out and set to np.NaN:

import datacube
from datacube.utils.masking import mask_invalid_data

dc = datacube.Datacube()
x = 136.43551
y = -37.25564
buffer_x = 0.03
buffer_y = 0.03
query = dict(
    product="ga_srtm_dem1sv1_0",
    x=(x - buffer_x, x + buffer_x),
    y=(y - buffer_y, y + buffer_y),
)
ds = dc.load(**query)

# Attempt to mask invalid data
ds_masked = mask_invalid_data(ds.dem_h)
ds_masked

Nodata values are not correctly replaced with np.NaN: image

robbibt avatar Nov 09 '22 22:11 robbibt

@robbibt can you try patching .nodata value on xarray variable to -340282346638528859811704183484516925440 and then try masking again? Also value above, might pass validation as it should round-trip to/from numpy without loss.

Kirill888 avatar Nov 10 '22 01:11 Kirill888

What about nodata tags in the imagery itself, what are they set to? Also is nodata coming from files or is it coming from load, load will place whatever incorrect rounded value recorded in the product spec, and then separate to that data itself might contain non-rounded values of nodata.

Kirill888 avatar Nov 10 '22 01:11 Kirill888

@Kirill888 That works!

import datacube
from datacube.utils.masking import mask_invalid_data

dc = datacube.Datacube()
x = 136.43551
y = -37.25564
buffer_x = 0.03
buffer_y = 0.03
query = dict(
    product="ga_srtm_dem1sv1_0",
    x=(x - buffer_x, x + buffer_x),
    y=(y - buffer_y, y + buffer_y),
)
ds = dc.load(**query)

# Attempt to mask invalid data
ds.dem_h.attrs['nodata'] = -340282346638528859811704183484516925440
ds_masked = mask_invalid_data(ds.dem_h)
ds_masked

image

What about nodata tags in the imagery itself

Not sure what you mean by this, do you mean the nodata attributes on the different bands of the dataset?

robbibt avatar Nov 10 '22 02:11 robbibt

Not sure what you mean by this, do you mean the nodata attributes on the different bands of the dataset?

Is there nodata marker on the TIFF image itself (rio info path_to_image), and is it set to a correct value or is it truncated, like in the product definition.

Kirill888 avatar Nov 10 '22 05:11 Kirill888

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 04:03 stale[bot]