netcdf4-python icon indicating copy to clipboard operation
netcdf4-python copied to clipboard

Unsigned byte according to CF conventions

Open cpaulik opened this issue 8 years ago • 10 comments

Reading Section 2.2 of the CF conventions I gather that I can only use np.byte datatypes but if valid_range is using np.ubyte values then the byte data should be interpreted as unsigned.

This is currently not the case in this library. If we want to support this during auto_mask_and_scale then I'm happy to look into it.

cpaulik avatar Dec 01 '15 10:12 cpaulik

You can use unsigned data types when the file is created with the NETCDF4 flag, or even in the old format if the file is created using the new NETCDF3_64BIT_DATA flag available in version 1.2.2 (4.4.0 of the C lib). So, unless I'm missing something, it seems the CF section you refer to is out of date, and is trying to work around a problem that no longer exists.

jswhit avatar Dec 01 '15 13:12 jswhit

https://github.com/cf-convention/CF-2/issues/3

jswhit avatar Dec 01 '15 13:12 jswhit

I guess I see your point though - there are probably a lot of files out there with byte variables that should be interpreted as unsigned. Are you proposing we check valid_range and returned an np.uint8 array if valid_range indicates the data is unsigned?

Not sure where auto_mask_and_scale enters into it...

jswhit avatar Dec 01 '15 13:12 jswhit

Writing ubyte is very possible with the library. The problem I'm having is that I work in a project where the data has to be CF-conform according to the Compliance Checker so I can not use ubyte even though it would be technically possible and sensible.

Long story short: Yes I do propose that we check valid_range, valid_min and valid_max attributes and return np.uint8 if these attributes have unsigned data type.

I guess the decision would be: if both attributes of valid_range are uint8 or if both valid_min and valid_max are uint8 then return uint8 Should this be the default or is it too likely that this conversion would break somebodies code?

On second thought auto_mask_and_scale does not really enter into it since it is only for offset and scale factor.

cpaulik avatar Dec 01 '15 15:12 cpaulik

Do you want np.uint8 if if valid_range is unsigned, or if valid_min > 0? The only thing that concerns me about this is that up until now we have left metadata conventions to be the concern of downstream applications, and let netcdf4-python handle the general low-level detail of reading and writing. However, in this case I don't see the harm of returning an unsigned numpy array if the valid_min/valid_max indicate that the data should fit. What is the real benefit though - client code could always cast the array to uint as needed, right?

jswhit avatar Dec 01 '15 18:12 jswhit

Long story short: Yes I do propose that we check valid_range, valid_min and valid_max attributes and return np.uint8 if these attributes have unsigned data type.

I am :-1: on returning data with a different data type to follow CF conventions, unless the option can be toggled on/off. For tools like xray, we definitely don't want to be using this behavior in netCDF4. My preference would be for such new behavior to be opt in, because it would be slightly trickier to disable it otherwise in a cross-version compatible manner.

The only thing that concerns me about this is that up until now we have left metadata conventions to be the concern of downstream applications, and let netcdf4-python handle the general low-level detail of reading and writing.

This is exactly my perspective :+1:.

shoyer avatar Dec 02 '15 18:12 shoyer

I am also a :+1: to leave any convention, beyond those that define what a netCDF file is, to the downstream applications.

ocefpaf avatar Dec 02 '15 18:12 ocefpaf

That is fine with me. But the question is then what of the netCDF Attribute conventions should be implemented?

Should we follow http://www.unidata.ucar.edu/software/netcdf/docs/netcdf.html ? If so then the section about Attribute Conventions specifies the same implicit conversion in the section about signedness which says

Deprecated attribute, originally designed to indicate whether byte values should be treated as signed or unsigned. The attributes valid_min and valid_max may be used for this purpose. For example, if you intend that a byte variable store only non-negative values, you can use valid_min = 0 and valid_max = 255

cpaulik avatar Dec 02 '15 19:12 cpaulik

Anyway. If we want this then it should definitely be opt-in.

cpaulik avatar Dec 02 '15 19:12 cpaulik

Generally the netcdf libraries dont use the attribute conventions (except for the reserved _ (underscore) attributes). its up to the "user" to handle that. in java we have a wrapper class that will take into account attribute conventions, so the user can get it with or without.

JohnLCaron avatar Dec 02 '15 23:12 JohnLCaron