netcdf4-python
Apply scale and offset before checking valid_min/valid_max
In the latest versions, data is masked when it falls outside [valid_min, valid_max]. But apparently this does not work for variables where scale and offset are given, since the validity checks seem to be performed on the raw (unscaled) data.
This is intentional - the valid_min/valid_max checks are done on the integer data as it is actually stored in the netCDF file. This is consistent with missing_value and _FillValue, which also have the native (integer) type when scale/offset packing is used.
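For concreteness, here is a rough sketch of that order of operations, mimicking what the auto-masking does (this is not the library's actual code, and the file/variable names are made up):

```python
import netCDF4
import numpy as np

ds = netCDF4.Dataset("example.nc")       # hypothetical file
var = ds.variables["some_packed_var"]    # hypothetical short variable

# Read the raw packed integers, with no automatic masking or scaling.
var.set_auto_maskandscale(False)
raw = var[:]

# The valid_min/valid_max comparison is done on the raw values,
# i.e. in the variable's stored (short integer) type ...
mask = np.zeros(raw.shape, dtype=bool)
if hasattr(var, "valid_min"):
    mask |= raw < var.valid_min
if hasattr(var, "valid_max"):
    mask |= raw > var.valid_max

# ... and only afterwards are scale_factor/add_offset applied.
unpacked = raw * var.scale_factor + var.add_offset
data = np.ma.masked_array(unpacked, mask=mask)
```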
OK, that is good to know. But this means that there are quite a few netCDF files "out there" where valid_min/valid_max mistakenly refer to the scaled-and-offset values, not to the integer values. One example (in my case) is the ROMS ocean model.
A highly welcome feature would therefore be a switch to apply scale/offset before the min/max check (or rather, to scale/offset the min/max values before masking).
I don't think it makes sense to specify valid_min/valid_max with a type different from the netCDF variable's. What you're asking for is a floating-point valid range for short integer data. I foresee all kinds of floating-point comparison issues at the edge of the range.
Do these ROMS files specify the missing_value as a floating point type also?
Here is an example of a variable that is returned entirely masked with netCDF4 1.3.0:
```
short Cs_w(s_w) ;
        Cs_w:long_name = "S-coordinate stretching curves at W-points" ;
        Cs_w:valid_min = -1. ;
        Cs_w:valid_max = 0. ;
        Cs_w:field = "Cs_w, scalar" ;
        Cs_w:scale_factor = -1.52597204419215e-05 ;
        Cs_w:add_offset = -0.5 ;
```
Cs_w should have values between -1 and 0, but the raw packed values (shorts) are integers roughly in the range [-32000, 32000], so they all fall outside [valid_min, valid_max] and are being masked. So this is perhaps a post-processing issue: the file might have been packed afterwards (ncpdq?), and valid_min/valid_max were then not updated as they should have been.
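As an aside, the Cs_w packing also illustrates the edge-of-range comparison issue raised above: even if the valid range were checked against the scaled values, the extreme shorts would land just outside [-1, 0]. A rough illustration (assuming the packing was meant to span the full short range):

```python
import numpy as np

scale = -1.52597204419215e-05
offset = -0.5

print(np.int16(32767) * scale + offset)   # ~ -1.000015, just below valid_min = -1.
print(np.int16(-32767) * scale + offset)  # ~  0.000015, just above valid_max = 0.
```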
After setting var.set_auto_mask(False) in my script, this variable is read properly, as before.
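For reference, a workaround sketch along the lines of the switch suggested above - unpack by hand and check the valid range against the unpacked values (the attributes are assumed present, and the file name is made up):

```python
import netCDF4
import numpy as np

ds = netCDF4.Dataset("roms_avg.nc")  # hypothetical file
var = ds.variables["Cs_w"]

# Do the masking and scaling by hand, checking valid_min/valid_max
# against the unpacked (float) values instead of the raw shorts.
var.set_auto_maskandscale(False)
unpacked = var[:] * var.scale_factor + var.add_offset
data = np.ma.masked_outside(unpacked, var.valid_min, var.valid_max)
```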
However, for some reason, this variable in the same file is read fine:
```
short u(ocean_time, s_rho, eta_u, xi_u) ;
        u:long_name = "time-averaged u-momentum component" ;
        u:units = "meter second-1" ;
        u:time = "ocean_time" ;
        u:coordinates = "lon_u lat_u s_rho ocean_time" ;
        u:field = "u-velocity, scalar, series" ;
        u:_FillValue = 1.e+37f ;
        u:scale_factor = -2.378769e-05f ;
        u:add_offset = 0.3410598f ;
```
whereas e.g. this variable has a lot of masked values that were not masked in earlier versions (e.g. 1.2.4) of the netCDF4 library:
```
short temp(ocean_time, s_rho, eta_rho, xi_rho) ;
        temp:long_name = "time-averaged potential temperature" ;
        temp:units = "Celsius" ;
        temp:time = "ocean_time" ;
        temp:coordinates = "lon_rho lat_rho s_rho ocean_time" ;
        temp:field = "temperature, scalar, series" ;
        temp:_FillValue = 1.e+37f ;
        temp:scale_factor = -0.0001446465f ;
        temp:add_offset = 3.502546f ;
```
It seems like values above ca. 3.5 (i.e. close to the add_offset value) are being masked, for some reason. But there is no valid_min/valid_max here?
I see that the _FillValue is 1.e+37 - this will get cast to the data type of the variable (short integer), where it ends up as 0, resulting in all the zero values in the array being masked.
_FillValue, missing_value, valid_min and valid_max should all have the same data type as the netCDF variable. This is a bug in the ROMS files.
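A quick way to see this (the exact result of an out-of-range float-to-integer cast is platform-dependent, but on typical builds it comes out as 0 here):

```python
import numpy as np

fill = np.float32(1.e+37)
print(fill.astype(np.int16))  # typically 0 - so the raw zeros get masked
```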
Ok, then it starts making sense. Thank you for clarifying.
I can see how this can be confusing and can create hard-to-debug errors. Perhaps we should refuse to do the auto-scaling and masking (and throw an exception instead) if the attribute data types don't match the variable data type.
With pull request #708, a warning will be issued if the _FillValue cannot be safely cast to the variable data type, and it will not be used to mask the returned data.
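For illustration, the kind of check this implies could look roughly like the following (a sketch only, not the actual code of #708):

```python
import warnings
import numpy as np

def usable_fill_value(var):
    """Return _FillValue only if it can be safely cast to the
    variable's dtype; otherwise warn and return None (sketch only)."""
    fv = getattr(var, "_FillValue", None)
    if fv is None:
        return None
    if not np.can_cast(np.asarray(fv).dtype, var.dtype, casting="safe"):
        warnings.warn("_FillValue type does not match variable type; "
                      "not using it to mask the returned data")
        return None
    return fv
```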