netcdf4-python

What's the best way to prevent effect of `valid_min` et al.?

Open willirath opened this issue 7 years ago • 9 comments

With v1.2.9, valid_min etc. may have severe effects wherever people have been sloppy with these attributes. (In my experience, there are a lot of these files.)

Is there an easy way to make sure the new standard behaviour is not taking effect? My understanding is that set_auto_maskandscale is set at creation of the data set. Is that correct?

Note that I'm not advocating the sloppiness I described. But there's a lot of data we have to use without being able to make sure it complies with the standards.

(related to #703)

willirath avatar Sep 19 '17 19:09 willirath

Pull request #708 (which will be in 1.3.0) tries to address this by not using valid_min/valid_max/missing_value/_FillValue if the values of those attributes cannot be safely cast to the variable dtype, and instead issuing a warning. Does this work for you? If so, I'll expedite a release. If not, let's discuss what else could be done.

jswhit avatar Sep 19 '17 21:09 jswhit

Note that set_auto_mask can be used at any time (not just at variable creation time) to disable the use of valid_min/valid_max/missing_value to return masked arrays.

jswhit avatar Sep 19 '17 21:09 jswhit

> Note that set_auto_mask can be used at any time (not just at variable creation time) to disable the use of valid_min/valid_max/missing_value to return masked arrays.

Thanks. I think this is what I was looking for.

willirath avatar Sep 20 '17 08:09 willirath

> Pull request #708 (which will be in 1.3.0) tries to address this by not using valid_min/valid_max/missing_value/_FillValue if the values of those attributes cannot be safely cast to the variable dtype, and instead issuing a warning. Does this work for you? If so, I'll expedite a release. If not, let's discuss what else could be done.

This definitely is a good thing. I'm more worried about cases where data sets suddenly stop working, though.

(Disclaimer: I'm definitely biased towards pessimism about this feature after spending a day debugging weird behaviour of numpy.ma until I realised that someone thought it's a good idea to put valid_min=0 into a latitude field. :) )

willirath avatar Sep 20 '17 10:09 willirath

missing_value and _FillValue were always cast to the variable dtype - in 1.2.9 valid_min and valid_max were implemented. The fact that this suddenly broke so many files for you suggests that folks have been mis-using valid_min and valid_max more often than missing_value for some reason.

jswhit avatar Sep 20 '17 12:09 jswhit

> Note that set_auto_mask can be used at any time (not just at variable creation time) to disable the use of valid_min/valid_max/missing_value to return masked arrays.

Thanks. I think this is what I was looking for.

From a quick test with ASCAT wind data (which also seem to suffer from #703), it looks like Dataset.set_auto_mask(False) also disables masking according to the missing_value attribute?

willirath avatar Sep 21 '17 16:09 willirath

@jswhit: I'm quite busy atm, so I'll just pin netCDF4 to v1.2.8 in all my envs to be safe in the short run.

If you're interested, though, I'd definitely be willing to contribute some regression tests and data sets suffering from either broken valid ranges or from #703, to assess more formally how to deal with these edge cases.

willirath avatar Sep 21 '17 16:09 willirath

Yes, Dataset.set_auto_mask(False) disables all masking. I have a regression test already for broken valid_range.

jswhit avatar Sep 21 '17 17:09 jswhit

We are also affected by the change in 1.2.9 concerning the automatic masking of values outside of the valid_range. I agree that the new behavior is the right way to handle these attributes but, unfortunately, we have to deal with lots of files where these attributes are set wrongly. Setting set_auto_mask to False is not a good option for us, because this also disables the automatic conversion for the _FillValue. Re-implementing all the logic of handling missing values in the client code does not seem like a good idea, since it is already handled nicely in the library.

Would it therefore be possible to provide an additional function that disables only the newly implemented valid_range masking? I would prefer it if that were not needed, but I cannot think of any elegant way of making all these files work again ... Thanks for considering!
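For illustration only, a client-side stopgap could look roughly like the sketch below. `mask_fill_only` is a hypothetical helper, not part of netCDF4. It assumes the data were read with automatic masking disabled and that the attribute values can be cast to the data dtype, and it does not handle NaN fill values:

```python
import numpy as np


def mask_fill_only(data, attrs):
    """Mask values equal to _FillValue or missing_value; ignore valid_* attributes.

    Hypothetical helper: `data` is a plain ndarray, `attrs` a dict of
    variable attributes.
    """
    mask = np.zeros(data.shape, dtype=bool)
    for name in ("_FillValue", "missing_value"):
        if name in attrs:
            # missing_value may be a scalar or a vector of values.
            for value in np.atleast_1d(attrs[name]).astype(data.dtype):
                mask |= data == value
    return np.ma.MaskedArray(data, mask=mask)


# Made-up latitude data with a bogus valid_min and a -999 fill value.
data = np.array([-60.0, -999.0, 30.0, 60.0], dtype="f4")
attrs = {"_FillValue": np.float32(-999.0), "valid_min": np.float32(0.0)}
result = mask_fill_only(data, attrs)
```

With a real file, the attribute dict could be built from the variable's `ncattrs()` and `getncattr()`. For packed variables (scale_factor/add_offset), the comparison has to happen on the unscaled values, so `set_auto_maskandscale(False)` would be the appropriate switch and the scaling would need to be applied afterwards.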

floogit avatar May 02 '18 10:05 floogit