xarray
xarray copied to clipboard
`nan` values appearing when saving and loading from `netCDF` due to encoding
What happened?
When writing to and reading my dataset from netCDF using ds.to_netcdf() and xr.open_dataset(...), xarray creates nan values where previously number values (float32) where.
The issue seems related to the encoding used for the original dataset, which causes the data to be stored as short. During loading, the stored values then collide with _FillValue leading to the numbers being interpreted as nan.
What did you expect to happen?
Values after saving & loading should be the same as before saving.
Minimal Complete Verifiable Example
We had a back-and-forth on SO about this, I hope it's fine to just refer to it here:
https://stackoverflow.com/a/75806771/11318472
MVCE confirmation
- [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
- [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
See the SO link above.
Anything else we need to know?
I'm not sure whether this should be considered a bug or just a combination of conflicting features. My current workaround is resetting the encoding and letting xarray decide to store as float instead of short (cf. https://github.com/pydata/xarray/issues/7686).
Environment
INSTALLED VERSIONS
commit: None python: 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0] python-bits: 64 OS: Linux OS-release: 5.15.90.1-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1
xarray: 2022.11.0 pandas: 1.5.2 numpy: 1.23.5 scipy: 1.10.0 netCDF4: 1.6.2 pydap: None h5netcdf: 1.1.0 h5py: 3.8.0 Nio: None zarr: 2.13.6 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: 1.3.3 cfgrib: None iris: None bottleneck: 1.3.5 dask: 2022.02.1 distributed: 2022.2.1 matplotlib: 3.6.2 cartopy: None seaborn: None numbagg: None fsspec: 2022.11.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 65.5.1 pip: 22.3.1 conda: None pytest: 7.2.0 IPython: 8.11.0 sphinx: None
Thanks for all the details, @euronion.
From what I can tell, everything is OK with the original file. It's using packed data: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Packed-Data-Values. The only thing what might be a bit off is why they didn't choose -32768 as _FillValue
As both scale_factor and add_offset are of dtype float64 in the original file the data should be unpacked to float64 according to NetCDF-specs.
The reason why this isn't done is because the
https://github.com/pydata/xarray/blob/020b4c07047189c5c788eca9e6e77d64b8989d58/xarray/conventions.py#L379-L384
CFMaskCoder will promote int16 to float32 unconditionally. This happens in dtypes.maybe_promote():
https://github.com/pydata/xarray/blob/020b4c07047189c5c788eca9e6e77d64b8989d58/xarray/core/dtypes.py#L67-L70
The CFScaleOffsetCoder itself is able to correctly convert this to the wanted dtype float64:
https://github.com/pydata/xarray/blob/e79eaf5acdcda62f27ce81f08e7e71839887d3d1/xarray/coding/variables.py#L235-L251
As this doesn't surface that often it might just happen here by accident. If the _FillValue/missing_value would be -32768 then the issue would not manifest.
Update: corrected to maybe_promote()
As this doesn't surface that often it might just happen here by accident. If the
_FillValue/missing_valuewould be-32768then the issue would not manifest.
So for NetCDF the default fillvalue for NC_SHORT (int16) is -32767. That means the promotion to float32 instead the needed float64 is the problem here (floating point precision).
MCVE:
fname = "test-7691.nc"
import netCDF4 as nc
with nc.Dataset(fname, "w") as ds0:
ds0.createDimension("t", 5)
ds0.createVariable("x", "int16", ("t",), fill_value=-32767)
v = ds0.variables["x"]
v.set_auto_maskandscale(False)
v.add_offset = 278.297319296597
v.scale_factor = 1.16753614203674e-05
v[:] = np.array([-32768, -32767, -32766, 32767, 0])
with nc.Dataset(fname) as ds1:
x1 = ds1["x"][:]
print("netCDF4-python:", x1.dtype, x1)
with xr.open_dataset(fname) as ds2:
x2 = ds2["x"].values
ds2.to_netcdf("test-7691-01.nc")
print("xarray first read:", x2.dtype, x2)
with xr.open_dataset("test-7691-01.nc") as ds3:
x3 = ds3["x"].values
print("xarray roundtrip:", x3.dtype, x3)
netCDF4-python: float64 [277.9147410535744 -- 277.9147644042972 278.67988586425815
278.297319296597]
xarray first read: float32 [277.91476 nan 277.91476 278.6799 278.29733]
xarray roundtrip: float32 [ nan nan nan 278.6799 278.29733]
I've confirmed that correctly promoting to float64 in CFMaskCoder solves this issue.
It would be good to merge some version of #6812. This seems to be pretty common
Review comments and PR remixes welcome!
@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.
@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.
Yes, the PR seems to solve my specific issue without changing the encoding in the netCDF file. Great, thanks!
, the PR seems to solve my specific issue without changing the encoding
Great, thanks for testing.
#7654 got closed without adding several changes to other PR. #8713 takes over.
@euronion Sorry for letting go #7654 out of focus. Would you mind looking/testing #8713 and let us know your thoughts over there? Much appreciated!
Hi @kmuehlbauer , thanks for looking into the issue again! I'm moving between jobs right now and have to re-setup my coding / testing locally. No promises on a fast response, maybe in 1-2 weeks.
Thanks @euronion, no worries, just have a look when you can spare the time.