xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Reading data after saving data from masked arrays results in different numbers

Open adriaat opened this issue 1 year ago • 8 comments

I hit an issue that can go silent for many users.

I am masking my data using dask.array.ma.masked_invalid (my dataset is much larger than my memory). When saving the results and loading them again, the data is changed. In particular, 0. is assigned to those elements that where masked instead of np.nan or fill_value, which is 1e+20.

Below an example that illustrates the issue:

In [1]: import dask.array as da
   ...: import numpy as np
   ...: import xarray as xr

In [2]: ds = xr.DataArray(
   ...:     da.stack(
   ...:         [da.from_array(np.array([[np.nan, np.nan], [np.nan, 2]])) for _ in range(
   ...: 10)],
   ...:         axis=0
   ...:     ).astype('float32'),
   ...:     dims=('time', 'lat', 'lon')
   ...: ).to_dataset(name='mydata')

In [3]: # Mask my data

In [4]: ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed') 

In [5]: ds.mean('time').compute()
Out[5]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B nan nan nan 2.0

In [6]: # Write to file

In [7]: ds.mean('time').to_netcdf('foo.nc')

In [8]: # Read foo.nc

In [9]: foo = xr.open_dataset('foo.nc')

In [10]: foo.compute()
Out[10]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B 0.0 0.0 0.0 2.0

I expected mydata to be either [np.nan, np.nan, np.nan, 2.0], numpy.MaskedArray, or [1e+20, 1e+20, 1e+20, 2.0], since:

In [11]: ds.mean('time')['mydata'].data.compute()
Out[11]: 
masked_array(
  data=[[--, --],
        [--, 2.0]],
  mask=[[ True,  True],
        [ True, False]],
  fill_value=1e+20,
  dtype=float32)

instead of [0.0, 0.0, 0.0, 2.0]. No warning is raised, and I do not get why the fill values are replaced by 0.

A way to address this for my data (but might not be true for all cases) is to use xarray.where before writing to file as

In [12]: xr.where(ds.mean('time'), ds.mean('time'), np.nan).to_netcdf('foo.nc')

In [13]: foo = xr.open_dataset('foo.nc')

In [14]: foo.compute()
Out[14]: 
<xarray.Dataset> Size: 16B
Dimensions:  (lat: 2, lon: 2)
Dimensions without coordinates: lat, lon
Data variables:
    mydata   (lat, lon) float32 16B nan nan nan 2.0

or mask the data in some other way, e.g. using xarray.where in the beginning instead of xarray.apply_ufunc.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) [GCC 12.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.153.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0 pandas: 2.2.2 numpy: 2.1.1 scipy: None netCDF4: 1.7.1 pydap: None h5netcdf: None h5py: None zarr: None cftime: 1.6.4 nc_time_axis: None iris: None bottleneck: None dask: 2024.9.0 distributed: 2024.9.0 matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2024.9.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 73.0.1 pip: 24.2 conda: None pytest: None mypy: None IPython: 8.27.0 sphinx: None

adriaat avatar Aug 17 '24 13:08 adriaat

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Aug 17 '24 13:08 welcome[bot]

I can't reproduce this in my environment, I get the expected masked values (thanks for the code snippet, though).

Can you please add the output of xr.show_versions() to the original post? I suspect this issue to be caused by a outdated, or possibly broken, environment.

keewis avatar Aug 28 '24 18:08 keewis

Done. I also tested that the problem persist as of 2024-09-16, at least with my setup, by creating a new conda environment, installing ipython, xarray, numpy, dask, and netCDF4 from the conda-forge channel, and executing the snippet in this clean environment.

adriaat avatar Sep 16 '24 09:09 adriaat

@adriaat I have problems understanding your use case, please bear with me. If you have NaN in your data, why do you want to apply an additional mask? Do I get it correct, that you want something like that:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.DataArray(
     da.stack(
         [da.from_array(np.array([[np.nan, np.nan], [np.nan, 2]])) for _ in range(10)], axis=0
     ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))

This will actually apply your wanted _FillValue to be used as on-disk value using CF attribute.

Update: added np.float32 wrapper

kmuehlbauer avatar Sep 16 '24 13:09 kmuehlbauer

@kmuehlbauer I may have confused you with my simple snippet, which I constructed to highlight the issue. To answer your question, my snippet is a simplification of a problem I worked with, where I had a mix of NaN's and inf's; in my problem they were equivalent for what I wanted to compute. If I did not apply any additional masking and simply called ds.mean('time'), then I would get inf in many places where I should get a finite values. We can think of other workarounds for not using the mask, i.e. replace inf's with NaN's, if the size of the dataset allows it.

In any case, what I wanted to share (and the reason for which I opened the issue) was to highlight that when mydata is the result of a masked array, then if it has NaNs these are saved (or read?) as 0 instead of NaNs. I might be missing something...

Using your snippet works, I think, because you define mydata as an array, not a masked array. If you instead use a masked array in your simple example, then we recreate the behaviour of my snippet:

ds = xr.DataArray(
     da.stack(
         [da.from_array(np.ma.masked_array(np.array([[np.nan, np.nan], [np.nan, 2]]), np.array([[True, True], [True, False]]))) for _ in range(10)], axis=0
     ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))
print(xr.open_dataset('foo.nc').compute())
# <xarray.Dataset>
# Dimensions:  (lat: 2, lon: 2)
# Dimensions without coordinates: lat, lon
# Data variables:
#    mydata   (lat, lon) float32 0.0 0.0 0.0 2.0

adriaat avatar Sep 17 '24 12:09 adriaat

Using your snippet works, I think, because you define mydata as an array, not a masked array. If you instead use a masked array in your simple example, then we recreate the behaviour of my snippet:

Thanks @adriaat, after some thinking I came to the conclusion, that there is more involved here. Digging for root-cause is way over my pay-grade. But I have one more workaround:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.DataArray(
         da.stack(
         [da.from_array(np.array([[np.nan, np.inf], [np.nan, 2]])) for _ in range(10)], axis=0
     ).astype('float32'), dims=('time', 'lat', 'lon')).to_dataset(name='mydata')
ds = xr.apply_ufunc(da.ma.masked_invalid, ds, dask='allowed')
ds = xr.apply_ufunc(da.ma.filled, ds, dask='allowed')
ds.mean('time').to_netcdf('foo.nc', encoding=dict(mydata=dict(_FillValue=np.float32(1e20))))

kmuehlbauer avatar Sep 17 '24 13:09 kmuehlbauer

when mydata is the result of a masked array, then if it has NaNs these are saved (or read?) as 0 instead of NaNs.

JFTR, they are saved as zero (0) on-disk. So this happens when writing out to disk. Somewhere in the computing step this went havoc. I've tried to follow the code-path, but failed.

kmuehlbauer avatar Sep 17 '24 13:09 kmuehlbauer

I noticed the plan to close label, but seems like we have an MCVE since then.

Are masked arrays something we support?

max-sixty avatar Oct 13 '24 17:10 max-sixty