Simple zarr save/load of dataset/datatree deletes contents of variables

Open flyingfalling opened this issue 6 months ago • 11 comments

What happened?

This is basically a failure to round-trip through zarr.

After saving, loading, and then re-saving through zarr, the contents of "string" type coordinates or variables are deleted (replaced with empty strings).

I have been banging my head against this for about a month, because certain "get" functions in xarray (e.g. to_numpy, as_numpy) seem to have side effects.

(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> pip freeze | grep xarray
xarray==2025.6.1
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> pip freeze | grep numpy
numpy==2.2.6
(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> pip freeze | grep zarr
zarr==2.18.3

Here is a minimal example:

import xarray as xr

mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)

fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# REV: If we call to_numpy() here, before saving the second time, for some reason it is saved properly...
# print(ds1['col2'].to_numpy())
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: this is deleted.

Output is:

(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py 
['' '' '']

Note that a simple modification (reading the data with to_numpy or as_numpy between the first load and the second save) makes the problem disappear:


import xarray as xr

mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)

fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# REV: If we call to_numpy() here, before saving the second time, for some reason it is saved properly...
print(ds1['col2'].to_numpy())
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # WORKS FINE

Output:

(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py 
['aa' 'be' 'cefe']
['aa' 'be' 'cefe']

Numeric columns seem unaffected in this example, although I have observed situations where they too disappear or are filled with random memory garbage:

import xarray as xr

mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)

fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# REV: If we call to_numpy() here, before saving the second time, for some reason it is saved properly...
# print(ds1['col2'].to_numpy())
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col1'].to_numpy())  # WORKS FINE

Output:

(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py 
[1 2 3]

For some reason this only happens when the dimension names differ from the variable names. If I do not set the dimensions of the contents, or give each variable its own independent dimension with the same name as the variable, then everything works fine (of course, this is only relevant for 1D array variables):

import xarray as xr

mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('col1', [1, 2, 3]),
        col2=('col2', ['aa', 'be', 'cefe']),
    )
)

fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# REV: If we call to_numpy() on the dataarray here, before saving the second time, for some reason it is saved properly...
# print(ds1['col2'].to_numpy())
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # WORKS FINE

Output:

(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py 
['aa' 'be' 'cefe']

Note that everything else in the dataset is correct; only the data itself is deleted (replaced with empty strings...).

import xarray as xr

mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)

print(ds1)

fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# REV: If we call to_numpy() here, before saving the second time, for some reason it is saved properly...
# print(ds1['col2'].to_numpy())
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: this is deleted.
print(ds1)

Output:

(marmovenv) riveale@rvzen13s:~/richard_home/git/marmostim/pymarmostim/tests> python zarr_roundtrip.py 
<xarray.Dataset> Size: 72B
Dimensions:  (mydim: 3)
Dimensions without coordinates: mydim
Data variables:
    col1     (mydim) int64 24B 1 2 3
    col2     (mydim) <U4 48B 'aa' 'be' 'cefe'
['' '' '']
<xarray.Dataset> Size: 72B
Dimensions:  (mydim: 3)
Dimensions without coordinates: mydim
Data variables:
    col1     (mydim) int64 24B ...
    col2     (mydim) <U4 48B '' '' ''

What did you expect to happen?

The data contents should not be deleted (currently all strings become empty strings ''), i.e. the dataset should round-trip correctly through zarr.

Minimal Complete Verifiable Example

import xarray as xr

mode = 'w'
ds1 = xr.Dataset(
    data_vars=dict(
        col1=('mydim', [1, 2, 3]),
        col2=('mydim', ['aa', 'be', 'cefe']),
    )
)

fname = 'testds.zarr'
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
# REV: If we call to_numpy() here, before saving the second time, for some reason it is saved properly...
# print(ds1['col2'].to_numpy())
ds1.to_zarr(fname, mode=mode)
ds1 = xr.open_dataset(fname)
print(ds1['col2'].to_numpy())  # REV: this is deleted.

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output


Anything else we need to know?

No response

Environment

flyingfalling avatar Jul 02 '25 07:07 flyingfalling

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Jul 02 '25 07:07 welcome[bot]

I'm realizing that this likely has something to do with the fact that a plain open_dataset() opens the dataset lazily, without actually reading the contents. Is this the reason for the problem?

Writing "to_zarr" on a lazily loaded dataset or datatree reference should, in my opinion, actually read the data contents and write it to the new specified location. But it seems like something else is happening (data contents not read, but metadata/dimensions are?).

But, maybe this is intended behavior? For some extreme situations with compute() being required or something?

flyingfalling avatar Jul 02 '25 07:07 flyingfalling

A workaround is to simply call load() every time one calls open_dataset() (or accesses a dataset from an open_datatree()).

It seems the other attributes etc. are saved properly.

flyingfalling avatar Jul 02 '25 07:07 flyingfalling

This sounds frustrating 😕

> Writing "to_zarr" on a lazily loaded dataset or datatree reference should, in my opinion, actually read the data contents and write it to the new specified location.

Your understanding is correct. If it's not doing that (and compute=True) then something weird and incorrect is going on. to_zarr(compute=True) literally calls load...

> But it seems like something else is happening (data contents not read, but metadata/dimensions are?).

This sounds more like the intended behaviour for compute=False.

> the contents of "string" type coordinates or variables is deleted.

The empty strings imply this is some kind of fill_value bug.

TomNicholas avatar Jul 02 '25 16:07 TomNicholas

it looks like this has something to do with zarr=2 vs zarr=3 (upgrading to zarr>=3 properly reads back the data), but not sure if that's a bug within zarr-python=2 or in the way xarray interacts with it
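
Since the behaviour appears to depend on the zarr-python major version, it can help to confirm what is actually installed in the environment. A small sketch using only the standard library; the helper name installed_versions is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("xarray", "zarr", "numpy")):
    """Return a dict mapping package name to version string (None if missing)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
print(versions)

zarr_version = versions["zarr"]
if zarr_version is not None:
    # The leading component distinguishes zarr-python 2.x from 3.x.
    print("zarr major version:", int(zarr_version.split(".")[0]))
```

xarray also provides xr.show_versions(), which prints a fuller environment report of the kind the issue template asks for.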

keewis avatar Jul 02 '25 16:07 keewis

Sorry for the delay, I used my workaround and was doing other things.

I'm confused by your "zarr=2" vs "zarr=3".

Do you mean that my zarr is version 2 (2.18.3 in the example) and that this is not supported?

Do I need to update to zarr 3? (Note that zarr=3 requires python>=3.11, which breaks a lot of other things that require python<=3.10.)

flyingfalling avatar Jul 22 '25 09:07 flyingfalling

Thank you for the responses, however. Yes, I am calling to_zarr with default parameters, and the default is compute=True.

flyingfalling avatar Jul 22 '25 09:07 flyingfalling

> Do you mean that my zarr is version 2 (2.18.3 in the example) and that this is not supported?

yes, I think your version is zarr=2 since you're seeing this error, but we didn't drop zarr=2 yet so it is still supported.

What I meant was that this looks like a bug, either in xarray or zarr=2, and that we'd be open to fixing it (xarray, at least, I don't know the support status of zarr=2).

However, xarray dropped python=3.10 as well, so you won't have much more luck with xarray, either. You could manually backport the fix once it exists and build the wheel yourself, but as time goes on this will put more and more of a burden on yourself so at some point I'd recommend upgrading those other things you mentioned.

keewis avatar Jul 28 '25 10:07 keewis

I see, that's unfortunate, since python 3.10 is the distro Python in Ubuntu 22.04 (I imagine many supercomputer/workstation clusters are still running this), and a lot of other packages are still <=3.10 due to large syntax changes in 3.11.

But I understand. I'll try newer versions and see what works, but if you are dropping python 3.10 you should also drop zarr=2, since the two are equivalent (python<=3.10 is required for zarr=2).

flyingfalling avatar Oct 17 '25 00:10 flyingfalling

I'd argue that the distro version of python is supposed to only be used for the tools provided by that distro.

Either way, zarr-python=2.18.7 also requires python>=3.11, so I don't think that's much of an issue. I don't know how long we'll support zarr-python=2, though, it seems that as long as someone is willing to put in the work / pay someone to do it, versions of zarr-python=2 will keep being released.

keewis avatar Oct 22 '25 09:10 keewis

Thanks for your work!

Yea, I don't really care about zarr=2 or zarr=3 support, it just so happens that in this case it led to my bug. In general I drop old stuff quickly in my projects as well hah.

I went ahead and got python3.11 and moved to zarr3 so the point is moot in my case!

flyingfalling avatar Oct 22 '25 11:10 flyingfalling