xarray to_netcdf broken encoding: dtype='S1' + chunksizes

import xarray

xarray.Dataset({'x': ['foo', 'bar', 'baz']}).to_netcdf(
    'foo.nc', engine='h5netcdf',
    encoding={'x': {'dtype': 'S1', 'zlib': True, 'chunksizes': (2, )}})

ValueError: "chunks" must have same rank as dataset shape

Same with engine='netcdf4'. The issue is present in 0.10.6 as well as in 0.10.3. The problem is apparently that dtype='S1' changes the shape of the variable before passing it to the backend, but does not adjust any chunksizes setting to match the new shape.
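To illustrate the rank mismatch, here is a rough pure-Python sketch of what a string-to-character conversion does to the shape. The helper `to_char_rows` is made up for this illustration and is not xarray's actual code; it only assumes that each string is split into single characters along a new trailing dimension.

```python
def to_char_rows(strings):
    """Illustrative only: split strings into rows of single-byte characters."""
    encoded = [s.encode("utf-8") for s in strings]
    width = max(len(b) for b in encoded)
    # Pad with NUL bytes so that every row has the same width
    padded = [b.ljust(width, b"\x00") for b in encoded]
    return [[row[i:i + 1] for i in range(width)] for row in padded]


values = ["foo", "bar", "baz"]
chars = to_char_rows(values)

original_shape = (len(values),)              # shape the user sees
stored_shape = (len(chars), len(chars[0]))   # shape after splitting into characters

chunksizes = (2,)  # as given in the encoding above
# The user-supplied chunksizes has rank 1, the stored variable has rank 2
rank_matches = len(chunksizes) == len(stored_shape)
```

Under this assumption the variable gains a second dimension, so a one-element chunksizes no longer has the same rank as the data, which is consistent with the ValueError above.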

The workaround is to omit chunksizes or set it to True.

Jun 07 '18 23:06 crusaderky

It looks like this version works:

xarray.Dataset({'x': ['foo', 'bar', 'baz']}).to_netcdf(
    'foo.nc', engine='h5netcdf',
    encoding={'x': {'dtype': 'S1', 'zlib': True, 'chunksizes': (2, 3)}})

I suppose we could update chunksizes to accept both versions? Or just clearly document this behavior?

Jun 08 '18 01:06 shoyer

IMHO the trick that alters the shape of the array is strictly an implementation detail which should not be exposed to the end user. If the implementation of xarray alters the shape of the variable, it should also alter anything that relies on that shape. So I think that chunksizes=(2, 3) should not be accepted as a valid input.
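A minimal sketch of what I mean, assuming the conversion appends exactly one trailing character dimension. The function name `expand_chunksizes` and its parameters are hypothetical, not an existing xarray API, and chunking the character dimension as a whole is just one possible choice.

```python
def expand_chunksizes(chunksizes, original_shape, char_len):
    """Hypothetical helper: adapt user-facing chunksizes to the char-array shape.

    chunksizes is validated against the shape the user sees, then extended
    with one entry for the trailing character dimension.
    """
    # Leave "no explicit chunking" values untouched
    if chunksizes is None or chunksizes is True:
        return chunksizes
    chunksizes = tuple(chunksizes)
    if len(chunksizes) != len(original_shape):
        raise ValueError(
            "chunksizes must have the same rank as the variable: "
            f"got {len(chunksizes)}, expected {len(original_shape)}"
        )
    # Assumption: store each string's characters in a single chunk
    return chunksizes + (char_len,)


internal_chunks = expand_chunksizes((2,), original_shape=(3,), char_len=3)
```

With something like this, the user would keep writing chunksizes=(2, ) and the extra dimension would stay internal.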

Jun 14 '18 13:06 crusaderky

As part of keeping our issue count <1000, closing as unlikely to inspire change; please reopen if anyone disagrees.

Aug 28 '24 18:08 max-sixty