
Clarify and test the use of xarray(dataset) serialization to netCDF

Open · padraic-shafer opened this issue · 1 comment

See the discussion in https://github.com/bluesky/tiled/pull/560#issuecomment-1702776738 -- recapped here.

We should add a test that covers the use of serialization.xarray.serialize_netcdf(). It might also resolve some ambiguity about how that coroutine should be used (or possibly re-implemented).

  1. How should I use the result of serialize_netcdf() to re-create the dataset? @martintb tried a few ways and proposed one option for a workaround.
  2. The byte string returned by xarray.Dataset.to_netcdf() is not quite equal to the value in the stream buffer that gets returned by serialize_netcdf() -- the final assertion below fails. Most bytes are the same, but you can see where they differ by pasting this loop into serialize_netcdf().
  3. Why does serialize_netcdf() return a memory view into the buffer of the freshly created BytesIO stream? A memory view would be efficient if the file/stream existed before the call to serialize_netcdf(), but that is not the case here. Per the xarray.Dataset.to_netcdf() documentation, that call could instead return a byte string directly by omitting the path parameter (see the sketch after the loop below).
        # (Pasted inside serialize_netcdf(): `bytes_` is presumably the value returned
        # by to_netcdf(), and `file` is the BytesIO buffer it wrote into.)
        assert len(bytes_) == len(bytes(file.getbuffer()))  # true, for all cases tested so far

        for i, (a, b) in enumerate(zip(bytes_, bytes(file.getbuffer()))):
            if a != b:
                print(f'<{i}> [{a}]_[{b}]')

        assert bytes_ == bytes(file.getbuffer())  # fails
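For reference, here is a sketch of what the round trip could look like if serialize_netcdf() simply returned the bytes from to_netcdf() (illustrative only; it may not match the workaround @martintb proposed in #560, and it assumes xarray's default in-memory serialization, which produces NETCDF3 bytes that the scipy backend can read back):

    import io
    import xarray as xr

    ds = xr.Dataset({"x": ("t", [1.0, 2.0, 3.0])})

    # Omitting the path makes to_netcdf() return the serialized bytes directly.
    payload = ds.to_netcdf()

    # Re-create the dataset from those bytes via a file-like object.
    restored = xr.open_dataset(io.BytesIO(payload))
    assert restored.identical(ds)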

padraic-shafer commented Sep 05 '23

Okay, so I've mostly figured out the issue but it's a little tedious.

BLUF: tiled should just use the byte string returned by xarray.Dataset.to_netcdf().

The xarray/scipy netCDF writer combo is weird: it calls .close() multiple times on the scipy.io.netcdf_file object, and those calls aren't blocked by the modified close method in _BytesIOThatIgnoresClose (which only intercepts close() on the underlying buffer). The multiple calls to .close() cause multiple calls to .flush() and ._write(), which butcher the contents of the file-like object.
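For anyone unfamiliar with that class: it is (roughly) a BytesIO whose close() is a no-op, so the buffer can still be read after the writer "closes" it. A minimal sketch of the idea, not necessarily tiled's exact implementation:

    import io

    class _BytesIOThatIgnoresClose(io.BytesIO):
        def close(self):
            # Deliberately do nothing, so the caller can still read the buffer
            # after the netCDF writer closes its file handle.
            pass

The catch, as noted above, is that this only intercepts close() on the buffer itself; it cannot stop netcdf_file from flushing into that buffer again.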

The lines below show the core of the problem in the xarray codebase. In xarray.backends.api.to_netcdf(), target.getvalue() is returned on L1268 and then store.close() is called on L1271.

https://github.com/pydata/xarray/blob/f13da94db8ab4b564938a5e67435ac709698f1c9/xarray/backends/api.py#L1252C1-L1271
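A minimal, self-contained toy (not xarray's code) showing the Python semantics at play: the expression in a return statement inside try is evaluated before the finally block runs, so a close() in finally can still write into the shared buffer after the snapshot has been taken. That is exactly the symptom in item 2 above, where the returned bytes and the buffer contents disagree.

    from io import BytesIO

    class ToyStore:
        def __init__(self, target: BytesIO):
            self.target = target

        def write(self, payload: bytes):
            self.target.seek(0)
            self.target.write(payload)

        def close(self):
            # Stand-in for the extra flush that store.close() triggers.
            self.write(b"rewritten on close")

    def to_netcdf_like(store: ToyStore, target: BytesIO) -> bytes:
        try:
            store.write(b"the real payload")
            return target.getvalue()   # snapshot evaluated here...
        finally:
            store.close()              # ...then finally runs and rewrites `target`

    buf = BytesIO()
    returned = to_netcdf_like(ToyStore(buf), buf)
    assert returned == b"the real payload"
    assert bytes(buf.getbuffer()) != returned  # buffer no longer matches the return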

This close() propagates all the way down (through three layers of abstraction) to the scipy.io.netcdf_file object, which has a close method that looks like this:

https://github.com/scipy/scipy/blob/686422c4f0a71be1b4258309590fd3e9de102e18/scipy/io/_netcdf.py#L289-L314

Unfortunately, the netcdf_file authors also decided to alias __del__ = close, so that when .to_netcdf() returns on L1268, the garbage collector collects the store object (which contains the already-closed netcdf_file object) and triggers a second write to the file buffer. This should probably raise an exception, but instead it just butchers the buffer contents.
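A toy illustration (again, not the actual scipy code) of why the buffer gets written twice: because the buffer's close() is a no-op, fp.closed stays False, so the guard in a netcdf_file-style close() never trips, and __del__ = close flushes once more at garbage-collection time with already-torn-down state.

    from io import BytesIO

    class _IgnoresClose(BytesIO):
        def close(self):
            pass  # buffer stays open and readable

    class ToyNetcdfFile:
        def __init__(self, fp):
            self.fp = fp
            self._state = b"header v1 + variable data"

        def flush(self):
            self.fp.seek(0)
            self.fp.write(self._state)

        def close(self):
            if not self.fp.closed:       # always True here: fp ignores close()
                self.flush()
                self._state = b"post-close state"  # teardown mutates the writer
                self.fp.close()          # no-op

        __del__ = close                  # GC closes (and flushes) a second time

    buf = _IgnoresClose()
    f = ToyNetcdfFile(buf)
    f.close()                            # explicit close, e.g. from xarray's finally
    snapshot = bytes(buf.getbuffer())
    del f                                # __del__ -> close() -> flush() again
    assert bytes(buf.getbuffer()) != snapshot  # second write butchered the contents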

I could submit a pull request to scipy that attempts to harden and simplify their close/sync/write behavior; honestly, all they need is an is_closed boolean and some logic in the flush/write methods. In the short term, it seems easier for tiled to work with the byte string returned by to_netcdf(), since that should be guaranteed to recreate the dataset (otherwise, why return it at all?).
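For completeness, the guard described above would look something like this (illustrative only, not an actual scipy patch; _write() stands in for the real serialization step):

    class GuardedNetcdfWriter:
        def __init__(self, fp):
            self.fp = fp
            self._is_closed = False

        def flush(self):
            if self._is_closed:
                return  # or raise ValueError("write to closed file")
            self._write()

        def close(self):
            if self._is_closed:
                return  # a second close() (e.g. via __del__) becomes a no-op
            try:
                self.flush()
            finally:
                self._is_closed = True
                self.fp.close()

        __del__ = close

        def _write(self):
            ...  # serialize header + data into self.fp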

martintb commented Sep 06 '23