Clarify and test the use of xarray(dataset) serialization to netCDF
See the discussion in https://github.com/bluesky/tiled/pull/560#issuecomment-1702776738 -- recapped here.
We should add a test that covers the use of `serialization.xarray.serialize_netcdf()`. It might also resolve some ambiguity about how that coroutine should be used (or possibly re-implemented).

- How should I use the result of `serialize_netcdf()` to re-create the dataset? @martintb tried a few ways and proposed one option for a workaround.
- The byte string returned by `xarray.Dataset.to_netcdf()` is not quite equal to the value in the stream buffer that gets returned by `serialize_netcdf()` -- the final assertion below fails. Most bytes are the same, but you can see where they differ by pasting this loop into `serialize_netcdf()`.
- Why does `serialize_netcdf()` return a memory view into the buffer of the freshly created `BytesIO` stream? A memory view would be efficient if the file/stream existed before the call to `serialize_netcdf()`, but that is not the case here. Per the `xarray.Dataset.to_netcdf()` documentation, we could instead have that call directly return a byte string by omitting the `path` parameter.
```python
assert len(bytes_) == len(bytes(file.getbuffer()))  # true, for all cases tested so far
for i, (a, b) in enumerate(zip(bytes_, bytes(file.getbuffer()))):
    if a != b:
        print(f"<{i}> [{a}]_[{b}]")
assert bytes_ == bytes(file.getbuffer())  # fails
```
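For context on the first question, a round trip that relies only on the documented `to_netcdf()` return value looks like this. This is a sketch (not necessarily the workaround @martintb proposed) and assumes the scipy netCDF3 engine is available:

```python
import io

import xarray as xr

# Build a small dataset to serialize.
ds = xr.Dataset({"temperature": ("time", [1.0, 2.0, 3.0])})

# With no path argument, to_netcdf() uses the scipy (netCDF3) backend
# and returns the serialized file as a byte string.
payload = ds.to_netcdf()

# Re-create the dataset by wrapping the bytes in a file-like object.
roundtripped = xr.open_dataset(io.BytesIO(payload))
```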
Okay, so I've mostly figured out the issue but it's a little tedious.

BLUF: `tiled` should just use the byte string returned by `xarray.Dataset.to_netcdf()`.
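If we go that route, the core of the serializer collapses to a one-liner. This is only a sketch of the shape of the change -- the real `serialize_netcdf()` in `tiled` is a coroutine with a different signature -- and it assumes the scipy (netCDF3) backend:

```python
import xarray as xr


def serialize_netcdf(dataset: xr.Dataset) -> bytes:
    # Omitting the path argument makes to_netcdf() return the serialized
    # bytes directly -- no intermediate BytesIO, no memoryview, and no
    # exposure to the double-close behavior described below.
    return dataset.to_netcdf()
```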
The xarray/scipy netCDF writer combo is weird: it calls `.close()` multiple times on the `scipy.io.netcdf_file` object, which isn't blocked by the modification to the close method in `_BytesIOThatIgnoresClose`. The multiple calls to `.close()` cause multiple calls to `.flush()` and `._write()`, which butcher the contents of the file-like object.
The lines below show the core of the problem in the xarray codebase. In `xarray.backends.api.to_netcdf()`, `target.getvalue()` is returned on L1268 and then `store.close()` is called on L1271.

https://github.com/pydata/xarray/blob/f13da94db8ab4b564938a5e67435ac709698f1c9/xarray/backends/api.py#L1252C1-L1271
This `close()` propagates all the way down (through three layers of abstraction) to the `scipy.io.netcdf_file` object, which has a close method that looks like this:

https://github.com/scipy/scipy/blob/686422c4f0a71be1b4258309590fd3e9de102e18/scipy/io/_netcdf.py#L289-L314
Unfortunately, the `netcdf_file` authors also decided to alias `__del__` to `.close`, so that when the `.to_netcdf()` method returns on L1268, the garbage collector collects the `store` object (which contains the closed `netcdf_file` object) and triggers a second write to the file buffer. This should probably raise an exception, but instead it just butchers the buffer contents.
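The failure mode is easy to reproduce without scipy at all. The toy writer below (illustrative only, not scipy's actual code) re-serializes from offset zero on every flush and aliases `__del__` to `close`; the post-close mutation stands in for whatever state change happens during teardown in the real writer:

```python
import io


class NaiveWriter:
    """Toy writer: no 'already closed' guard, and __del__ aliased to close."""

    def __init__(self, buf):
        self.buf = buf
        self.records = [b"a", b"b"]

    def flush(self):
        # Re-serializes everything from offset 0 on every call.
        self.buf.seek(0)
        self.buf.write(b"".join(self.records))

    def close(self):
        self.flush()  # runs again on *every* close, including from __del__

    __del__ = close


buf = io.BytesIO()
writer = NaiveWriter(buf)
writer.close()                 # first (intentional) write
captured = buf.getvalue()      # snapshot at close time: b"ab"
writer.records.append(b"c")    # internal state mutates after "close"
del writer                     # __del__ -> close -> flush: a second write
```

After `del writer`, the buffer no longer matches the snapshot that was captured at close time -- the same mismatch the failing assertion above exposes.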
I could submit a pull request to scipy that attempts to harden and simplify their close/sync/write behavior. Honestly, all they need is an `is_closed` boolean and some logic in the flush/write methods. In the short term, it seems easier for `tiled` to work with the byte string returned by `to_netcdf()`, as that should be guaranteed to recreate the dataset (otherwise why return it at all?).
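The proposed hardening is just as small. A sketch of the `is_closed` guard applied to the toy writer above (again illustrative, not actual scipy code): once closed, flush becomes a no-op, so the eventual `__del__` can no longer touch the buffer:

```python
import io


class GuardedWriter:
    """Toy writer hardened with an explicit closed flag."""

    def __init__(self, buf):
        self.buf = buf
        self.records = [b"a", b"b"]
        self.is_closed = False

    def flush(self):
        if self.is_closed:
            return  # no-op after close: __del__ can't rewrite the buffer
        self.buf.seek(0)
        self.buf.write(b"".join(self.records))

    def close(self):
        self.flush()
        self.is_closed = True

    __del__ = close


buf = io.BytesIO()
writer = GuardedWriter(buf)
writer.close()
captured = buf.getvalue()      # snapshot at close time: b"ab"
writer.records.append(b"c")    # later mutation is now harmless
del writer                     # __del__ -> close -> flush: a no-op
```

With the guard in place, the buffer still equals the snapshot after garbage collection.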