
jupyter repr caching deleted netcdf file

Open michaelaye opened this issue 3 years ago • 9 comments

What happened:

Testing xarray data storage in a Jupyter notebook with varying data sizes and storing to a netCDF file, I noticed that open_dataset/open_dataarray (both show this behaviour) continue to return data from the first testing run, ignoring the fact that each run deletes the previously created netCDF file. This only happens once the repr was used to display the xarray object. But once in this error mode, even objects that previously printed fine show the wrong data.

This was hard to track down, as it depends on the precise execution sequence in Jupyter.

What you expected to happen:

When I use open_dataset/open_dataarray, the resulting object should reflect reality on disk.

Minimal Complete Verifiable Example:

import xarray as xr
from pathlib import Path
import numpy as np

def test_repr(nx):
    # Write a fresh random array of length nx, replacing any existing file.
    ds = xr.DataArray(np.random.rand(nx))
    path = Path("saved_on_disk.nc")
    if path.exists():
        path.unlink()
    ds.to_netcdf(path)
    return path

When executed in a cell with print for display, all is fine:

test_repr(4)
print(xr.open_dataset("saved_on_disk.nc"))
test_repr(5)
print(xr.open_dataset("saved_on_disk.nc"))

but as soon as one cell uses the Jupyter repr:

xr.open_dataset("saved_on_disk.nc")

all subsequent file reads, even after executing the test function again, and even when using print rather than the repr, show the data from the last repr use.

Anything else we need to know?:

Here's a notebook showing the issue: https://gist.github.com/05c2542ed33662cdcb6024815cc0c72c

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 5.4.0-40-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4

xarray: 0.16.0 pandas: 1.0.5 numpy: 1.19.0 scipy: 1.5.1 netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: None cftime: 1.2.1 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.5 cfgrib: None iris: None bottleneck: None dask: 2.21.0 distributed: 2.21.0 matplotlib: 3.3.0 cartopy: 0.18.0 seaborn: 0.10.1 numbagg: None pint: None setuptools: 49.2.0.post20200712 pip: 20.1.1 conda: installed pytest: 6.0.0rc1 IPython: 7.16.1 sphinx: 3.1.2

michaelaye avatar Jul 21 '20 02:07 michaelaye

Thanks for the clear example!

This happens due to xarray's caching logic for files: https://github.com/pydata/xarray/blob/b1c7e315e8a18e86c5751a0aa9024d41a42ca5e8/xarray/backends/file_manager.py#L50-L76

This means that when you open the same filename, xarray doesn't actually reopen the file from disk -- instead it points to the same file object already cached in memory.

I can see why this could be confusing. We do need this caching logic for files opened from the same backends.*DataStore class, but the cache key could include some sort of unique identifier (e.g., from uuid) to ensure each separate call to xr.open_dataset results in a separately cached/opened file object: https://github.com/pydata/xarray/blob/b1c7e315e8a18e86c5751a0aa9024d41a42ca5e8/xarray/backends/netCDF4_.py#L355-L357
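To illustrate the mechanism, here is a minimal stdlib sketch (not xarray's actual implementation; `cached_open`, the dict `FILE_CACHE`, and the `unique` flag are all made up for this example). Keying the cache on the open arguments alone means a second open of the same path returns the stale cached handle, while mixing a per-call uuid token into the key gives every call a fresh entry:

```python
import uuid

# Stand-in for xarray's global file cache, keyed on the open arguments.
FILE_CACHE = {}

def cached_open(path, unique=False):
    # With only the path in the key, repeated opens hit the same entry;
    # a per-call uuid token makes every key unique, forcing a reopen.
    key = (path, uuid.uuid4().hex) if unique else (path,)
    if key not in FILE_CACHE:
        FILE_CACHE[key] = object()  # stands in for a real file handle
    return FILE_CACHE[key]

h1 = cached_open("saved_on_disk.nc")
h2 = cached_open("saved_on_disk.nc")
assert h1 is h2  # same stale handle, even if the file on disk changed

h3 = cached_open("saved_on_disk.nc", unique=True)
h4 = cached_open("saved_on_disk.nc", unique=True)
assert h3 is not h4  # each call gets its own freshly opened handle
```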

shoyer avatar Jul 25 '20 01:07 shoyer

is there a workaround for forcing the opening without restarting the notebook?

michaelaye avatar Jul 25 '20 01:07 michaelaye

Now I'm wondering why the caching logic is only activated by the repr. As you can see, when printed, the object always updated to the state on disk.

michaelaye avatar Jul 25 '20 01:07 michaelaye

Probably the easiest workaround is to call .close() on the original dataset. Failing that, the file is cached in xarray.backends.file_manager.FILE_CACHE, which you could muck around with.

I believe it only gets activated by repr() because array values from the netCDF file are loaded lazily. Not 100% sure without more testing, though.
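Both workarounds can be sketched as follows (assuming the `saved_on_disk.nc` file from the example above and an installed netCDF engine; `FILE_CACHE` is the module-level cache mentioned above):

```python
import numpy as np
import xarray as xr
from xarray.backends.file_manager import FILE_CACHE

# Create the file from the example above.
xr.DataArray(np.random.rand(4)).to_netcdf("saved_on_disk.nc")

# Workaround 1: close the dataset, releasing its cached file handle so
# the next open_dataset call re-reads the file from disk.
ds = xr.open_dataset("saved_on_disk.nc")
ds.close()

# Workaround 2 (heavier hammer): wipe xarray's global file-handle cache.
FILE_CACHE.clear()
```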

shoyer avatar Jul 25 '20 02:07 shoyer

Would it be an option to consider the time stamp of the file's last change as a caching criterion?
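A rough sketch of that idea (a pure-Python illustration, not xarray code; `open_cached` and `CACHE` are made up for this example): include the file's last-modified timestamp in the cache key, so any change on disk produces a new key and forces a reopen:

```python
import os

CACHE = {}

def open_cached(path):
    # Keying on (path, mtime) invalidates the entry whenever the file
    # is rewritten, since the new timestamp yields a new cache key.
    key = (path, os.stat(path).st_mtime_ns)
    if key not in CACHE:
        CACHE[key] = object()  # stands in for a real file handle
    return CACHE[key]
```

One caveat with any mtime-based scheme is filesystem timestamp granularity: two writes within the same clock tick would share a key and still hit the stale entry.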

markusritschel avatar Aug 19 '20 13:08 markusritschel

I've stumbled over this weird behaviour many times and wondered why it happens. AFAICT @shoyer hit the nail on the head, but the root cause is that the Dataset somehow gets added to the notebook namespace if one just evaluates it in a cell.

This doesn't happen if you invoke the __repr__ via

display(xr.open_dataset("saved_on_disk.nc"))

I've forced myself to use either print or display for xarray data. As this also happens if the Dataset is attached to a variable you would need to specifically delete (or .close()) the variable in question before opening again.

try:
    del ds
except NameError:
    pass
ds = xr.open_dataset("saved_on_disk.nc")
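A variant of the snippet above (a sketch, assuming the MCVE's saved_on_disk.nc and an installed netCDF engine): since deleting the name alone does not evict the cached file handle, explicitly closing first is the safer pattern:

```python
import numpy as np
import xarray as xr

# Create the file from the example above.
xr.DataArray(np.random.rand(4)).to_netcdf("saved_on_disk.nc")

try:
    ds.close()  # release the cached file handle, if ds is already defined
except NameError:
    pass
ds = xr.open_dataset("saved_on_disk.nc")
```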

kmuehlbauer avatar Jan 21 '21 15:01 kmuehlbauer

I have a tentative fix for this in https://github.com/pydata/xarray/pull/4879. It would be great if someone could give this a try to verify that it resolves the issue.

shoyer avatar Feb 07 '21 21:02 shoyer

+1. Complicated, and still vexing this user a year+ later, but it is easier for me to just restart the kernel again and again than to read this and #4879, which is closed but doesn't seem to have succeeded, if I read correctly?

brianmapes avatar Sep 27 '22 02:09 brianmapes

Running xarray.backends.file_manager.FILE_CACHE.clear() fixed the issue for me. I couldn't find any other way to stop xarray from pulling up some old data from a newly saved file. I'm using the h5netcdf engine with xarray version 2022.6.0 by the way.

mullenkamp avatar Oct 04 '22 20:10 mullenkamp