
Writing to an open Dataset corrupts the Dataset's memory

Open williamsnell opened this issue 5 months ago • 10 comments

What happened?

When holding a reference to an open, lazily-loaded Dataset ds, overwriting the file that backs ds causes ds to reference invalid memory.

What's weird is that this happens when the Dataset has coordinates, and doesn't happen without them.

I understand this access pattern is a bit unusual, but I'm raising this bug in case it isn't expected behaviour.

What did you expect to happen?

I expected ds to continue existing in memory, with its original values. Failing that, I expected a warning or error, either when the file backing ds was overwritten or, more realistically, when I went to access ds after it had been overwritten and was no longer referencing valid memory.

Minimal Complete Verifiable Example

import xarray as xr

# Test without coordinates (works as expected)

print("Testing WITHOUT coordinates:")
xr.Dataset({"x": [1, 2], "y": [3, 4]}).to_netcdf("test1.nc")
lazy1 = xr.open_dataset("test1.nc")
# overwrite the file
xr.Dataset({"z": [5, 6]}).to_netcdf("test1.nc")

# lazy1 still holds valid data
print(f"Expected y: [3, 4], Actual: {lazy1['y'].values}")

# Test with coordinates (invalid memory access)
print("\nTesting WITH coordinates:")

xr.Dataset({"x": ("dim", [1, 2]), "y": ("dim", [3, 4])}, coords={"dim": [0, 1]}).to_netcdf("test2.nc")
lazy2 = xr.open_dataset("test2.nc")
# overwrite the file
xr.Dataset({"z": ("dim", [5, 6])}, coords={"dim": [0, 1]}).to_netcdf("test2.nc")

# Accesses invalid memory
print(f"Expected y: [3, 4], Actual: {lazy2['y'].values}")

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output


Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.11 (main, Mar 17 2025, 21:02:09) [Clang 20.1.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-60-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: ('en_NZ', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2025.6.1
pandas: 2.3.0
numpy: 2.2.6
scipy: 1.15.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.10.3
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

williamsnell avatar Jun 12 '25 05:06 williamsnell

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Jun 12 '25 05:06 welcome[bot]

Here's perhaps a more realistic use-case that triggered the bug:

import xarray as xr

xr.Dataset({"x": ("dim", [1.0, 2.0]), "y": ("dim", [3.0, 4.0])}, coords={"dim": [0, 1]}).to_netcdf("test.nc")

# Original data
ds1 = xr.open_dataset("test.nc")

# Some new data
ds2 = xr.Dataset({"z": ("dim", [5, 6])}, coords={"dim": [0, 1]})

# Merge them together
merged = xr.merge([ds1, ds2])

# print(merged["y"].values) # uncommenting this materializes `merged` and fixes the bug

# Write the data back to the file
merged.to_netcdf("test.nc")

# We still have a handle to `merged`, which perhaps should point to 
# the contents of the new file? Instead, it (partially) points to the old 
# file, which is now overwritten.
print(merged["y"].values) # => prints something like "[0.00000000e+000 8.48798317e-314]"


williamsnell avatar Jun 12 '25 05:06 williamsnell

Possibly related issue: #4280

kmuehlbauer avatar Jun 12 '25 06:06 kmuehlbauer

@williamsnell Thanks for the report and the examples.

This does not reproduce with engine="netcdf4" or engine="h5netcdf". On the second write I'm getting a Permission denied error, so this is only an issue for engine="scipy". I think I've seen similar issues with scipy on the tracker before.

kmuehlbauer avatar Jun 13 '25 07:06 kmuehlbauer

Ah interesting! I'll change engines and check I can't reproduce. Should I raise an issue in the scipy repository?

williamsnell avatar Jun 13 '25 08:06 williamsnell

@williamsnell Thanks for testing. If you can boil down an example with just scipy (taking xarray out of the equation) then an issue at scipy would be reasonable. Otherwise anything which helps to spot the origin within xarray is much appreciated.

IIRC the scipy backend normally loads all data into memory. I'm not sure how this works wrt xarray caching and lazy loading.
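For reference, a scipy-only reduction along those lines might look like the sketch below. It uses `scipy.io.netcdf_file` directly, with no xarray involved; the filename is illustrative, and whether you see stale values or garbage depends on the platform and on how the rewritten file happens to be laid out.

```python
import numpy as np
from scipy.io import netcdf_file

# Write a small classic-netCDF file with scipy only.
f = netcdf_file("test_scipy.nc", "w")
f.createDimension("dim", 2)
y = f.createVariable("y", "d", ("dim",))
y[:] = [3.0, 4.0]
f.close()

# Re-open read-only; scipy memory-maps the file by default,
# so the variable data is backed by the bytes on disk, not copied.
g = netcdf_file("test_scipy.nc", "r", mmap=True)
y_lazy = g.variables["y"]

# Overwrite the file in place while the mmap is still live.
f2 = netcdf_file("test_scipy.nc", "w")
f2.createDimension("dim", 2)
z = f2.createVariable("z", "d", ("dim",))
z[:] = [5.0, 6.0]
f2.close()

# The mapped bytes no longer correspond to the original variable,
# so this may show stale or garbage values rather than [3. 4.].
vals = np.array(y_lazy[:])
print(vals)
```

If that still misbehaves without xarray in the picture, it would support filing it upstream against scipy.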

kmuehlbauer avatar Jun 13 '25 08:06 kmuehlbauer

IIRC the scipy backend normally loads all data into memory. I'm not sure how this works wrt xarray caching and lazy loading.

It's about to become lazy thanks to @dcherian's changes in #10376.

I expected ds to continue existing in memory, with its original values. Failing that, I expected a warning or error, either when the file backing ds was overwritten or, more realistically, when I went to access ds after it had been overwritten and was no longer referencing valid memory.

Stepping back, I'm confused by this conversation, because I'm not sure any of these are reasonable expectations. How is xarray supposed to know that you've changed the file? If you open a reference to the file (i.e. create a pointer to the location of some bytes on disk), but don't load the data in yet (i.e. "lazy loading"), but then delete the file entirely before you loaded the data, we can't expect xarray to still be able to load the data. It doesn't exist anymore! Making arbitrary modifications to the file is no different from deleting it in this respect.

(Note this is all for lazy loaded data - if you called .load() on the whole dataset before modifying the file, then that's safe, because your dataset no longer even knows where it got that data.)
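To spell out that safe pattern, a minimal sketch (filename illustrative, default backend assumed): materialize the data with `.load()` and release the handle with `.close()` before touching the file on disk.

```python
import xarray as xr

xr.Dataset(
    {"y": ("dim", [3.0, 4.0])}, coords={"dim": [0, 1]}
).to_netcdf("safe.nc")

ds = xr.open_dataset("safe.nc")
ds.load()   # copy all variables into memory
ds.close()  # release the underlying file handle

# Overwriting the file is now harmless: ds no longer reads from disk.
xr.Dataset(
    {"z": ("dim", [5.0, 6.0])}, coords={"dim": [0, 1]}
).to_netcdf("safe.nc")

print(ds["y"].values)  # [3. 4.]
```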

If you want to be able to make arbitrary changes to your data on-disk whilst other applications such as xarray are currently (lazily) reading from the data, then you need an ACID database, and that's what Icechunk was designed to solve. Does any equivalent of that transactional functionality exist for the scipy repo?

a warning or error, either when the file backing ds was overwritten

I guess xarray could potentially check that the last altered time of the file is earlier than the time the file was opened, but this still feels like opening a can of worms. And Icechunk + VirtualiZarr can already do that, using the last_updated_at kwarg.
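As a rough illustration of what such a check could look like, here is a sketch of the mtime comparison. `check_unmodified` is a hypothetical helper, not an xarray API, and part of the can of worms is that mtime resolution and semantics vary by filesystem and OS.

```python
import os
import time

def check_unmodified(path, opened_at):
    # Hypothetical helper: fail loudly if `path` was modified on
    # disk after the timestamp at which it was opened.
    if os.path.getmtime(path) > opened_at:
        raise RuntimeError(f"{path} changed on disk after it was opened")

# Create a file and record when we "opened" it.
with open("watched.nc", "wb") as f:
    f.write(b"\x00" * 8)
opened_at = time.time()

time.sleep(0.05)  # let the clock advance past mtime resolution
with open("watched.nc", "wb") as f:  # overwrite, bumping the mtime
    f.write(b"\x01" * 8)

try:
    check_unmodified("watched.nc", opened_at)
except RuntimeError as err:
    print("caught:", err)
```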

TomNicholas avatar Jun 18 '25 06:06 TomNicholas

Hey, thanks for the reply!

I'm happy to revise what I'd expect to happen to the following:

  • when I write to an open file with the netcdf4 backend, I get a PermissionError
  • when I write to an open file with the scipy backend, I don't get any error. It keeps trucking (and IMO it shouldn't)

There is a lot about xarray's backend workings I don't understand well. In attempting to see if this is a scipy issue or an xarray issue, I noticed that the netcdf4 backend calls _acquire, which is where the PermissionError stems from.

However, I could not find an equivalent call for the scipy backend implementation. Again, I don't understand xarray well enough to say whether parity is a reasonable thing to expect between these two backends. But as a naive user, I think a PermissionError is a reasonable thing to throw, and I would expect it to always be thrown.

williamsnell avatar Jun 18 '25 06:06 williamsnell

Thanks for the clarification, that makes a lot more sense to me! That definitely sounds like a confusing and error-prone inconsistency between the behaviour of the two backends.

There is a lot about xarray's backend workings I don't understand well.

Join the club 😅 They do a lot.

TomNicholas avatar Jun 18 '25 06:06 TomNicholas

@TomNicholas @williamsnell I was already looking into that, trying to align scipy backend with the other backends wrt file locking. I'll wait for @dcherian's #10376 and see if we can improve.

I guess xarray could potentially check that the last altered time of the file is earlier than the time the file was opened, but this still feels like opening a can of worms. And Icechunk + VirtualiZarr can already do that, using the last_updated_at kwarg.

I've thought about checking file times, too, but it seems not only OS dependent but also feels not the right solution. Great to see this implemented downstream.

kmuehlbauer avatar Jun 18 '25 07:06 kmuehlbauer