xarray
Writing to an open Dataset corrupts the Dataset's memory
What happened?
When holding a reference to an open, lazily-loaded Dataset ds, overwriting the file that backs ds causes ds to reference invalid memory.
What's weird is that this happens when using "coords", and doesn't happen without them.
I understand this access pattern is a bit weird, but I'm raising this bug in case this isn't an expected behaviour.
What did you expect to happen?
I expected ds to continue existing in memory, with its original values. Failing that, I expected a warning or error, either when the file backing ds was overwritten or, more realistically, when I went to access ds after it had been overwritten and was no longer referencing valid memory.
Minimal Complete Verifiable Example
```python
import xarray as xr

# Test without coordinates (works as expected)
print("Testing WITHOUT coordinates:")
xr.Dataset({"x": [1, 2], "y": [3, 4]}).to_netcdf("test1.nc")
lazy1 = xr.open_dataset("test1.nc")

# overwrite the file
xr.Dataset({"z": [5, 6]}).to_netcdf("test1.nc")

# lazy1 still holds valid data
print(f"Expected y: [3, 4], Actual: {lazy1['y'].values}")

# Test with coordinates (invalid memory access)
print("\nTesting WITH coordinates:")
xr.Dataset({"x": ("dim", [1, 2]), "y": ("dim", [3, 4])}, coords={"dim": [0, 1]}).to_netcdf("test2.nc")
lazy2 = xr.open_dataset("test2.nc")

# overwrite the file
xr.Dataset({"z": ("dim", [5, 6])}, coords={"dim": [0, 1]}).to_netcdf("test2.nc")

# Accesses invalid memory
print(f"Expected y: [3, 4], Actual: {lazy2['y'].values}")
```
MVCE confirmation
- [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [x] Complete example — the example is self-contained, including all data and the text of any traceback.
- [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
No response
Environment
```
xarray: 2025.6.1
pandas: 2.3.0
numpy: 2.2.6
scipy: 1.15.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.10.3
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
```
Here's perhaps a more realistic use-case that triggered the bug:
```python
import xarray as xr

xr.Dataset({"x": ("dim", [1.0, 2.0]), "y": ("dim", [3.0, 4.0])}, coords={"dim": [0, 1]}).to_netcdf("test.nc")

# Original data
ds1 = xr.open_dataset("test.nc")

# Some new data
ds2 = xr.Dataset({"z": ("dim", [5, 6])}, coords={"dim": [0, 1]})

# Merge them together
merged = xr.merge([ds1, ds2])
# print(merged["y"].values)  # uncommenting this materializes `merged` and fixes the bug

# Write the data back to the file
merged.to_netcdf("test.nc")

# We still have a handle to `merged`, which perhaps should point to
# the contents of the new file? Instead, it (partially) points to the old
# file, which is now overwritten.
print(merged["y"].values)  # => prints something like "[0.00000000e+000 8.48798317e-314]"
```
Possibly related issue: #4280
@williamsnell Thanks for the report and the examples.
This does not reproduce with engine="netcdf4" or engine="h5netcdf". On the second write I'm getting a Permission denied error. So this is only an issue for engine="scipy". I think I've seen similar issues with scipy on the tracker before.
Ah interesting! I'll change engines and check I can't reproduce. Should I raise an issue in the scipy repository?
@williamsnell Thanks for testing. If you can boil down an example with just scipy (taking xarray out of the equation) then an issue at scipy would be reasonable. Otherwise anything which helps to spot the origin within xarray is much appreciated.
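For reference, here is a sketch of what a scipy-only reproduction could look like, taking xarray out of the equation. It assumes the relevant behaviour comes from `scipy.io.netcdf_file`'s memory-mapped read mode (`mmap=True`); the file path and variable names are made up for illustration:

```python
import os
import tempfile
from scipy.io import netcdf_file

path = os.path.join(tempfile.mkdtemp(), "test.nc")

# Write a small netCDF3 file directly with scipy (no xarray involved).
nc = netcdf_file(path, "w")
nc.createDimension("dim", 2)
y = nc.createVariable("y", "d", ("dim",))
y[:] = [3.0, 4.0]
nc.close()

# Re-open with mmap=True: the variable's .data is a view onto the file's
# bytes on disk, not an in-memory copy.
nc = netcdf_file(path, "r", mmap=True)
view = nc.variables["y"].data
print(view)  # [3. 4.]

# Overwriting the file now invalidates the mapped bytes; reading `view`
# afterwards can return garbage (or crash), which would match the xarray
# symptom. Left as comments here to avoid undefined behaviour:
# netcdf_file(path, "w").close()
# print(view)
```

The scipy docs warn that with `mmap=True` the returned arrays refer directly to the mmapped data, so the file must not be closed (or, presumably, rewritten) while such arrays are still referenced.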
IIRC the scipy backend normally loads all data into memory. I'm not sure how this works wrt xarray caching and lazy loading.
> IIRC the scipy backend normally loads all data into memory. I'm not sure how this works wrt xarray caching and lazy loading.
It's about to become lazy thanks to @dcherian's changes in #10376.
> I expected ds to continue existing in memory, with its original values. Failing that, I expected a warning or error, either when the file backing ds was overwritten or, more realistically, when I went to access ds after it had been overwritten and was no longer referencing valid memory.
Stepping back, I'm confused by this conversation, because I'm not sure any of these are reasonable expectations. How is xarray supposed to know that you've changed the file? If you open a reference to the file (i.e. create a pointer to the location of some bytes on disk), but don't load the data in yet (i.e. "lazy loading"), but then delete the file entirely before you loaded the data, we can't expect xarray to still be able to load the data. It doesn't exist anymore! Making arbitrary modifications to the file is no different from deleting it in this respect.
(Note this is all for lazy loaded data - if you called .load() on the whole dataset before modifying the file, then that's safe, because your dataset no longer even knows where it got that data.)
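The "pointer to bytes on disk" framing can be illustrated with nothing but the standard library: a memory-mapped view behaves like a lazily-loaded variable in that it references the file's bytes rather than copying them. This is only an analogy for what a memory-mapping backend does, not xarray's actual code path:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"original")

# A "lazy" view of the file: the bytes are not copied into a private buffer.
backing = open(path, "r+b")
view = mmap.mmap(backing.fileno(), 0)
before = bytes(view)
print(before)  # b'original'

# Overwrite the backing file in place via a separate handle, like the
# second to_netcdf() call in the report.
with open(path, "r+b") as g:
    g.write(b"CLOBBER!")

# The view silently reflects the new bytes; the original data is gone.
after = bytes(view)
print(after)  # b'CLOBBER!'

view.close()
backing.close()
```

There is no error at any point: the mapping has no way of knowing the bytes it points at were replaced.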
If you want to be able to make arbitrary changes to your data on-disk whilst other applications such as xarray are currently (lazily) reading from the data, then you need an ACID database, and that's what Icechunk was designed to solve. Does any equivalent of that transactional functionality exist for the scipy repo?
> a warning or error, either when the file backing ds was overwritten
I guess xarray could potentially check that the last altered time of the file is earlier than the time the file was opened, but this still feels like opening a can of worms. And Icechunk + VirtualiZarr can already do that, using the last_updated_at kwarg.
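For concreteness, the mtime check could look roughly like this. This is a hypothetical sketch, not a proposal for xarray's actual API; `assert_unmodified` is a made-up helper:

```python
import os
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "test.nc")
with open(path, "wb") as f:
    f.write(b"pretend netCDF bytes")

# Record the backing file's modification time at "open" time.
opened_mtime = os.stat(path).st_mtime_ns

def assert_unmodified(path: str, opened_mtime: int) -> None:
    """Raise if the file changed after it was opened (made-up helper)."""
    if os.stat(path).st_mtime_ns != opened_mtime:
        raise RuntimeError(f"{path} was modified after it was opened")

assert_unmodified(path, opened_mtime)  # fine: file untouched

time.sleep(0.05)  # give the filesystem timestamp a chance to tick over
with open(path, "wb") as f:
    f.write(b"overwritten")            # simulate the second to_netcdf()

try:
    assert_unmodified(path, opened_mtime)
except RuntimeError as err:
    print(err)
```

Part of the can of worms is that timestamp resolution is filesystem-dependent (FAT, for instance, only stores mtime to 2-second granularity), so a check like this can miss rapid rewrites.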
Hey, thanks for the reply!
I'm happy to revise what I'd expect to happen to the following:
- when I write to an open file with the `netcdf4` backend, I get a `PermissionError`
- when I write to an open file with the scipy backend, I don't get any error. It keeps trucking (and IMO it shouldn't)
There is a lot about xarray's backend workings I don't understand well. In attempting to see whether this is a scipy issue or an xarray issue, I noticed that the netcdf4 backend calls `_acquire`, which is where the `PermissionError` stems from.
However, I could not find an equivalent call in the scipy backend implementation. Again, I don't understand xarray well enough to say whether parity between these two backends is a reasonable expectation. But as a naive user, I think a `PermissionError` is a reasonable thing to throw, and I would expect it to always be thrown.
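As a possible user-side mitigation in the meantime (this rests on an assumption about the current scipy backend, not a documented guarantee): xarray's scipy backend appears to forward an `mmap` argument to `scipy.io.netcdf_file`, so disabling memory-mapping should make reads copy data into memory rather than reference the file's bytes:

```python
import os
import tempfile
import xarray as xr

path = os.path.join(tempfile.mkdtemp(), "test.nc")
xr.Dataset({"y": ("dim", [3.0, 4.0])}, coords={"dim": [0, 1]}).to_netcdf(
    path, engine="scipy"
)

# Assumption: mmap=False tells scipy.io.netcdf_file to read variable data
# into memory instead of memory-mapping the file, so later overwrites of
# the file cannot corrupt already-opened data.
ds = xr.open_dataset(path, engine="scipy", mmap=False)
print(ds["y"].values)  # [3. 4.]
```

Whether `mmap=False` fully avoids the corruption in the coords case would still need testing against the MCVE above.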
Thanks for the clarification, that makes a lot more sense to me! That definitely sounds like a confusing and error-prone inconsistency between the behaviour of the two backends.
> There is a lot about xarray's backend workings I don't understand well.
Join the club 😅 They do a lot.
@TomNicholas @williamsnell I was already looking into that, trying to align scipy backend with the other backends wrt file locking. I'll wait for @dcherian's #10376 and see if we can improve.
> I guess xarray could potentially check that the last altered time of the file is earlier than the time the file was opened, but this still feels like opening a can of worms. And Icechunk + VirtualiZarr can already do that, using the `last_updated_at` kwarg.
I've thought about checking file times too, but it is not only OS-dependent, it also doesn't feel like the right solution. Great to see this implemented downstream.