xarray
Lingering memory connections when extracting underlying `np.arrays` from datasets
What is your issue?
I know that generally, ds2 = ds connects the two objects in memory, and changes in one will also cause changes in the other.
However, I generally assume that certain operations should break this connection, for example:
- extracting the underlying `np.array` from a dataset (changing its type and destroying a lot of the xarray-specific information: index, dimensions, etc.)
- using the underlying `np.array` in a new dataset
In other words, I would expect `ds['var'].values` to behave like `copy.deepcopy(ds['var'].values)`.
Here's an example that illustrates how in these cases, the objects are still linked in memory:
(apologies for the somewhat hokey example)
import xarray as xr
import numpy as np
# Create a dataset
ds = xr.Dataset(coords = {'lon':(['lon'],np.array([178.2,179.2,-179.8, -178.8,-177.8,-176.8]))})
print('\nds: ')
print(ds)
# Create a new dataset that uses the values of the first dataset
ds2 = xr.Dataset({'lon1':(['lon'],ds.lon.values)},
coords = {'lon':(['lon'],ds.lon.values)})
print('\nds2: ')
print(ds2)
# Change ds2's 'lon1' variable
ds2['lon1'][ds2['lon1']<0] = 360 + ds2['lon1'][ds2['lon1']<0]
# `ds2` is changed as expected
print('\nds2 (should be modified): ')
print(ds2)
# `ds` is changed, which is *not* expected
print('\nds (should not be modified): ')
print(ds)
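One way to check for the link directly, rather than inferring it from printed output, is `np.shares_memory`. The following is a minimal sketch, assuming a plain data variable on a hypothetical dimension `x` (not an index coordinate) and an xarray version in which `.values` hands back the backing array without copying:

```python
import numpy as np
import xarray as xr

# A plain data variable backed by a NumPy array (illustrative values)
raw = np.array([178.2, 179.2, -179.8, -178.8])
ds = xr.Dataset({"lon1": (["x"], raw)})

# Build a second dataset from the values of the first, as in the example above
ds2 = xr.Dataset({"lon1": (["x"], ds["lon1"].values)})

# True if the two variables are backed by overlapping memory
linked = np.shares_memory(ds["lon1"].values, ds2["lon1"].values)
print("ds and ds2 share memory:", linked)
```

If this prints `True`, any in-place write to one dataset's variable will show up in the other.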
The question is - am I right (from a UX perspective) to expect these kinds of operations to disconnect the objects in memory? If so, I might try to update the docs to be a bit clearer on this. (or, alternatively, if these kinds of operations should disconnect the objects in memory, maybe it's better to have .values also call .copy(deep=True).values)
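For reference, the two explicit ways of breaking the link mentioned here can be sketched as follows. This assumes a data variable on a hypothetical dimension `x`; the variable names are illustrative only:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"lon1": (["x"], np.array([178.2, 179.2, -179.8, -178.8]))})

# Option 1: copy the extracted NumPy array
vals = ds["lon1"].values.copy()

# Option 2: deep-copy the dataset first, then extract
deep_vals = ds.copy(deep=True)["lon1"].values

# In-place edits on either copy leave `ds` alone
vals[vals < 0] += 360
deep_vals[deep_vals < 0] += 360

print(ds["lon1"].values)  # still contains the negative longitudes
```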
Appreciate y'all's thoughts on this!
In general, you're expected to deep-copy explicitly to break these "links". This is the NumPy paradigm.
If you want to read up on this, look for "view vs copy"!
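As a quick illustration of "view vs copy" in plain NumPy (a sketch with made-up variable names, not taken from the xarray docs): basic slicing returns a view onto the same buffer, while `.copy()` and integer-array indexing return independent arrays.

```python
import numpy as np

a = np.arange(6)

view = a[2:5]           # basic slicing: a view onto a's buffer
copied = a[2:5].copy()  # explicit copy: its own buffer
fancy = a[[2, 3, 4]]    # integer-array indexing: also a copy

view[0] = 99  # writes through to `a`

print(a)       # a[2] has changed
print(copied)  # unaffected
print(fancy)   # unaffected
print(np.shares_memory(view, a), np.shares_memory(copied, a))
```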
Yeah, I guess in this case from a legibility standpoint, the fact that .values 'changes' (from the user point of view) the form (and type) of the data from a DataArray to the underlying numpy array just feels different?
Like I wouldn't expect the following two operations:
a = np.ones(3)
b = a.astype(str)
a[0] = 5
print(b)
and
a = np.ones(3)
b = a
a[0] = 5
print(b)
to behave the same. But I do understand that from the backend perspective, .values seems to be more of the latter than the former, since it is just accessing something that's already there...
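The difference between the two snippets can be made explicit with an identity check and `np.shares_memory`. A small sketch (the name `same` is illustrative); note that `astype` with `copy=False` may return the input array itself when no conversion is needed, so even `astype` is only a guaranteed copy with its default arguments:

```python
import numpy as np

a = np.ones(3)
b = a.astype(str)  # conversion: allocates a new array
c = a              # plain assignment: another name for the same array

a[0] = 5

print(b)  # still holds the original values as strings
print(c)  # sees the change
print(c is a, np.shares_memory(a, b))

# With copy=False and an unchanged dtype, NumPy can return `a` itself
same = a.astype(a.dtype, copy=False)
print(same is a)
```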
(relatedly, would it be worth it to link to the relevant numpy docs in this part of the xarray docs?)
A related issue is that this allows you to (possibly inadvertently) circumvent certain xarray safeguards, like the TypeError around not being able to modify IndexVariables:
# Create sample dataset
ds = xr.Dataset({'test':(['lon'],[5,6,7])},coords = {'lon':(('lon'),[0,1,2])})
# Raises TypeError, to avoid changing indices like this
ds['lon'][0] = 2
# Now, extract the underlying numpy array
a = ds.lon.values
# Change value
a[0] = 2
# This changes `ds` without raising error
print(ds)
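On the user side, one way to avoid this kind of accidental write is to work on a read-only view of the extracted array. The sketch below uses plain NumPy only; it is a possible defensive pattern, not something xarray does for you, and the names are illustrative:

```python
import numpy as np

data = np.array([0, 1, 2])

# A view with the writeable flag cleared; `data` itself stays writable
readonly = data.view()
readonly.flags.writeable = False

try:
    readonly[0] = 2
except ValueError as err:
    print("write blocked:", err)

print(data)  # unchanged
```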
> (relatedly, would it be worth it to link to the relevant numpy docs in this part of the xarray docs?)
Yes! That would be a welcome contribution.
> A related issue is that this allows you to (possibly inadvertently) circumvent certain xarray safeguards, like the `TypeError` around not being able to modify `IndexVariables`:
Yes. But I'm not sure there's much we can do about this. Our focus should be "if you use xarray operations, you won't get surprises"...
> Yes! That would be a welcome contribution.
Sounds good, I'll prep a PR
Resolved by #8744.