
Whether a DataArray is copied when inserted into a Dataset depends on whether coordinates match exactly

Open anntzer opened this issue 10 years ago • 18 comments

Consider

import numpy as np
from xray import *

ds = Dataset({"a": DataArray(np.zeros((3, 4)))})
ds["b"] = b = DataArray(np.zeros((3, 4)))
b[0, 0] = 1
print(ds["b"][0, 0]) # ==> prints 1

ds = Dataset({"a": DataArray(np.zeros((3, 4)))})
ds["b"] = b = DataArray(np.zeros((3, 3)))  # !!! we implicitly fill the last column with nans.
b[0, 0] = 1
print(ds["b"][0, 0]) # ==> prints 0

In the first case, modifying the DataArray also modified the Dataset, but not in the second case.

anntzer avatar Oct 19 '15 23:10 anntzer

Yes, this is certainly the case -- do you suggest doing this differently? If coordinates are already aligned, I would rather not do unnecessary copies -- this could be quite expensive.

If this was surprising to you, then it is certainly worth documenting somewhere.
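For illustration, whether an insertion copied the data can be checked directly with numpy.shares_memory. A minimal sketch of the aligned case (written against the modern package name, xarray):

```python
import numpy as np
import xarray as xr  # the library was later renamed from xray to xarray

# Aligned insertion: dims and coordinates match, so no copy should be made.
arr = np.zeros((3, 4))
ds = xr.Dataset({"a": (("dim_0", "dim_1"), np.zeros((3, 4)))})
ds["b"] = xr.DataArray(arr, dims=("dim_0", "dim_1"))
print(np.shares_memory(ds["b"].values, arr))  # True: the Dataset holds a view
```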

shoyer avatar Oct 20 '15 00:10 shoyer

I was more expecting that if the coordinates are not already aligned, then the DataArray gets modified to point to the "larger" array (though of course there may be issues with other variables pointing to the same array, etc.).

I like the pattern

some_explicit_dict_name[some_explicit_entry_name] = tmpvar = np.array(...)
<... work work work on initializing tmpvar, expecting to initialize some_explicit_dict_name[some_explicit_entry_name]>

(because the explicit name is a bit too long to work with, and this makes it immediately clear what tmpvar is used for), but xray's behavior makes this work (on DataArrays) only sometimes, with no warning that it could fail to update the array I care about.

Not sure what the best approach is. Perhaps

tmpvar = dataset.new_dataarray(*args_and_**kwargs_to_DataArray.__init__)

which would make sure that no one else is pointing to the internal array at that point?
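Such a helper could be written in user code today; a sketch against the modern xarray API (the name new_dataarray is hypothetical, as above — it is not an xarray method):

```python
import numpy as np
import xarray as xr

def new_dataarray(ds, name, *args, **kwargs):
    # Hypothetical convenience along the lines suggested above: insert a
    # freshly constructed DataArray, then return the wrapper that actually
    # lives in the Dataset, so later in-place edits are guaranteed to be
    # visible through ds[name].
    ds[name] = xr.DataArray(*args, **kwargs)
    return ds[name]

ds = xr.Dataset({"a": (("dim_0", "dim_1"), np.zeros((3, 4)))})
tmp = new_dataarray(ds, "b", np.zeros((3, 4)))
tmp[0, 0] = 1
print(float(ds["b"][0, 0]))  # 1.0 -- the edit reaches the Dataset
```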

anntzer avatar Oct 20 '15 14:10 anntzer

If you use the DataArray constructor instead of inserting an item in a Dataset, the array values will always be a view.

If you pre-align the array you are about to insert using xray.align, it is also guaranteed to be a view.

Unfortunately, it's impossible to make such automatically aligned arrays views in general, because of numpy's memory model.

One thing we could do is add an option to make inserted items always a copy. I would probably put this on the Dataset.merge method, which is what Dataset.__setitem__ uses under the hood.
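The pre-alignment recipe can be spelled out with the modern xarray API. A sketch: align first pays the copy cost up front (the outer join pads the missing column with NaN), after which the insertion shares memory and edits do propagate.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": xr.DataArray(np.zeros((3, 4)), coords={"dim_1": range(4)},
                       dims=("dim_0", "dim_1"))}
)
b = xr.DataArray(np.zeros((3, 3)), coords={"dim_1": range(3)},
                 dims=("dim_0", "dim_1"))

# Align b against the Dataset first; this returns a new, already-aligned
# array, so the subsequent insertion needs no further copy.
b, _ = xr.align(b, ds, join="outer")
ds["b"] = b
b[0, 0] = 1
print(float(ds["b"][0, 0]))  # 1.0 -- now the edit does propagate
```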

shoyer avatar Oct 21 '15 18:10 shoyer

"If you use the DataArray constructor instead of inserting an item in a Dataset, the array values will always be a view." Can you explain what you mean here? I am using the DataArray constructor and then inserting the item in the Dataset.

anntzer avatar Oct 21 '15 18:10 anntzer

"If you use the DataArray constructor instead of inserting an item in a Dataset, the array values will always be a view." Can you explain what you mean here? I am using the DataArray constructor and then inserting the item in the Dataset.

Yes, I should clarify -- the DataArray you create with the constructor will be a view. When you insert it into a Dataset, it may no longer be a view (if the index coordinates are not aligned).

shoyer avatar Oct 21 '15 19:10 shoyer

Thinking about it again, I think there could just be a mention in the docs that the pattern

foo = dataset[key] = DataArray(...)

should be avoided and written as

dataset[key] = DataArray(...)
foo = dataset[key]

That should be good enough?

anntzer avatar Oct 21 '15 22:10 anntzer

@anntzer Yes, that would be a reasonable thing to state in the docs. Though I'm not sure how widespread that pattern is -- this is the first time I've seen it used.

shoyer avatar Oct 24 '15 01:10 shoyer

Sorry to raise the issue again but I just noticed that I had a couple of other places where I wrote quite naturally

d = DataArray(...)
e = DataArray(...)
ds = Dataset({"d": d, "e": e})
<work to initialize d and e>

which is equivalent to the form I was mentioning earlier (whether ds will actually get initialized properly depends on whether d and e have exactly matching coordinates).

To be honest I don't really understand what you mean by "Unfortunately, it's impossible to make such automatically aligned arrays views in general, because of numpy's memory model.": when the line ds = Dataset(...) gets executed, the ndarrays ds.d.values and ds.e.values either point to the old ndarrays d.values and e.values, or to some newly created, coordinate-aligned ndarrays. If it is the latter, one can just modify d.values and e.values (as well as their coordinate systems) to point to the new ndarrays, right?

anntzer avatar Oct 29 '15 07:10 anntzer

If it is the latter, one can just modify d.values and e.values (as well as their coordinates systems) to point to the new ndarrays, right?

Yes, in principle we could do that. But it's very surprising for a function to modify its input arguments, so usually we try to avoid things like that.

If you put arrays in a Dataset, you should not count on modifications to the arrays in the Dataset trickling back to the original arrays.

Generally, I would recommend something like the following:

d = DataArray(...)
e = DataArray(...)
d, e = xray.align(d, e, join='outer')
<work to initialize d and e>
ds = Dataset({"d": d, "e": e})

or

d = DataArray(...)
e = DataArray(...)
ds = Dataset({"d": d, "e": e})
<work to initialize ds.d and ds.e>
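A runnable version of this second pattern (dimension names x/y are illustrative): because every write goes through the Dataset itself, it cannot be lost to an alignment copy.

```python
import numpy as np
import xarray as xr

# Build the Dataset first, then initialize through it; writes through
# ds["d"] / ds["e"] always land in the Dataset, regardless of whether
# construction had to align and copy the inputs.
ds = xr.Dataset(
    {
        "d": xr.DataArray(np.zeros((3, 4)), dims=("x", "y")),
        "e": xr.DataArray(np.zeros((3, 4)), dims=("x", "y")),
    }
)
ds["d"][0, 0] = 1.0
ds["e"][:, 1] = 2.0
print(float(ds["d"][0, 0]), float(ds["e"][0, 1]))  # 1.0 2.0
```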

shoyer avatar Oct 29 '15 17:10 shoyer

If you put arrays in a Dataset, you should not count on modifications to the arrays in the Dataset trickling back to the original arrays.

I think the most confusing part is that the modifications will trickle down sometimes -- even though I'd prefer them to always be propagated, I would still mind much less if they were never propagated. At the end of the day it's not my call; feel free to close the issue if you want.

anntzer avatar Oct 29 '15 18:10 anntzer

I think the most confusing part is that the modifications will trickle down sometimes -- even though I'd prefer them to always be propagated, I would still mind much less if they were never propagated.

I agree! I am going to try to clarify and clean this up a little bit for the next version of xray. The only part that should ever trickle back is the values (because that's unavoidable for high performance). Right now, modifications to the metadata (.attrs) can also sometimes trickle back.

shoyer avatar Oct 29 '15 18:10 shoyer

Okay, so the Dataset constructor needs to create a copy of the metadata. Could we include an argument to the constructor that copies the .values too? Obviously there is a performance hit there, but it would allow users to be explicit about their own memory handling. Just an idea.

jhamman avatar Oct 29 '15 18:10 jhamman

Yes, that's a great idea! We could add a copy=False argument to both the Dataset and DataArray constructors, like the align function.


shoyer avatar Oct 29 '15 19:10 shoyer

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

stale[bot] avatar Jan 30 '19 10:01 stale[bot]

Looks like the implicit nan-filling behavior is not present anymore, which likely killed the issue?

anntzer avatar Jan 30 '19 10:01 anntzer

The original example now gives an error about dim_1 having a different size, but you can still reproduce the behavior if you provide explicit labels for dim_1, e.g.,

ds = Dataset({"a": DataArray(np.zeros((3, 4)))}, {'dim_1': range(4)})
ds["b"] = b = DataArray(np.zeros((3, 3)), {'dim_1': range(3)})  # !!! we implicitly fill the last column with nans.
b[0, 0] = 1
print(ds["b"][0, 0]) # ==> prints 0

The copy argument is still a good idea. We could do:

  • copy=True always copy
  • copy=None copy if necessary (default)
  • copy=False never copy (raise if not possible)

(Note that copy=False in align currently means "copy if necessary", so this would be a breaking change if users are setting that explicitly)
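The proposed tri-state semantics could be sketched as a plain helper (hypothetical, not xarray API; a shape mismatch stands in for "alignment requires a copy"):

```python
import numpy as np

def resolve_copy(values, needs_copy, copy=None):
    # Hypothetical sketch of the tri-state `copy` flag proposed above.
    if copy is True:       # always copy
        return values.copy()
    if copy is False:      # never copy; refuse if one is unavoidable
        if needs_copy:
            raise ValueError("copy=False, but alignment requires a copy")
        return values
    # copy=None (default): copy only if necessary
    return values.copy() if needs_copy else values

a = np.zeros(3)
assert resolve_copy(a, needs_copy=False) is a      # view kept
assert resolve_copy(a, needs_copy=True) is not a   # silent copy
```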

shoyer avatar Jan 31 '19 18:01 shoyer

Trying to keep us below 1K issues — is this still current?

max-sixty avatar Jun 23 '24 17:06 max-sixty

Yes.

anntzer avatar Jun 24 '24 14:06 anntzer