array-api icon indicating copy to clipboard operation
array-api copied to clipboard

partial adoption of Array API for Xarray's `NamedArray`

Open andersy005 opened this issue 2 years ago • 13 comments

we've been having a discussion over in https://github.com/pydata/xarray/issues/8333 about adopting Array API and the possibility of offering an array-API compatible interface via namedarray, a new Xarray data structure that aims to be a lighter version of xarray's Variable: https://github.com/pydata/xarray/discussions/8080

although there's general agreement that xarray.NamedArray should not offer an array-API compatible interface --mainly due to Xarray's unique semantics like broadcasting by dimension name rather than dimension order (e.g. xp.mean(xarray.NamedArray, axis=1) vs xarray.NamedArray.mean(dim="time")--we're keen on adopting element-wise functions and how they would interact with xarray.

the most important point at issue is, we find that these element-wise functions like xp.abs(xarray.NamedArray) --> xarray.NamedArray are critical to our use-case. however, we're not looking to adopt the entire Array API standard, but are there any guidelines or recommendations for adopting a subset of it? we thought it's worth raising this issue, to see if folks here think this is a good idea or not.

andersy005 avatar Oct 31 '23 22:10 andersy005

I think the main thing to consider is how usable this would be for consumers of the array API. If you don't accept numbered axes, then any library written against the array API that has code using numbered axes won't actually work with xarray.

asmeurer avatar Oct 31 '23 23:10 asmeurer

I planned to use NamedArray just to keep track of dimension names across operations (so I do not have to implement that logic myself), but I needed it to be compatible with the array API standard. It is ok for me if you compare dimension names for some operations (such as attempting to sum different things) and raise an exception in that case. It ideally should also support operations between NamedArrays and the underlying wrapped (unnamed) array type. Would that not be possible?

vnmabus avatar Nov 01 '23 06:11 vnmabus

Ccing @shoyer / @Illviljan / @TomNicholas / @dcherian for visibility

andersy005 avatar Nov 02 '23 20:11 andersy005

needed it to be compatible with the array API standard.

It is ok for me if you compare dimension names for some operations (such as attempting to sum different things)

@vnmabus I may have misunderstood you but my understanding is that these are two conflicting requirements. Compatibility with the full API standard implies broadcasting by axis number, including in binary operations like sum. But if we compare dimension names to use those for broadcasting during a sum operation then that is a different behaviour, which is inconsistent with broadcasting by axis number.

To see what I mean consider summing two arrays, which are identical up to a transposition.

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: arr = np.array([[1, 2, 3], [4, 5, 6]])

In [4]: arr.shape
Out[4]: (2, 3)

In [5]: arr.T
Out[5]: 
array([[1, 4],
       [2, 5],
       [3, 6]])

In [6]: arr.T.shape
Out[6]: (3, 2)

In [7]: arr + arr.T
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 arr + arr.T

ValueError: operands could not be broadcast together with shapes (2,3) (3,2) 

In [8]: var = xr.Variable(data=arr, dims=['x', 'y'])

In [9]: var.shape
Out[9]: (2, 3)

In [10]: var.T
Out[10]: 
<xarray.Variable (y: 3, x: 2)>
array([[1, 4],
       [2, 5],
       [3, 6]])

In [11]: var.T.shape
Out[11]: (3, 2)

In [12]: var + var.T
Out[12]: 
<xarray.Variable (x: 2, y: 3)>
array([[ 2,  4,  6],
       [ 8, 10, 12]])

You can see that in numpy (broadcasting by axis number) then the array shapes (2,3), (3,2) are not broadcastable, but in xarray they are. (You could argue about the definition of the transpose here but I think then you would just end up with xarray breaking a different part of the array API standard.)

TomNicholas avatar Nov 02 '23 21:11 TomNicholas

Now perhaps we can re-explain @andersy005 's question above: operations on scalars are the only part of the array API where axis number & broadcasting rules are irrelevant, so it's the only part of the array API standard that xarray could obey. The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it?

TomNicholas avatar Nov 02 '23 21:11 TomNicholas

It ideally should also support operations between NamedArrays and the underlying wrapped (unnamed) array type.

Finally you could imagine allowing this, but it would create a weird situation where xarray-only code was robust to transposition, but any operations between xarray and numpy objects were not invariant to transposition.

TomNicholas avatar Nov 02 '23 21:11 TomNicholas

Well, I was not aware of that behavior of xarray, and for me it seems dangerous, and not very consistent. What if I wanted to divide, for example, an array with a "distance" dimension and one with a "time" dimension. Would xarray do it or it would refuse as the dimensions do not match? What if the arrays have additional dimensions with similar names? For me that is not a desirable behavior.

What I like is simply an array type (supporting the full array API standard) that wraps other array types and adds named dimensions, that work across operations. It could be ok if they check that the dim name is the same for some operations, like adding two arrays, although even in that case it seems dubious to me (what if I wanted to sum dims "initialization time" and "computation time"? Honestly, I think that check should be done at the unit level and not at the name level).

vnmabus avatar Nov 03 '23 06:11 vnmabus

Well, I was not aware of that behavior of xarray, and for me it seems dangerous, and not very consistent. What if I wanted to divide, for example, an array with a "distance" dimension and one with a "time" dimension. Would xarray do it or it would refuse as the dimensions do not match? What if the arrays have additional dimensions with similar names? For me that is not a desirable behavior.

If you divide an array with a "distance" dimension by an array with a "time" dimension in Xarray, you get an array with two dimensions: "distance" and "time".

Some other operations might will raise an error with different dimension names, but in no cases will dimensions names be ignored.

What I like is simply an array type (supporting the full array API standard) that wraps other array types and adds named dimensions, that work across operations. It could be ok if they check that the dim name is the same for some operations, like adding two arrays, although even in that case it seems dubious to me (what if I wanted to sum dims "initialization time" and "computation time"? Honestly, I think that check should be done at the unit level and not at the name level).

I agree, this is certainly a reasonable data structure to build, but it isn't Xarray. This is basically the model used by PyTorch's Named Tensor.

shoyer avatar Nov 03 '23 06:11 shoyer

The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it?

I guess it's worth considering it just in the sense that the function names are standardized so it's useful to use them for the sake of consistency across libraries (this even for functions that otherwise have an axis parameter).

It's likely that most nontrivial code written against the array API won't work with named arrays but perhaps there is some code that will. Perhaps it would be instructive to take a look at scipy and scikit-learn's array API code and see whether it would otherwise work with xarray.

asmeurer avatar Nov 03 '23 07:11 asmeurer

The question is should xarray obey only a part of the full standard, or is it better to either obey it all or none of it?

I could interpret this question in two ways:

  1. Is it useful to implement a subset of the functionality with the same syntax and semantics as the array API standard has for that functionality?
  2. Should we add a __array_namespace__ method to xarray.NamedArray to declare compatibility even though we're only partially compatible?

I think the answer to (1) is yes, why not - a lot of effort went into making every function in the standard have a clean signature and well-defined semantics, and comparing with many array libraries to make sure it matches them as well as possible (unless there's a good reason not to).

The answer to (2) should be no I believe. That will be incorrect, and for libraries that try to write array-library-agnostic code it may lead to silently incorrect results.

rgommers avatar Nov 07 '23 20:11 rgommers

That will be incorrect, and for libraries that try to write array-library-agnostic code it may lead to silently incorrect results.

We could raise for every op that is not an elementwise function but that seems against the spirit of the standard. Would you agree?

dcherian avatar Nov 08 '23 16:11 dcherian

Probably against the letter too, and not just the spirit - since there's many places where the wording is "must behave like X".

rgommers avatar Nov 08 '23 16:11 rgommers

Point taken! 😂

Thank you

dcherian avatar Nov 08 '23 17:11 dcherian

As I don't think there is anything more to discuss here as pertains to revising the array API standard, I'll go ahead and close this.

kgryte avatar Mar 21 '24 07:03 kgryte