
Avoid calling np.asarray on lazy indexing classes

dcherian opened this issue 3 years ago

This is motivated by https://docs.rapids.ai/api/kvikio/stable/api.html#kvikio.zarr.GDSStore, which on read loads the data directly into GPU memory.

Currently we rely on np.asarray to convert a BackendArray wrapped in a number of lazy indexing classes to a real array, but this breaks for GDSStore: the underlying array is a cupy array, so np.asarray raises an error.

np.asarray raises if a non-numpy array is returned, so we need to use something else.

Here I added get_array which, like np.asarray, recurses down through the wrappers until it reaches a duck array.
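
To make the idea concrete, here is a minimal runnable sketch of that recursion (purely illustrative; LazyWrapper is a made-up stand-in, not one of xarray's actual lazy indexing classes, and real key handling is omitted):

    import numpy as np

    class LazyWrapper:
        """Made-up stand-in for xarray's lazy indexing classes."""

        def __init__(self, array):
            self.array = array

        def get_duck_array(self):
            # Recurse until we reach an object without get_duck_array,
            # i.e. an actual duck array (numpy, cupy, sparse, ...).
            if hasattr(self.array, "get_duck_array"):
                return self.array.get_duck_array()
            return self.array

    wrapped = LazyWrapper(LazyWrapper(np.arange(4)))
    print(wrapped.get_duck_array())  # [0 1 2 3], without ever calling np.asarray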

I think quite a few things are still broken, but I'd like feedback on the approach.

I considered np.asanyarray(..., like=...) but that would require the lazy indexing classes to know what they're wrapping, which doesn't seem right.

Ref: https://github.com/xarray-contrib/cupy-xarray/pull/10 which adds a kvikio backend entrypoint

dcherian (Aug 03 '22)

As I understand it, the main purpose here is to remove Xarray's lazy indexing classes.

Maybe call this get_duck_array(), just to be a little more descriptive?

shoyer (Aug 03 '22)

background

(Moving the convo out of a comment for visibility)

For reference, the code we would like is

        array = as_indexable(self.array)  # wrap the underlying array in an indexable class
        array = array[self.key]  # index with the stored key
        array = array.get_duck_array()  # materialize as a duck array (numpy, cupy, ...)

problem

So far I have found two failure cases.

1. Wrapping a backend array instance

This method removes LazilyIndexedArray. When we do that, we sometimes end up with another ExplicitlyIndexed array (_ElementwiseFunctionArray) and sometimes with a BackendArray. We then apply array[self.key], which returns an ExplicitlyIndexed array for the former but forces a load from disk for the latter.

One way to solve this would be to return another ExplicitlyIndexed array from __getitem__ on the BackendArrays, which currently forces a load from disk.

I tried this by making __getitem__ on a BackendArrayWrapper return an ExplicitlyIndexed array (commit). This breaks all existing backends, and it is a little complicated to add to the Zarr and Scipy backends.
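
A rough sketch of that idea (hypothetical code, not the linked commit): __getitem__ hands back another lazy object, and the read from the store only happens in get_duck_array.

    class LazyBackendArrayWrapper:
        """Hypothetical wrapper whose indexing stays lazy (not the real xarray class)."""

        def __init__(self, backend_array, key=None):
            self.backend_array = backend_array
            self.key = key  # None means "not indexed yet"

        def __getitem__(self, key):
            # Defer the load: return another lazy wrapper instead of a duck array.
            return type(self)(self.backend_array, key)

        def get_duck_array(self):
            # Only here do we actually read from the store (disk, or GPU via kvikio).
            key = self.key if self.key is not None else ...
            return self.backend_array[key]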

2. Wrapping an IndexingAdapter instance

The other case is when as_indexable returns an IndexingAdapter instance. Again, these return duck arrays from __getitem__, so we can't call get_duck_array on the result:

https://github.com/pydata/xarray/blob/522ee2210499cfa434a0ca0825790949a8e08ff0/xarray/core/indexing.py#L671-L690
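
One defensive pattern that would cover both cases (my assumption, not necessarily what this PR does): after indexing, keep recursing if the result still exposes get_duck_array, and otherwise treat it as a duck array already returned by an IndexingAdapter.

    def to_duck_array(obj):
        # Lazy indexing classes keep the recursion going ...
        if hasattr(obj, "get_duck_array"):
            return obj.get_duck_array()
        # ... while an IndexingAdapter's __getitem__ already gave us a duck array.
        return obj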

how to proceed?

@shoyer I'm not sure on how to proceed.

  1. For BackendArrays we could wrap them in another lazy indexing class when the Variable is created so we don't break our backend entrypoint contract (that __getitem__ can return a duck array). We might need something like this for TensorStore anyway.
  2. I don't know what to do about the indexing adapters.

dcherian (Jan 17 '23)

@Illviljan feel free to push any typing changes to this PR. I think that would really help clarify the interface. I tried adding a DuckArray type but that didn't go too far.

dcherian (Feb 16 '23)

I don't have a better idea than DuckArray = Any  # ndarray/cupy/sparse etc. and adding that as the return type, but that won't change anything mypy-wise besides making it easier for us to read the code.
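
For example, just a sketch of the alias idea (not code from this PR; the class below is only a stub for illustration):

    from typing import Any

    DuckArray = Any  # ndarray / cupy / sparse etc.

    class ExplicitlyIndexed:  # stub for illustration
        def get_duck_array(self) -> DuckArray:
            ...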

Illviljan (Feb 16 '23)

T_ExplicitlyIndexed may be a different thing to add

dcherian (Feb 16 '23)

I'd like to merge this at the end of next week.

It now has tests and should be backwards compatible with external backends.

A good next step would be to finish up #7020

dcherian (Mar 26 '23)