Avoid calling np.asarray on lazy indexing classes
This is motivated by https://docs.rapids.ai/api/kvikio/stable/api.html#kvikio.zarr.GDSStore, which loads data directly into GPU memory on read.
Currently we rely on np.asarray to convert a BackendArray wrapped in a number of lazy indexing classes to a real array, but this breaks for GDSStore because the underlying array is a cupy array, which refuses implicit conversion to numpy.
Since np.asarray raises whenever a non-numpy array is returned, we need to use something else.
Here I added get_array, which, like np.asarray, recurses down until it reaches a duck array.
Quite a few things are still broken, I think, but I'd like feedback on the approach.
I considered np.asanyarray(..., like=...), but that would require the lazy indexing classes to know what they're wrapping, which doesn't seem right.
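To make the contrast concrete, here is a minimal sketch; LazyWrapper is a made-up stand-in, not one of xarray's actual indexing classes:

```python
import numpy as np

class LazyWrapper:
    """Toy stand-in for a lazy indexing class wrapping another wrapper or a duck array."""

    def __init__(self, array):
        self.array = array

    def __array__(self, dtype=None):
        # The path np.asarray takes today: it forces conversion to numpy,
        # which cupy rejects with a TypeError about implicit conversion.
        return np.asarray(self.array, dtype=dtype)

    def get_duck_array(self):
        # Recurse through nested wrappers and stop at the first duck array,
        # without ever forcing a conversion to numpy.
        if hasattr(self.array, "get_duck_array"):
            return self.array.get_duck_array()
        return self.array

nested = LazyWrapper(LazyWrapper(np.arange(3)))
nested.get_duck_array()  # array([0, 1, 2]); a cupy array would pass through unchanged
```

With a cupy array at the bottom, np.asarray(nested) would hit the __array__ path and raise, while get_duck_array hands the cupy array back unchanged.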
Ref: https://github.com/xarray-contrib/cupy-xarray/pull/10 which adds a kvikio backend entrypoint
As I understand it, the main purpose here is to remove Xarray's lazy indexing classes.
Maybe call this get_duck_array(), just to be a little more descriptive?
background
(Moving the convo out of a comment for visibility)
For reference, the code we would like is:

```python
array = as_indexable(self.array)  # wrap the underlying array in the appropriate indexing adapter
array = array[self.key]           # apply the outstanding indexer, staying lazy
array = array.get_duck_array()    # only now materialize a duck array (numpy, cupy, ...)
```
problem
So far I have found two failure cases.
1. Wrapping a backend array instance
This method removes LazilyIndexedArray. When we do that, we sometimes have another ExplicitlyIndexed array (_ElementwiseFunctionArray) and sometimes a BackendArray. We then apply array[self.key], which returns an ExplicitlyIndexed array for the former but forces a load from disk for the latter.
One way to solve this would be to return another ExplicitlyIndexed array from __getitem__ on the BackendArrays, since that call is what currently forces the load from disk.
I tried this by making __getitem__ on a BackendArrayWrapper return an ExplicitlyIndexed array (commit). This breaks all existing backends and is a little complicated to add to the Zarr and Scipy backends. A rough sketch of the idea is below.
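For illustration, roughly what that deferred-load __getitem__ could look like; ToyBackendArray and ToyLazySlice are hypothetical names, not xarray's BackendArray/ExplicitlyIndexed machinery, and a real backend would also need to compose keys:

```python
import numpy as np

class ToyBackendArray:
    """Hypothetical backend array; a real one would read from netCDF/Zarr/etc."""

    def _load(self, key):
        # Stand-in for an actual read from disk.
        return np.arange(100)[key]

    def __getitem__(self, key):
        # Defer the read instead of `return self._load(key)`.
        return ToyLazySlice(self, key)

class ToyLazySlice:
    """Hypothetical ExplicitlyIndexed-style wrapper returned by __getitem__."""

    def __init__(self, backend_array, key):
        self.backend_array = backend_array
        self.key = key

    def get_duck_array(self):
        # The load from disk happens only when a duck array is requested.
        return self.backend_array._load(self.key)

ToyBackendArray()[slice(0, 5)].get_duck_array()  # array([0, 1, 2, 3, 4])
```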
2. Wrapping an IndexingAdapter instance
The other case is when as_indexable returns an IndexingAdapter instance. Again, these return duck arrays from __getitem__, so we can't call get_duck_array on the result; see the sketch after the permalink below.
https://github.com/pydata/xarray/blob/522ee2210499cfa434a0ca0825790949a8e08ff0/xarray/core/indexing.py#L671-L690
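A toy version of the adapter situation (the class below is hypothetical; the real adapters are in the permalink above). One conceivable workaround is to give the adapters a trivial get_duck_array of their own:

```python
import numpy as np

class ToyNumpyIndexingAdapter:
    """Hypothetical adapter: __getitem__ already returns a duck array."""

    def __init__(self, array):
        self.array = np.asarray(array)

    def __getitem__(self, key):
        return self.array[key]  # a duck array, not another lazy wrapper

    def get_duck_array(self):
        # Conceivable workaround: return the wrapped duck array as-is, so
        # callers can uniformly call get_duck_array on whatever
        # as_indexable hands back.
        return self.array
```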
how to proceed?
@shoyer I'm not sure on how to proceed.
- For BackendArrays, we could wrap them in another lazy indexing class when the Variable is created, so we don't break our backend entrypoint contract (that __getitem__ can return a duck array). We might need something like this for TensorStore anyway; a sketch follows this list.
- I don't know what to do about the indexing adapters.
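A rough sketch of that extra wrapping layer, with a hypothetical name and key composition left out for brevity:

```python
class BackendArrayShim:
    """Hypothetical extra lazy layer added when the Variable is created."""

    def __init__(self, backend_array, key=...):
        self.backend_array = backend_array
        self.key = key  # Ellipsis stands for "everything" until indexed

    def __getitem__(self, key):
        # Stay lazy: record the key instead of calling into the backend.
        # (Composing keys across repeated indexing is ignored for brevity.)
        return type(self)(self.backend_array, key)

    def get_duck_array(self):
        # The backend keeps its current contract: its __getitem__ may
        # return numpy, cupy, or any other duck array.
        return self.backend_array[self.key]
```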
@Illviljan feel free to push any typing changes to this PR. I think that would really help clarify the interface. I tried adding a DuckArray type, but that didn't get very far.
I don't have a better idea than to do DuckArray = Any  # ndarray/cupy/sparse etc. and add that as the output type, but that won't change anything mypy-wise besides making it easier for us to read the code.
T_ExplicitlyIndexed may be a separate thing to add.
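A sketch of what those typing additions could look like, assuming ExplicitlyIndexed can be imported from xarray.core.indexing:

```python
from typing import Any, TypeVar

from xarray.core.indexing import ExplicitlyIndexed

# Documentation-only alias: mypy still treats it as Any, but signatures read better.
DuckArray = Any  # np.ndarray / cupy.ndarray / sparse array / etc.

# Possible companion TypeVar for helpers that return the same kind of
# lazy wrapper they were given.
T_ExplicitlyIndexed = TypeVar("T_ExplicitlyIndexed", bound=ExplicitlyIndexed)
```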
I'd like to merge this at the end of next week.
It now has tests and should be backwards compatible with external backends.
A good next step would be to finish up #7020.