Alignment with xarray

Open ivirshup opened this issue 2 years ago • 18 comments

I'm opening this issue to track and discuss how our data structure differs from xarray. Ideally I would close it when AnnData could easily be implemented via xarray.

Some previous discussion: #308

The idea

I often think of AnnData as a kind of "special case" of xarray Datasets. We just improve convenience by specializing on the 2d case, plus a few other features. It would be nice if I didn't just think of it that way, and we could actually just use their code here.

sgkit essentially accomplishes this: it uses a very "anndata shaped"[^1] xarray Dataset[^2] to represent genomics data. These data structures and our goals with them are so similar that searching for open issues by the sgkit devs on the xarray repository is a great way to find compatibility issues for anndata.

Additionally, zarr and OME-zarr are quite aligned with xarray. (TODO: expand with context here)

What's missing

Some things we need, which xarray does not currently provide:

[ ] We have support for the fast sparse array library (ideally we can get pydata/sparse to become fast)
[ ] We support categorical variables
[ ] We support repeated dimensions (e.g. obsp, varp) https://github.com/pydata/xarray/issues/3731
[ ] We have a nested structure (though it's on the roadmap with datatree being implemented)
[ ] We are actively working on support for awkward arrays (https://github.com/theislab/anndata/pull/647, https://github.com/pydata/xarray/issues/4285, https://github.com/pystatgen/sgkit/issues/643)
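
To make the categorical item in the list above concrete, here is a minimal sketch of the usual encoding: integer codes plus a table of categories. It uses plain Python lists, and `encode_categorical` / `decode_categorical` are hypothetical names for illustration, not anndata or xarray API.

```python
def encode_categorical(values):
    """Encode labels as integer codes plus a sorted table of categories."""
    categories = sorted(set(values))
    code_of = {category: i for i, category in enumerate(categories)}
    codes = [code_of[value] for value in values]
    return codes, categories


def decode_categorical(codes, categories):
    """Recover the original labels from codes and categories."""
    return [categories[code] for code in codes]


cell_types = ["T cell", "B cell", "T cell", "NK cell"]
codes, categories = encode_categorical(cell_types)
print(codes)       # [2, 0, 2, 1]
print(categories)  # ['B cell', 'NK cell', 'T cell']
```

Any container that wraps categoricals has to carry both pieces together through indexing and concatenation, which is what a plain numeric array type does not do on its own.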

[^1]: Since we're in the same language, working with biological data, and using many of the same technologies it would make a lot of sense for us to have greater alignment with sgkit.
[^2]: More context: https://github.com/single-cell-data/matrix-api/issues/11#issuecomment-1072533371

Mar 23 '22 18:03 ivirshup

cc @jpivarski (who may be interested in the Awkward Array connection)

Apr 04 '22 18:04 jakirkham

Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa. Even the "tree-like data structure" on xarray's road map (experimentally implemented by Datatree) is not quite the same thing, as Datatrees are more like nested groups in an HDF file (as seen in these docs): a small number of nested objects, which can each be large. Awkward Arrays represent a large number of nested objects. The comparison is like "AoS vs SoA" (just an analogy). This comment, https://github.com/pydata/xarray/issues/4118#issuecomment-1059382908, seems to be spelling out the difference, and I'm following up with the author on https://github.com/scikit-hep/awkward-1.0/discussions/1396.

As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays (and not the other way around). That's something I should probably ask the xarray developers someday. Datatree is extending Dataset in a bigger way than it would probably take to wrap an Awkward Array.

Unless/until we actually do that, implementation of anndata with xarray would have to have some way to handle the fact that Awkward Arrays are not included within xarray's data model.

Apr 04 '22 20:04 jpivarski

> Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa. ... As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays

My mental model here was a 1d xr.DataArray containing an ak.Array. This seems fairly doable to me since you really only need labels -> positional indices. Figuring out the merging/concatenation semantics here could take some more doing, but also strikes me as possible.
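
A minimal sketch of the "labels -> positional indices" idea, in plain Python rather than xarray or Awkward. `LabeledRagged` is a hypothetical name, and nested lists stand in for an `ak.Array`; neither library's actual API is used here.

```python
class LabeledRagged:
    """Hypothetical 1d labeled container over ragged (variable-length) rows."""

    def __init__(self, labels, rows):
        if len(labels) != len(rows):
            raise ValueError("need exactly one label per row")
        if len(set(labels)) != len(labels):
            raise ValueError("labels must be unique")
        self.labels = list(labels)
        self.rows = list(rows)
        # The only index structure required: label -> position.
        self._pos = {label: i for i, label in enumerate(self.labels)}

    def sel(self, labels):
        """Label-based selection, resolved to positional indexing."""
        return self.isel([self._pos[label] for label in labels])

    def isel(self, positions):
        """Positional selection; each row keeps its own length."""
        return LabeledRagged(
            [self.labels[i] for i in positions],
            [self.rows[i] for i in positions],
        )


cells = LabeledRagged(["cell_a", "cell_b", "cell_c"], [[1, 2, 3], [], [4, 5]])
subset = cells.sel(["cell_c", "cell_a"])
print(subset.labels)  # ['cell_c', 'cell_a']
print(subset.rows)    # [[4, 5], [1, 2, 3]]
```

Selection only touches the outer dimension, which is why the ragged payload never needs a rectangular shape. Merging and concatenation are deliberately left out of this sketch.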

Random thought: storing an arrow ListArray inside an xr.DataArray could get us part way here.

Apr 06 '22 16:04 ivirshup

Can you put Arrow data in xarray? Arrow is interchangeable with Awkward Array, so having Arrow can be seen as equivalent to having Awkward. The ak.to_arrow and ak.from_arrow functions are usually zero-copy, too. If that's already a possibility, it's more than part way there.

The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have shape and dtype. (Same for Arrow arrays, for the same reason.) That's usually the first thing that we find when we attempt to put Awkward Arrays into Pandas or Dask naively. It's also why we can't participate in the Python array API standard.
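
To illustrate the missing `shape`, here is a small plain-Python sketch (not Awkward's or Arrow's own logic). The hypothetical helper `regular_shape` returns a shape tuple for rectangular nested lists and `None` as soon as rows disagree in length.

```python
def regular_shape(data):
    """Return the rectangular shape of nested lists, or None if ragged."""
    if not isinstance(data, list):
        return ()  # a scalar leaf has the empty shape
    child_shapes = {regular_shape(item) for item in data}
    if None in child_shapes or len(child_shapes) > 1:
        return None  # children disagree, so no single shape tuple exists
    inner = child_shapes.pop() if child_shapes else ()
    return (len(data),) + inner


print(regular_shape([[1, 2], [3, 4]]))       # (2, 2)
print(regular_shape([[1, 2, 3], [], [4, 5]]))  # None
```

Code that assumes every array can answer `.shape` with one tuple has no sensible value to receive in the second case.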

A single ak.Array can be split apart into a small number of buffers of different sizes, each of which can be an xr.DataArray, along with some metadata to put them back again. That was the idea for using Awkward Array in Zarr: one ak.Array becomes one Zarr group of datasets. Since xarray Datatree is like Zarr and HDF5 groups, one ak.Array could be decomposed into a Datatree using ak.to_buffers and reconstituted using ak.from_buffers.
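
The decomposition described above can be sketched conceptually in plain Python. This is not the implementation behind `ak.to_buffers` / `ak.from_buffers`; `ragged_to_buffers`, `ragged_from_buffers`, and the `form` dictionary are hypothetical stand-ins that handle only one level of nesting.

```python
def ragged_to_buffers(rows):
    """Split variable-length rows into flat buffers plus small metadata."""
    offsets = [0]
    content = []
    for row in rows:
        content.extend(row)
        offsets.append(len(content))  # where the next row starts
    form = {"kind": "list-of-lists", "length": len(rows)}
    return form, {"offsets": offsets, "content": content}


def ragged_from_buffers(form, buffers):
    """Rebuild the rows from the flat buffers and the metadata."""
    offsets, content = buffers["offsets"], buffers["content"]
    return [
        content[offsets[i]:offsets[i + 1]]
        for i in range(form["length"])
    ]


rows = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
form, buffers = ragged_to_buffers(rows)
print(buffers["offsets"])  # [0, 3, 3, 5]
print(buffers["content"])  # [1.1, 2.2, 3.3, 4.4, 5.5]
print(ragged_from_buffers(form, buffers) == rows)  # True
```

Each buffer is an ordinary flat array of its own length, so each could be stored as a separate array in a group, with the metadata kept alongside to put them back together.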

Apr 06 '22 18:04 jpivarski

> The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have shape and dtype. (Same for Arrow arrays, for the same reason.) That's usually the first thing that we find when we attempt to put Awkward Arrays into Pandas or Dask naively. It's also why we can't participate in the Python array API standard.

Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case.

Apr 06 '22 18:04 jakirkham

> Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case.

We already talked about it here: https://github.com/data-apis/consortium-feedback/discussions/6. It sounded pretty clear that Awkward (and by extension, Arrow) are out of scope for Data Array API, and it's understandable that the scope would have to cut off somewhere.

Apr 06 '22 18:04 jpivarski

If anyone is looking for more confusion, I'd like to mention scipp, and in particular its Binned data feature. This is somewhat similar to a DataArray containing an Awkward Array of records. Happy to share more info if someone is interested.

May 31 '22 05:05 SimonHeybrock

@SimonHeybrock, thanks for pointing that out! From my initial look, the API for scipp looks quite nice. It does seem to cater to some use-cases we're looking at more than the more geospatial focus of xarray.

However, I really like that xarray can hold various types of python arrays. For instance, sparse arrays are very important to us – and I'd expect dask will become important as well.

Jun 07 '22 15:06 ivirshup

@ivirshup The two things you point out (holding other Python arrays, dask support) are indeed somewhat sore points for us. We would like to do both, but currently have no funding to do so.

We have serialization compatible with dask, so a number of the dask multi-processing APIs can be used, but we do not have an implementation of the dask collections interface, i.e., we currently do not support chunking and operations in the style of xarray's dask support.

Jun 08 '22 06:06 SimonHeybrock

Another potential ask here: not reading the dims (like the indices of a dataframe) into memory at Dataset declaration.

Jul 29 '23 17:07 ilan-gold

👋 Hi folks! Xarray dev here. Just wanted to drop a note to say that we'd be happy to help move this issue forward if/when it becomes a priority. We've been making lots of progress toward flexible indexes and array backends that I assume would be of interest here.

Sep 27 '23 16:09 jhamman

Hey @jhamman! I think it's pretty close to becoming a priority. Figuring out how heavy of a lift sparse arrays will be is the main thing here. Could you point me to any recent developments around array backends? Are we even talking like a-couple-hours-ago recent?

https://github.com/pydata/xarray/pull/8075

Sep 27 '23 17:09 ivirshup

Yes "couple of hours" recent. We will refactor out that NamedArray piece over the next couple of months to a new library with minimal dependencies (no pandas!) and support for any array API (+ other array protocols) compliant object.

Please read the design doc and let us know what you think. Your input will be very valuable!

> Figuring out how heavy of a lift sparse arrays will be is the main thing here.

pydata/sparse is supported. scipy.sparse needs to become array API compliant (which I think is on the cards? you'll know more!). Bottom line is we want to support any standards-conforming array library.
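
As a rough illustration of what "array API compliant" means at the object level: the array API standard has array objects expose an `__array_namespace__` method. The sketch below only checks that the method is present, which is far weaker than real conformance and is not necessarily how xarray or NamedArray decides what to wrap. `supports_array_api` and `ToyArray` are hypothetical names.

```python
import types


def supports_array_api(obj):
    """True if obj advertises the array API standard via __array_namespace__."""
    return callable(getattr(obj, "__array_namespace__", None))


class ToyArray:
    """Hypothetical stand-in for a standards-conforming array type."""

    def __array_namespace__(self, *, api_version=None):
        # A real library returns the namespace holding its array functions.
        return types.SimpleNamespace(name="toy_namespace")


print(supports_array_api(ToyArray()))  # True
print(supports_array_api([1, 2, 3]))   # False
```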

From the list in your initial post though, it seems like NamedArray isn't entirely what you want.

For hierarchies you'd want datatree (as noted), but that pulls xarray, which will pull pandas.
We haven't considered repeated dims yet, but I bet we could support some set of reasonable cases.
Categorical variables are interesting. Again, if there was some array standard compliant container, we'd want to be able to wrap that too.

Sep 27 '23 21:09 dcherian

@dcherian You can see here roughly what we have working at the moment for categoricals: https://github.com/scverse/anndata/pull/947/files#diff-3593f379977a83708f011798996a4e97ec3cf87f11055e3f93651a9718ae4db2R34 We also have something for nullable data types as well. Feedback welcome!

Sep 28 '23 15:09 ilan-gold

Follow up on this topic at https://github.com/jpivarski/ragged/discussions/6

Dec 30 '23 18:12 jpivarski

Just as a note, the scope of the ragged library does not cover what we are currently doing in scirpy (heavy use of RecordTypes), nor what @Zethson is planning in ehrapy (arbitrary nesting). So we'd likely need support for the full awkward array anyway.

Jan 03 '24 08:01 grst

Right—sorry for the confusion. If all the conversations linked to the new one, this one is perhaps the least related. I know that you've used missing data and even unions, which will not be supported by the ragged library.

Also, it's no minor thing that you've adapted AnnData to use Awkward: the work has been done. I think the users of the Ragged library would be wanting to make smaller changes to adopt something that looks like a normal array.

Jan 03 '24 13:01 jpivarski

All good! Thanks for keeping us in the loop of that discussion!

Jan 03 '24 13:01 grst
