anndata icon indicating copy to clipboard operation
anndata copied to clipboard

first attempt to support awkward arrays

Open giovp opened this issue 2 years ago • 43 comments

First draft at supporting awkward arrays. As discussed with @ivirshup this would be useful for Squidpy @hspitzer (discussed also here #609 ) and potentially @Zethson EHR project.

Here's a walkthrough showcasing some basic functionality we could use for sub observation annotations (e.g. spatial coordinates of rna-molecules/segmentations).

Details
import numpy as np
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from numpy.random import default_rng
from sklearn.datasets import make_blobs
import awkward as ak

adata = sq.datasets.visium_hne_adata()
adata = adata[:10, :].copy()
adata.obsm["spatial"] = (
    adata.obsm["spatial"] - np.std(adata.obsm["spatial"], 0)
) / np.mean(adata.obsm["spatial"], 0)

# generate data
obs_list = []
rng = default_rng(42)
for idx in adata.obs_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[idx].obsm["spatial"],
        random_state=42,
    )
    obs_list.append(coord)

sub_obs = ak.Array(obs_list)

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].set_aspect("equal")
ax[0].scatter(x=adata.obsm["spatial"][:, 0], y=adata.obsm["spatial"][:, 1])
for i in range(adata.shape[0]):
    ax[1].scatter(
        x=sub_obs[i, :, 0],
        y=sub_obs[i, :, 1],
    )
    ax[1].axis("equal")

image

adata.obsm["sub_obs"] = sub_obs  # sub_obs is an awkward array
adata_subset = adata[:5]  # let's subset, it also works for `.copy()`

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].set_aspect("equal")
ax[0].scatter(
    x=adata_subset.obsm["spatial"][:, 0], y=adata_subset.obsm["spatial"][:, 1]
)
for i in range(adata_subset.shape[0]):
    ax[1].scatter(
        x=adata_subset.obsm["sub_obs"][i, :, 0],
        y=adata_subset.obsm["sub_obs"][i, :, 1],
    )
    ax[1].set_aspect("equal")

image

What fails atm is adata.concatenate() because of errors in reindexing of alternate axis https://github.com/theislab/anndata/blob/286bc7f207863964cb861f17c96ab24fe0cf72ac/anndata/_core/merge.py#L478

in awkward arrays there is no shape attribute so when array.shape[0] is needed we can resort to len(awkward_array) but for array.shape[1] we should (probably) simply skip it and concatenate. In awkward it would be like this:

ak.concatenate([sub_obs, sub_obs])
>>> <Array [[[1.25, 0.833], ... [0.697, 1.22]]] type='20 * var * var * float64'>

TODO

  • [x] wait for #554 to get merged and work on IO
  • [x] switch to v2 api of awkward. v1 EOL is in four months and bugs are not getting fixed already now
  • [ ] address newly added TODOs in the code
  • [ ] Implement outer joins during concatenation. Either implement inner/outer join logic or raise a corresponding warning
  • [x] AwkwardArrayView with behavior

Tests

  • [x] test that assigning an awkward v1 array fails
  • [x] test the dim_len function
  • [ ] tests for concatenating awkward array with None or MissingVal
  • [x] test awkward support in .uns
  • [x] add specific test for more complex awkward arrays
  • [x] add element-wise IO tests (check different types e.g. union, optiontype, ...)
  • [x] Add tests for getting and setting awkward arrays (e.g. adata.obsm["x"] = awk)
  • [ ] Add tests for concatenating
    • [ ] Proper reindexing during concatenation
    • [ ] When a.obsm["x"] is a awkward array and b.obsm["x"] isn't (error?)
    • [ ] What's the right fill value for when join=outer and a.obsm["x"] exists but b.obsm["x"] doesn't
    • [ ] Are values being compared correctly when merge="equal"

Docs

  • [ ] add examples
  • [x] show which types of awkward arrays are supported (need to be "regular" in the aligned dimensions)

giovp avatar Nov 14 '21 11:11 giovp

Codecov Report

Merging #647 (e524389) into master (283b0c1) will increase coverage by 0.08%. The diff coverage is 88.54%.

@@            Coverage Diff             @@
##           master     #647      +/-   ##
==========================================
+ Coverage   83.12%   83.21%   +0.08%     
==========================================
  Files          34       34              
  Lines        4416     4503      +87     
==========================================
+ Hits         3671     3747      +76     
- Misses        745      756      +11     
Impacted Files Coverage Δ
anndata/compat/__init__.py 80.97% <37.50%> (-2.08%) :arrow_down:
anndata/_core/views.py 87.90% <75.00%> (-0.99%) :arrow_down:
anndata/utils.py 83.33% <80.00%> (-0.60%) :arrow_down:
anndata/_core/aligned_mapping.py 94.02% <100.00%> (+0.09%) :arrow_up:
anndata/_core/anndata.py 83.41% <100.00%> (ø)
anndata/_core/index.py 92.85% <100.00%> (+0.35%) :arrow_up:
anndata/_core/merge.py 93.76% <100.00%> (+0.31%) :arrow_up:
anndata/_io/specs/methods.py 84.53% <100.00%> (+0.75%) :arrow_up:
anndata/tests/helpers.py 95.63% <100.00%> (+0.19%) :arrow_up:

codecov[bot] avatar Nov 14 '21 11:11 codecov[bot]

Really cool @giovp and everyone involved. I can already see a couple of use-cases for ehrapy and am eager to see this move forward. Do you expect the support for akward array to also require modifications for all/many of Scanpys algorithms is this mostly a drop-in replacement?

CC @imipenem

Zethson avatar Nov 14 '21 12:11 Zethson

Do you expect the support for akward array to also require modifications for all/many of Scanpys algorithms is this mostly a drop-in replacement?

I doubt it, also can't think of any scanpy function that could use such representation out of the box. We'd have multiple use cases in squidpy though! Essentially being able to slice/subset/concatenate and copy anndata preserving the 0-axis of an akward array would cover most fo the cases I can think of right now.

giovp avatar Nov 14 '21 13:11 giovp

@giovp, for concat, AFAICT you can't have a multi-dimensional awkward array, so there won't be an "alternative" axis. Also, I don't think there is a logical way to concatenate an awkward array with any other kind of array, so it should probably just error. I think this could be handled similar to the all dataframe case.

Could you modify gen_adata in anndata.tests.helpers to add awkward arrays into obsm and varm? I think that'll help catch a lot of bugs.

ivirshup avatar Nov 15 '21 11:11 ivirshup

@giovp, for concat, AFAICT you can't have a multi-dimensional awkward array, so there won't be an "alternative" axis. Also, I don't think there is a logical way to concatenate an awkward array with any other kind of array, so it should probably just error. I think this could be handled similar to the all dataframe case.

yeah you can't have multi-dimensonal akward array, but I think it would still be good to concatenate them across axis=0, and so this should be supported and hence escape the current alterante_axes check?

Could you modify gen_adata in anndata.tests.helpers to add awkward arrays into obsm and varm? I think that'll help catch a lot of bugs.

I will do that and run test locally, sorry for that

giovp avatar Nov 15 '21 12:11 giovp

yeah you can't have multi-dimensonal akward array, but I think it would still be good to concatenate them across axis=0, and so this should be supported and hence escape the current alterante_axes check?

Ahh yeah, this is what I meant. Basically have a case for all elements being awkward arrays.

ivirshup avatar Nov 15 '21 13:11 ivirshup

Ok, at this stage concat works both for inner and outer on obsm, and subsetting works for varm.

Example
import numpy as np
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from numpy.random import default_rng
from sklearn.datasets import make_blobs
import awkward as ak
import pandas as pd
from cycler import cycler

sc.set_figure_params()

adata = sq.datasets.visium_hne_adata()
varm = adata.obsm["spatial"][15:30, :]
adata = adata[:10, :15].copy()
adata.obsm["spatial"] = (
    adata.obsm["spatial"] - np.std(adata.obsm["spatial"], 0)
) / np.mean(adata.obsm["spatial"], 0)

adata.varm["spatial"] = varm.copy()
adata.varm["spatial"] = (
    adata.varm["spatial"] - np.std(adata.varm["spatial"], 0)
) / np.mean(adata.varm["spatial"], 0)

obs_list = []
var_list = []
rng = default_rng(42)
for idx in adata.obs_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[idx].obsm["spatial"],
        random_state=42,
    )
    obs_list.append(coord)

for idx in adata.var_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[:, idx].varm["spatial"],
        random_state=42,
    )
    var_list.append(coord)

sub_obs = ak.Array(obs_list)
sub_var = ak.Array(var_list)

def plot_points(adata, main: np.ndarray, sub: ak.Array, axis: int, cmap_name):

    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[1].set_prop_cycle(
        cycler("color", plt.get_cmap(cmap_name)(np.linspace(0, 1, len(main))))
    )
    ax[0].axis("equal")
    ax[0].scatter(
        x=main[:, 0], y=main[:, 1], c="grey", edgecolors="black", s=100, linewidths=1
    )
    for i in range(adata.shape[axis]):
        ax[1].scatter(
            x=sub[i, :, 0],
            y=sub[i, :, 1],
            edgecolors="black",
            alpha=0.7,
        )
        ax[1].axis("equal")

    return

plot_points(adata, adata.obsm["spatial"], sub_obs, 0, "winter")
plot_points(adata, adata.varm["spatial"], sub_var, 1, "cool")

adata.obsm

image

adata.varm

image

Slicing and copying also works

adata.obsm["sub_obs"] = sub_obs  # sub_obs is an awkward array
adata.varm["sub_var"] = sub_var
adata_subset = adata[:5, :7]  # let's subset

If I run test locally with pytest anndata/tests no tests fails, not sure, where shall I start looking?

giovp avatar Nov 15 '21 20:11 giovp

I think you just need to add awkward array to the testing dependencies

ivirshup avatar Nov 15 '21 21:11 ivirshup

Also, for gen_adata to return an object with awkward arrays in it, you've got to generate them and place them here:

https://github.com/theislab/anndata/blob/286bc7f207863964cb861f17c96ab24fe0cf72ac/anndata/tests/helpers.py#L111-L121

ivirshup avatar Nov 15 '21 21:11 ivirshup

Also, for gen_adata to return an object with awkward arrays in it, you've got to generate them and place them here:

🤦 🤦 🤦

So I'm at stage where I could fix+add tests for awkward array for ages and am happy to do so but would like to get a sense of how welcomed this PR is. As it stands, we'd have to include awkward array as a (optional) dependency and it would be a non-negligible addition to the code base (although I expected worse, with singledispatch made it quite slim).

A major thing that could be impactful is that there is no .view() or .copy() notion in awkward array, see here and here.

Overall, I think we could use it right away in Squidpy, mostly as a way to enable anndata slicing and indexing while retaining sub-obs or sub-var info. We would not really use it to do any arithmetics or stuff like that, although it would be cool to come up with function ideas and also the fact that it support numba and jax jitting is cool.

So, to summarize, shall I go ahead? @ivirshup @hspitzer @Zethson @michalk8 ?

giovp avatar Nov 16 '21 12:11 giovp

I'm all for it. Having more record like data has been requested a number of times.

I would note that there are some things in the awkward array api that will change (https://github.com/scikit-hep/awkward-1.0/discussions/1151), but that's mostly down the line stuff.

A major thing that could be impactful is that there is no .view() or .copy() notion in awkward array

I believe it does have copy, just not as a method. We might be able to get them to add that?

ivirshup avatar Nov 17 '21 10:11 ivirshup

TO DO:

  • [ ] wait for #554 to get merged and work on IO
  • [ ] add specific test for more complex awkward arrays

giovp avatar Nov 29 '21 13:11 giovp

@grst, if you wanted to try having variable length values per cell, this PR is starting to look pretty ready. Would appreciate any early feedback!

ivirshup avatar Mar 08 '22 14:03 ivirshup

I'm just catching up to all of this, but I can answer a few questions now.

I believe it does have copy, just not as a method. We might be able to get them to add that?

There's also __copy__. We want to minimize the number of methods on ak.Array to avoid conflicts with methods that downstream libraries add through the ak.behavior mechanism.

Do you want it to deeply copy (__deepcopy__)? __copy__ only makes a new Python object wrapping the tree of data buffers; it doesn't recurse down that tree or replace any of the buffers.

For what reason do you need to copy? That would be relevant if you're going to change a buffer in place—to avoid changes in other arrays that share the same buffers. (Awkward Array operations freely share buffers as views because all of the operations treat arrays as immutable.) I don't know of other reasons to copy without changing anything.

jpivarski avatar Mar 24 '22 16:03 jpivarski

I'd like to know more about this, though it's a lot to take in at once. Would you like to talk sometime so that I can learn more about anndata and you can ask questions about Awkward?

jpivarski avatar Mar 24 '22 17:03 jpivarski

I would note that there are some things in the awkward array api that will change (scikit-hep/awkward-1.0#1151), but that's mostly down the line stuff.

In the three levels of abstraction for Awkward:

  1. high-level, what data analysts see, ak.Array, ak.some_function, ak.some_other_function, slices, applying NumPy ufuncs, high-level Types, etc.
  2. mid-level, what projects like this one building on top of Awkward's public API see, such as .layout and Forms (finer granularity than Types),
  3. low-level, Awkward internals,

The high-level API is almost entirely unchanged, and the mid-level has some minor renamings for consistency and laziness (VirtualArray and PartitionedArray) are being removed in favor of dask-awkward. As long as you don't depend on lazy arrays, the transition will be easy or automatic.

You'll probably want to pin this implementation to awkward>=1,<2 so that if there are any method name changes needed for v2, anndata won't accidentally get an incompatible version of Awkward. After updating anndata, it can then be pinned to awkward>=2,<3. An alternative would be to try to support both with one version of anndata, which wouldn't be crazy, given the small scope of renamed functions. (We do similar things in our dependence on NumPy: it changes slowly enough that we can detect when names change like tostringtobytes.)

Another alternative would be to just aim for version 2 now:

import awkward._v2 as ak

The last time I checked (by counting number of lines of code expected), the version 2 submodule that we're shipping inside Awkward 1.8.0 is 95% complete.

jpivarski avatar Mar 24 '22 17:03 jpivarski

@jpivarski, sorry for the late response here! We've got a heavy mix of vacations, workshops, and deadlines over here at the moment 😅!


For your questions about copying, I don't think I understood the immutable part of awkward array at the time I made that comment. We are copying to protect against in place modifications. So, a shallow copy here should be fine.


I'd like to know more about this, though it's a lot to take in at once. Would you like to talk sometime so that I can learn more about anndata and you can ask questions about Awkward?

This sounds great. I'd particularly like to talk about on disk formats to figure out if we can re-use the representation for some new types (like per-observation polygons).

Let me talk with @giovp about availability and I'll follow up by email.


Another alternative would be to just aim for version 2 now

I'd probably lean towards this. Especially since we're still figuring out how exactly to use it.

ivirshup avatar Apr 04 '22 14:04 ivirshup

I started playing around with this and it looks really neat! :rocket:

I was wondering if it is planned to support obs with only some "awkward" columns?

adata = AnnData(
    obs=pd.DataFrame(index=["cell1", "cell2"]).assign(
        awkward_col=[ak.Array([1, 2, 3]), ak.Array([4, 5])],
        normal_col=list("AB")
    )
)
> adata.obs
      awkward_col normal_col
cell1   [1, 2, 3]          A
cell2      [4, 5]          B

I can create this object, but saving fails with

Stacktrace
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/projects/2022/anndata/anndata/_io/utils.py:214, in report_write_key_on_error.<locals>.func_wrapper(elem, key, val, *args, **kwargs)
    213 try:
--> 214     return func(elem, key, val, *args, **kwargs)
    215 except Exception as e:

File ~/projects/2022/anndata/anndata/_io/specs/registry.py:171, in write_elem(f, k, elem, modifiers, *args, **kwargs)
    167 if (
    168     hasattr(elem, "dtype")
    169     and (dest_type, (t, elem.dtype.kind), modifiers) in _REGISTRY.write
    170 ):
--> 171     _REGISTRY.get_writer(dest_type, (t, elem.dtype.kind), modifiers)(
    172         f, k, elem, *args, **kwargs
    173     )
    174 else:

File ~/projects/2022/anndata/anndata/_io/specs/registry.py:24, in write_spec.<locals>.decorator.<locals>.wrapper(g, k, *args, **kwargs)
     22 @wraps(func)
     23 def wrapper(g, k, *args, **kwargs):
---> 24     result = func(g, k, *args, **kwargs)
     25     g[k].attrs.setdefault("encoding-type", spec.encoding_type)

File ~/projects/2022/anndata/anndata/_io/specs/methods.py:347, in write_vlen_string_array(f, k, elem, dataset_kwargs)
    346 str_dtype = h5py.special_dtype(vlen=str)
--> 347 f.create_dataset(k, data=elem.astype(str_dtype), dtype=str_dtype, **dataset_kwargs)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/h5py/_hl/group.py:161, in Group.create_dataset(self, name, shape, dtype, data, **kwds)
    159         group = self.require_group(parent_path)
--> 161 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    162 dset = dataset.Dataset(dsid)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/h5py/_hl/dataset.py:159, in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, dapl, efile_prefix, virtual_prefix, allow_unknown_filter)
    158 if (data is not None) and (not isinstance(data, Empty)):
--> 159     dset_id.write(h5s.ALL, h5s.ALL, data)
    161 return dset_id

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5d.pyx:232, in h5py.h5d.DatasetID.write()

File h5py/_proxy.pyx:145, in h5py._proxy.dset_rw()

File h5py/_conv.pyx:444, in h5py._conv.str2vlen()

File h5py/_conv.pyx:95, in h5py._conv.generic_converter()

File h5py/_conv.pyx:249, in h5py._conv.conv_str2vlen()

TypeError: Can't implicitly convert non-string objects to strings

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Input In [56], in <cell line: 1>()
----> 1 adata.write_h5ad("test.h5ad")

File ~/projects/2022/anndata/anndata/_core/anndata.py:1918, in AnnData.write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1915 if filename is None:
   1916     filename = self.filename
-> 1918 _write_h5ad(
   1919     Path(filename),
   1920     self,
   1921     compression=compression,
   1922     compression_opts=compression_opts,
   1923     force_dense=force_dense,
   1924     as_dense=as_dense,
   1925 )
   1927 if self.isbacked:
   1928     self.file.filename = filename

File ~/projects/2022/anndata/anndata/_io/h5ad.py:98, in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
     96 elif adata.raw is not None:
     97     write_elem(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
---> 98 write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
     99 write_elem(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    100 write_elem(f, "obsm", dict(adata.obsm), dataset_kwargs=dataset_kwargs)

File ~/projects/2022/anndata/anndata/_io/utils.py:214, in report_write_key_on_error.<locals>.func_wrapper(elem, key, val, *args, **kwargs)
    211 @wraps(func)
    212 def func_wrapper(elem, key, val, *args, **kwargs):
    213     try:
--> 214         return func(elem, key, val, *args, **kwargs)
    215     except Exception as e:
    216         if "Above error raised while writing key" in format(e):

File ~/projects/2022/anndata/anndata/_io/specs/registry.py:175, in write_elem(f, k, elem, modifiers, *args, **kwargs)
    171     _REGISTRY.get_writer(dest_type, (t, elem.dtype.kind), modifiers)(
    172         f, k, elem, *args, **kwargs
    173     )
    174 else:
--> 175     _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)

File ~/projects/2022/anndata/anndata/_io/specs/registry.py:24, in write_spec.<locals>.decorator.<locals>.wrapper(g, k, *args, **kwargs)
     22 @wraps(func)
     23 def wrapper(g, k, *args, **kwargs):
---> 24     result = func(g, k, *args, **kwargs)
     25     g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     26     g[k].attrs.setdefault("encoding-version", spec.encoding_version)

File ~/projects/2022/anndata/anndata/_io/specs/methods.py:544, in write_dataframe(f, key, df, dataset_kwargs)
    541 write_elem(group, index_name, df.index._values, dataset_kwargs=dataset_kwargs)
    542 for colname, series in df.items():
    543     # TODO: this should write the "true" representation of the series (i.e. the underlying array or ndarray depending)
--> 544     write_elem(group, colname, series._values, dataset_kwargs=dataset_kwargs)

File ~/projects/2022/anndata/anndata/_io/utils.py:220, in report_write_key_on_error.<locals>.func_wrapper(elem, key, val, *args, **kwargs)
    218 else:
    219     parent = _get_parent(elem)
--> 220     raise type(e)(
    221         f"{e}\n\n"
    222         f"Above error raised while writing key {key!r} of {type(elem)} "
    223         f"to {parent}"
    224     ) from e

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'awkward_col' of <class 'h5py._hl.group.Group'> to /

Using an awkward array instead of a dataframe for obs doesn't work either:

adata = AnnData(
    obs=ak.Array({"awkward_col": [[1, 2, 3], [4, 5]], "normal_col": list("AB")})
)
Stacktrace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [58], in <cell line: 1>()
----> 1 adata = AnnData(
      2     obs=ak.Array({"awkward_col": [[1,2,3], [4,5]], "normal_col": list("AB")})
      3 )

File ~/projects/2022/anndata/anndata/_core/anndata.py:291, in AnnData.__init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    289     self._init_as_view(X, oidx, vidx)
    290 else:
--> 291     self._init_as_actual(
    292         X=X,
    293         obs=obs,
    294         var=var,
    295         uns=uns,
    296         obsm=obsm,
    297         varm=varm,
    298         raw=raw,
    299         layers=layers,
    300         dtype=dtype,
    301         shape=shape,
    302         obsp=obsp,
    303         varp=varp,
    304         filename=filename,
    305         filemode=filemode,
    306     )

File ~/projects/2022/anndata/anndata/_core/anndata.py:497, in AnnData._init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    494                 raise ValueError("`shape` is inconsistent with `var`")
    496 # annotations
--> 497 self._obs = _gen_dataframe(obs, self._n_obs, ["obs_names", "row_names"])
    498 self._var = _gen_dataframe(var, self._n_vars, ["var_names", "col_names"])
    500 # now we can verify if indices match!

File ~/anaconda3/envs/test_awkward/lib/python3.9/functools.py:888, in singledispatch.<locals>.wrapper(*args, **kw)
    884 if not args:
    885     raise TypeError(f'{funcname} requires at least '
    886                     '1 positional argument')
--> 888 return dispatch(args[0].__class__)(*args, **kw)

File ~/projects/2022/anndata/anndata/_core/anndata.py:114, in _gen_dataframe(anno, length, index_names)
    108     if index_name in anno:
    109         return pd.DataFrame(
    110             anno,
    111             index=anno[index_name],
    112             columns=[k for k in anno.keys() if k != index_name],
    113         )
--> 114 return pd.DataFrame(anno, index=pd.RangeIndex(0, length, name=None).astype(str))

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/pandas/core/frame.py:708, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    705 if not isinstance(data, (abc.Sequence, ExtensionArray)):
    706     if hasattr(data, "__array__"):
    707         # GH#44616 big perf improvement for e.g. pytorch tensor
--> 708         data = np.asarray(data)
    709     else:
    710         data = list(data)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/highlevel.py:1351, in Array.__array__(self, *args, **kwargs)
   1326 def __array__(self, *args, **kwargs):
   1327     """
   1328     Intercepts attempts to convert this Array into a NumPy array and
   1329     either performs a zero-copy conversion or raises an error.
   (...)
   1349     cannot be sliced as dimensions.
   1350     """
-> 1351     return ak._connect._numpy.convert_to_array(self.layout, args, kwargs)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/_connect/_numpy.py:13, in convert_to_array(layout, args, kwargs)
     12 def convert_to_array(layout, args, kwargs):
---> 13     out = ak.operations.convert.to_numpy(layout, allow_missing=False)
     14     if args == () and kwargs == {}:
     15         return out

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/operations/convert.py:329, in to_numpy(array, allow_missing)
    327 if array.numfields == 0:
    328     return numpy.empty(len(array), dtype=[])
--> 329 contents = [
    330     to_numpy(array.field(i), allow_missing=allow_missing)
    331     for i in range(array.numfields)
    332 ]
    333 if any(len(x.shape) != 1 for x in contents):
    334     raise ValueError(
    335         f"cannot convert {array} into np.ndarray"
    336         + ak._util.exception_suffix(__file__)
    337     )

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/operations/convert.py:330, in <listcomp>(.0)
    327 if array.numfields == 0:
    328     return numpy.empty(len(array), dtype=[])
    329 contents = [
--> 330     to_numpy(array.field(i), allow_missing=allow_missing)
    331     for i in range(array.numfields)
    332 ]
    333 if any(len(x.shape) != 1 for x in contents):
    334     raise ValueError(
    335         f"cannot convert {array} into np.ndarray"
    336         + ak._util.exception_suffix(__file__)
    337     )

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/operations/convert.py:324, in to_numpy(array, allow_missing)
    321     return out[: shape[0] * array.size].reshape(shape)
    323 elif isinstance(array, ak._util.listtypes):
--> 324     return to_numpy(array.toRegularArray(), allow_missing=allow_missing)
    326 elif isinstance(array, ak._util.recordtypes):
    327     if array.numfields == 0:

ValueError: in ListOffsetArray64, cannot convert to RegularArray because subarray lengths are not regular

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

The same works flawlessly for .obsm, also with saving

adata = AnnData(
    obs=pd.DataFrame(index=["cell1", "cell2"]),
    obsm={"awkward": ak.Array({"awkward_col": [[1, 2, 3], [4, 5]], "normal_col": list("AB")})}
)
adata.write_h5ad("test.h5ad")
adata = anndata.read_h5ad("test.h5ad")
adata.obsm["awkward"]
<Array [{awkward_col: [1, ... normal_col: 'B'}] type='2 * {"awkward_col": var * ...'>

grst avatar Jun 19 '22 19:06 grst

This is very interesting! We had been planning to reintroduce "Awkward Array as a Pandas column" by creating a new library that is external to awkward itself, so that

  • awkward would not require pandas as a dependency
  • the wrapper can extend pandas.api.extensions.ExtensionArray
  • the many methods and properties ExtensionArray requires would be in the wrapper, not ak.Array itself (ak.Array shouldn't, for example, have a shape)

I knew that you were planning to add Awkward Arrays as an array type to AnnData, but I didn't realize that it would be through Pandas (something that I didn't know about AnnData). Maybe this is how we can satisfy this requirement, and promote AnnData as the way to get Awkward Arrays in Pandas?

Regarding the usages above,

awkward_col=[ak.Array([1, 2, 3]), ak.Array([4, 5])]

is not scalable: this length-2 list represents something that scales with the size of the dataset, and ak.Array has a large per-object overhead. (The tree describing the data structures in the array scales with the complexity of the data type, but not with the number of entries in the data. So if you have a lot of ak.Array objects, that type description is duplicated, but if you have one ak.Array object with lots of entries, it's not. The same could be said of NumPy arrays, which have shape, strides, flags, etc. for every array object, but ak.Arrays are more complex than np.ndarrays.)

obs=ak.Array({"awkward_col": [[1, 2, 3], [4, 5]], "normal_col": list("AB")})

or

awkward_col=ak.Array([[1, 2, 3], [4, 5]]), normal_col=list("AB")

would be the right way to do it. In the first of these, "obs" is the only column[^1] and the "awkward_col"/"normal_col" structure is inside the ak.Array; in the second, "awkward_col" and "normal_col" are columns of the AnnData. For ak.Array, whether the internal structure is lists of numbers or records whose fields are lists of numbers and letters is an internal matter.

The first stacktrace seems to be in AnnData; its attempt to find an h5py writer for ak.Array fails when—maybe—some identifying code thinks it's a string and then the writing code realizes it's not a string? Awkward Arrays can be written to HDF5, but they have to be explicitly deconstructed and reconstructed, as described here. Or they could be pickled, and Awkward version 2 arrays (import awkward._v2 as ak, which I recommend developing on because version 1 will be dropped in December) supports Python 3.8+'s pickle protocol 5 with out-of-band data (a feature inherited through NumPy—they did all the work). ak.Array.__setstate__/__getstate__ (pickle) is using the method described in the "how to save to HDF5" documentation, linked above.

The second stacktrace comes from awkward itself; it's an indication that ak.Array.__array__ is being called on data that can't be converted to NumPy. The array that @grst created,

>>> array = ak.Array({"awkward_col": [[1, 2, 3], [4, 5]], "normal_col": list("AB")})
>>> array
<Array [{awkward_col: [1, ... normal_col: 'B'}] type='2 * {"awkward_col": var * ...'>

is perfectly valid and is a better pattern than [ak.Array([1, 2, 3]), ak.Array([4, 5])]. However,

>>> np.asarray(array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/highlevel.py", line 1351, in __array__
    return ak._connect._numpy.convert_to_array(self.layout, args, kwargs)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_connect/_numpy.py", line 13, in convert_to_array
    out = ak.operations.convert.to_numpy(layout, allow_missing=False)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/operations/convert.py", line 329, in to_numpy
    contents = [
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/operations/convert.py", line 330, in <listcomp>
    to_numpy(array.field(i), allow_missing=allow_missing)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/operations/convert.py", line 324, in to_numpy
    return to_numpy(array.toRegularArray(), allow_missing=allow_missing)
ValueError: in ListOffsetArray64, cannot convert to RegularArray because subarray lengths are not regular

The subarray lengths are not regular: the first entry of "awkward_col" has length 3 and the second has length 2.

>>> array["awkward_col", 0]
<Array [1, 2, 3] type='3 * int64'>
>>> array["awkward_col", 1]
<Array [4, 5] type='2 * int64'>

By contrast, it would work if the ak.Array happened to be regular:

>>> regular_array = ak._v2.Array([[1, 2], [4, 5]])
>>> np.asarray(regular_array)
array([[1, 2],
       [4, 5]])

but those are only the cases in which you could be using NumPy instead of Awkward Array, anyway.

It's interesting that .obsm (mapping) does work. It must be avoiding the cast-as-NumPy that .obs is hitting. Maybe it's pickling the values?

[^1]: I'm reading up on AnnData, having just realized that obs (and obsm) are special names with special meanings, not arbitrary column names.

jpivarski avatar Jun 20 '22 13:06 jpivarski

Thanks for the detailed write up, as always, @jpivarski.

It would be super great if columns in dataframes could be ragged arrays via awkward array. I wonder how exactly you would want to manage this though. Would you want "nested columns"? Or should those be exposed at separate columns in the dataframe?

It's interesting that .obsm (mapping) does work. It must be avoiding the cast-as-NumPy that .obs is hitting. Maybe it's pickling the values?

We dispatch our writers on the type of the element passed. Since we can store awkward arrays in obsm, we can dispatch the writers. When I try to put an awkward array in a dataframe, it is converted to a object array. The only kind of object dtype array we know how to write is arrays of strings, so an error is thrown.

If there was an pandas extension type for awkward arrays, this wouldn't be an issue.

One big use case for ragged columns in dataframes for us would be lists of genomic ranges, like bioconductors GRangesList object.

ivirshup avatar Jun 20 '22 14:06 ivirshup

This is an example of how an Awkward Array can be wrapped as a Pandas column:

https://github.com/martindurant/awkward_extras/tree/main/awkward_pandas

(See also AwkwardSeries and tests.) Pandas ExtensionArrays can only be one-dimensional, so we would give it the ak.Array as an opaque object. Pandas doesn't get to see if it has any record fields—what might be considered "nested columns."

Actually, what would be best for you—to internalize the Awkward-Pandas wrapping into AnnData or to have it be a separate project that you can use? (The latter option would be three libraries: awkward, awkward-pandas, and anndata.)

@martindurant's code would need a little modernizing—it's from Oct 2020—and it can take advantage of @douglasdavis's https://github.com/ContinuumIO/dask-awkward/ .

jpivarski avatar Jun 20 '22 23:06 jpivarski

That looks quite promising! Thanks for pointing it out.

I think it would make more sense externally. I can imagine use cases where people would want ragged arrays in their dataframe, but not AnnData.

ivirshup avatar Jun 21 '22 11:06 ivirshup

There are obviously some thing that don't translate too well from pandas to an awkward extension array (groupby and any sorting spring to mind). Also, we ought to make an effort that any selection on an awkward column that results in one value per row should probably become a Series, and any operation resulting in a non-nested record per row could optionally become a dataframe. A direct .to_dataframe for awkward array of- or a single record may already exist.

martindurant avatar Jun 24 '22 18:06 martindurant

The pandas extension sounds great! It also sounds like a lot of work though :/

How hard would it be to implement awkward array support for .X? In https://github.com/scverse/scirpy/issues/327#issuecomment-1172977130 it was suggested to store AIRR data in .X. When combining this with mudata, this would actually make a lot of sense IMO.

grst avatar Jul 05 '22 11:07 grst

In scverse/scirpy#327 (comment) it was suggested to store AIRR data in .X.

Following this comment, I gather that what you mean by ".X" is a BumpyMatrix, which is a 3-dimensional array, ragged in the third dimension. (I had been expecting likelihood distributions over alleles to be represented this way, and the ability to mix regular and ragged list types in Awkward was added specifically in anticipation of someday dealing with genomes.) Also from the comment, you want this to be equivalent to a Pandas MultiIndex of three levels: ['cell_id', 'locus', 'sequence_id']. We can do that.

Here's a generic 3-dimensional ragged array; the type is 3 * var * var * float64, meaning that the array has length 3 and contains variable-length lists of variable-length lists of floats. The ak.to_pandas function converts the raggedness into MultiIndex levels. This isn't the "Pandas wrapping" we had been talking about because it breaks the array down into scalar Pandas cells, leaving the structure in the index, and we had been talking about wrapping the Awkward Array in a way that the variable-length lists themselves can be put into Pandas cells non-destructively. But sometimes, you want one thing, and sometimes, you want the other, and this one has already been implemented.

>>> import awkward._v2 as ak   # v2 repr looks better; all of this works with v1

>>> bumpy = ak.Array([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [], [[5.5], [6.6, 7.7, 8.8, 9.9]]])
>>> bumpy
<Array [[[0, 1.1, 2.2], ..., [3.3, ...]], ...] type='3 * var * var * float64'>

>>> bumpy.show()
[[[0, 1.1, 2.2], [], [3.3, 4.4]],
 [],
 [[5.5], [6.6, 7.7, 8.8, 9.9]]]

>>> ak.to_pandas(bumpy, levelname=lambda i: ["cell_id", "locus", "sequence_id"][i])
                           values
cell_id locus sequence_id        
0       0     0               0.0
              1               1.1
              2               2.2
        2     0               3.3
              1               4.4
2       0     0               5.5
        1     0               6.6
              1               7.7
              2               8.8
              3               9.9

This can also be done for data with raggedness in only one of the dimensions. Here, I construct an array with lists and use ak.to_regular to make one of the dimensions regular: the type is 2 * 3 * var * float64. If you're coming from a specialized file format that has raggedness built into only one dimension, then we can construct this type directly. (Ragged dimensions need an index to say where each element begins, but regular dimensions are computed the same way as a NumPy stride, and therefore uses less memory.)

>>> somewhat_bumpy = ak.to_regular(
...     ak.Array([[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5], [6.6, 7.7, 8.8, 9.9], []]]),
...     axis=1,
... )
>>> somewhat_bumpy
<Array [[[0, 1.1, 2.2], [], [3.3, ...]], ...] type='2 * 3 * var * float64'>

>>> somewhat_bumpy.show()
[[[0, 1.1, 2.2], [], [3.3, 4.4]],
 [[5.5], [6.6, 7.7, 8.8, 9.9], []]]

>>> ak.to_pandas(somewhat_bumpy, levelname=lambda i: ["cell_id", "locus", "sequence_id"][i])
                           values
cell_id locus sequence_id        
0       0     0               0.0
              1               1.1
              2               2.2
        2     0               3.3
              1               4.4
1       0     0               5.5
        1     0               6.6
              1               7.7
              2               8.8
              3               9.9

This method could be hard to name when there are also Awkward Arrays directly in Pandas columns: maybe it should be called to_exploded_pandas or similar? At present, it's the only way to turn Awkward Arrays into Pandas DataFrames.

jpivarski avatar Jul 05 '22 14:07 jpivarski

Hi @jpivarski,

thanks for another extensive explanation. This actually goes one step further in the discussion than I had expected, but this helps a lot for the implementation of the AIRR datastructure. This looks pretty much like what I need!

Following this comment, I gather that what you mean by ".X"

With .X I was merely referring to the "main data matrix" slot in AnnData. Currently, this can only be a two dimensional numpy array (variables x observations). This is most typically a gene expression matrix, but may contain other "omics" data as we are moving to multimodal data. My question was more directed to @giovp or @ivirshup how hard it would be to support an awkward array instead of a numpy array in this slot. In my case, the awkward array would indeed have two fixed and one variable dimension, s.t. it could be aligned to an .obs and a .var data frame and have the usual .var_names and .obs_names indices.


For reference, here's an overview of the AnnData data structure: image

  • obs/var are pandas data frames aligned to the two axes of the .X array
  • obsm/varm are numpy arrays with one dimension aligned with the two axes of the .X array
  • .var_names and .obs_names are aliases for .var.index and .obs.index, respectively

The current draft implementation in this PR only supports awkward arrays in .obsm.

grst avatar Jul 05 '22 15:07 grst

@jpivarski, one more thing I haven't fully understood is how an ak.Array compares to an ak.Record.

E.g. AIRR Rearrangement data is essentially a list of chain-dictionaries for each cell. Every cell can have 0-n chains, every chain 1-n fields, but every chain has the same fields. Every field can have different data types.

dict_cell_chain_field = {
    "cell1": [
        {"locus": "TRA", "v_call": "TRAV1", "cdr3_length": 15}, # <-- represents one chain
        {"locus": "TRB", "v_call": "TRBV1", "cdr3_length": 12},
    ],
    "cell2": [{"locus": "TRA", "v_call": "TRAV1", "cdr3_length": 13}],
    "cell3": [],
}

Now I could represent this as a record

r = ak.Record(dict_cell_chain_field)

which works and preserves the field names. However, while I can make the first dimension regular

r = ak.to_regular(r, 0)

I can't make the third dimension regular (although all dicts have the same fields)

# AttributeError: 'awkward._ext.Record' object has no attribute 'axis_wrap_if_negative'
r = ak.to_regular(r, 2)
Stacktrace
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [127], in <cell line: 1>()
----> 1 ak.to_regular(r_cell_field_chain, 1)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/operations/structure.py:800, in to_regular(array, axis, highlevel, behavior)
    798 out = ak.operations.convert.to_layout(array)
    799 if axis != 0:
--> 800     out = ak._util.recursively_apply(
    801         out,
    802         getfunction,
    803         pass_depth=True,
    804         pass_user=True,
    805         user=axis,
    806         numpy_to_regular=True,
    807     )
    809 return ak._util.maybe_wrap_like(out, array, behavior, highlevel)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/_util.py:1236, in recursively_apply(layout, getfunction, pass_depth, pass_user, user, keep_parameters, numpy_to_regular)
   1231     else:
   1232         return transform_child_layouts(
   1233             transform, layout, depth, user=custom, keep_parameters=keep_parameters
   1234         )
-> 1236 return transform(layout, 1, user)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/_util.py:1227, in recursively_apply.<locals>.transform(layout, depth, user)
   1224 if pass_user:
   1225     args = args + (user,)
-> 1227 custom = getfunction(layout, *args)
   1228 if callable(custom):
   1229     return custom()

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/awkward/operations/structure.py:786, in to_regular.<locals>.getfunction(layout, depth, posaxis)
    785 def getfunction(layout, depth, posaxis):
--> 786     posaxis = layout.axis_wrap_if_negative(posaxis)
    787     if posaxis == depth and isinstance(layout, ak.layout.RegularArray):
    788         return lambda: layout

AttributeError: 'awkward._ext.Record' object has no attribute 'axis_wrap_if_negative'

Alternatively, I could represent the same data as ak.Array

list_cell_chain_field = [
    [["TRA", "TRAV1", 15], ["TRB", "TRBV1", 12]],
    [["TRA", "TRAV1", 13]],
    [],
]
a = ak.Array(list_cell_chain_field)
# this works
a = ak.to_regular(a, 0)
a = ak.to_regular(a, 2)

(Eventually, I would probably want to switch the second and third dimension here, s.t. the first two dimensions are fixed)


  • Are there any benefits of one over the other?
  • Should AnnData support both records and arrays? I mean, named records are nice, but then again the names in the first (or first two) dimensions are handled through obs and var metadata anyway.

grst avatar Jul 18 '22 12:07 grst

Benefits of one over the other: if "cell1, cell2, cell3..." are part of a sequence, especially if it's a dimension that can grow large, then it should be an ak.Array, rather than an ak.Record. There is an internal distinction between them: all fields of a record are in distinct buffers, but the data in an array consists of contiguous data. Fields of a record are managed by Python code, whereas loops over an array are computed in compiled code. Fields of a record can all have distinct data types, whereas data in an array must have a single type. If all cells of an array would always have the same type (as in your example above), I think you'd want them to be in an ak.Array.

Nevertheless, the axis_wrap_if_negative error is an oversight in Awkward Array; I've filed a bug report: https://github.com/scikit-hep/awkward/issues/1551 Even if some of these cases aren't meaningful (e.g. axis=0 on an ak.Record, since a record is a scalar, there's no top-level array to pass the function across), they should have error messages that say so, instead of revealing that we implemented this method on array-like data and not record-like data.

jpivarski avatar Jul 18 '22 14:07 jpivarski

Codecov Report

Merging #647 (3883bb0) into master (511cdba) will increase coverage by 0.59%. The diff coverage is 91.55%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #647      +/-   ##
==========================================
+ Coverage   83.69%   84.29%   +0.59%     
==========================================
  Files          34       34              
  Lines        4508     4718     +210     
==========================================
+ Hits         3773     3977     +204     
- Misses        735      741       +6     
Impacted Files Coverage Δ
anndata/compat/__init__.py 86.20% <37.50%> (-2.35%) :arrow_down:
anndata/_core/file_backing.py 90.24% <85.71%> (-0.55%) :arrow_down:
anndata/utils.py 84.66% <86.00%> (+0.59%) :arrow_up:
anndata/_core/views.py 88.95% <93.54%> (+0.98%) :arrow_up:
anndata/_core/aligned_mapping.py 94.00% <93.75%> (-0.14%) :arrow_down:
anndata/_core/merge.py 94.22% <95.34%> (+0.31%) :arrow_up:
anndata/tests/helpers.py 96.01% <97.43%> (+0.22%) :arrow_up:
anndata/__init__.py 100.00% <100.00%> (ø)
anndata/_core/anndata.py 82.81% <100.00%> (ø)
anndata/_core/index.py 94.61% <100.00%> (+1.81%) :arrow_up:
... and 6 more

codecov[bot] avatar Jul 19 '22 07:07 codecov[bot]

Tests currently fail (among other things) because https://github.com/scikit-hep/awkward/issues/1557 leads to inconsistent dimensions during concatenation. Need to update to v2 API of awkward and wait for this to be fixed.

I collected the TODOs scattered over this thread in the top-level comment.

grst avatar Jul 20 '22 06:07 grst