Lambda indexer for anndata

Open grst opened this issue 2 years ago • 6 comments

With pandas data frames a pattern I use very commonly is

df_with_very_long_descriptive_name.loc[lambda x: x["fruit"] == "banana", :]

I was wondering if it would be possible to have lambda-based indexers for anndata as well, e.g.

adata_t = adata[lambda x: x.obs["cell_type"] == "T cell", :]

The benefits of this approach are imo:

  • no need for duplication of the object name (which becomes annoying if it is not just named adata)
  • reduction of copy&paste errors. I somewhat often end up with something like adata_t[adata.obs["something"], :], because I forget to update the second variable name.
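
To make the idea concrete, here is a minimal sketch of how a callable indexer could be resolved. It uses a toy `Table` class (a hypothetical stand-in, not anndata's implementation): when the index is callable, it is called with the object itself, and the result is used as a boolean mask. This is the same convention pandas documents for passing a callable to `.loc`.

```python
class Table:
    """Hypothetical stand-in for a table-like object (not anndata code)."""

    def __init__(self, rows):
        self.rows = list(rows)  # each row is a dict of column name -> value

    def column(self, name):
        return [row[name] for row in self.rows]

    def __getitem__(self, index):
        # A callable index is evaluated against the object being indexed,
        # so the caller never has to repeat the variable name.
        if callable(index):
            index = index(self)
        mask = list(index)
        if len(mask) != len(self.rows):
            raise IndexError("boolean mask length does not match number of rows")
        return Table(row for row, keep in zip(self.rows, mask) if keep)


cells_with_very_long_descriptive_name = Table(
    [{"cell_type": "T cell"}, {"cell_type": "B cell"}, {"cell_type": "T cell"}]
)
t_cells = cells_with_very_long_descriptive_name[
    lambda t: [ct == "T cell" for ct in t.column("cell_type")]
]
```

Because the lambda receives the object as its argument, the expression also works in the middle of a chain where the intermediate result has no name.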

grst avatar Apr 13 '22 17:04 grst

Why not simply

adata_t = adata[adata.obs.query("cell_type == 'T cell'").index]

?

dawe avatar Apr 13 '22 18:04 dawe

Because I need to type adata twice in the same expression ;)

grst avatar Apr 13 '22 18:04 grst

Great idea, pipelining does need to be possible.

flying-sheep avatar Apr 19 '22 10:04 flying-sheep

I think this has been discussed before, and I would definitely be in favor of something in this direction. I had thought maybe via a .select method.

I'm not sure I love lambda as the recommended way to do something like this, though. It's only situationally more concise than typing adata twice, and I think we could get more out of alternative approaches.

I've been wondering about doing something more like polars or datafusion. These would go great with path-based access too.

ivirshup avatar Apr 19 '22 17:04 ivirshup

It’s not about verbosity; it’s about reusing an object in a pipeline without interrupting the pipeline and writing imperative code.

But yes, declarative code like polars is even better!
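
For readers unfamiliar with the declarative style, here is a toy sketch of the idea. The `Expr` class and `col` helper are hypothetical names written for this illustration and are not the polars or datafusion API. An expression records what to compute and is only evaluated once it is handed the actual columns.

```python
import operator


class Expr:
    """Deferred computation over a mapping of column name -> list of values."""

    def __init__(self, fn):
        self._fn = fn

    def evaluate(self, columns):
        return self._fn(columns)

    def __eq__(self, other):
        # Build a new expression instead of comparing immediately.
        return Expr(
            lambda columns: [operator.eq(v, other) for v in self.evaluate(columns)]
        )

    def __and__(self, other):
        # Element-wise AND of two boolean expressions.
        return Expr(
            lambda columns: [
                a and b
                for a, b in zip(self.evaluate(columns), other.evaluate(columns))
            ]
        )


def col(name):
    """Refer to a column by name without needing the table yet."""
    return Expr(lambda columns: columns[name])


# The predicate is defined without naming any table ...
is_t_cell_in_batch_a = (col("cell_type") == "T cell") & (col("batch") == "a")

# ... and is applied to data only later.
obs = {"cell_type": ["T cell", "B cell", "T cell"], "batch": ["a", "a", "b"]}
mask = is_t_cell_in_batch_a.evaluate(obs)
```

Unlike a lambda, an expression object like this can be inspected before it runs, which is what would let a library decide which columns it actually needs to read.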

flying-sheep avatar Apr 20 '22 09:04 flying-sheep

Rough sketch of a potential API:

def select(
    self, 
    identifiers: Union[str, list[str]] = "*",
    *,
    obs: Optional[Idxer] = None,
    var: Optional[Idxer] = None,
    copy: bool = False,
):
    """
    Return a new AnnData with selected elements at selected indices.

    By default, does not copy data unless necessary.

    Usage
    -----

    >>> adata.select(
    ...     ["obsm/X_pca", "obs/cell_type"],
    ...     obs=po.col("cell_type") == "B Cell",
    ... )
    AnnData object with n_obs × n_vars = 342 × 13714
        obs: "cell_type"
        obsm: "X_pca"
        obsm: "X_pca"
    """
    ...
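
One small piece of this can be sketched independently of anndata: turning path-style identifiers such as "obsm/X_pca" into a per-slot list of keys. The helper name `group_paths` and the "slot/key" interpretation are assumptions made for this illustration. The "*" default is deliberately not handled, since its semantics are not specified in the sketch above.

```python
from collections import defaultdict


def group_paths(identifiers):
    """Group "slot/key" identifiers by slot (hypothetical helper)."""
    if isinstance(identifiers, str):
        identifiers = [identifiers]
    grouped = defaultdict(list)
    for identifier in identifiers:
        # Split on the first "/" only, so keys may themselves contain "/".
        slot, sep, key = identifier.partition("/")
        if not sep or not slot or not key:
            raise ValueError(f"expected 'slot/key', got {identifier!r}")
        grouped[slot].append(key)
    return dict(grouped)


selection = group_paths(["obsm/X_pca", "obs/cell_type"])
```

A mapping like this could then drive which elements of each slot are carried over into the returned object.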

ivirshup avatar Apr 20 '22 14:04 ivirshup