anndata
Lambda indexer for anndata
With pandas data frames a pattern I use very commonly is
df_with_very_long_descriptive_name.loc[lambda x: x["fruit"] == "banana", :]
I was wondering if it would be possible to have lambda-based indexers for anndata as well, e.g.
adata_t = adata[lambda x: x.obs["cell_type"] == "T cell", :]
The benefits of this approach are imo:
- no need for duplication of the object name (which becomes annoying if it is not just named adata)
- reduction of copy&paste errors. I somewhat often end up with something like adata_t[adata.obs["something"], :], because I forget to update the second variable name.
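
Until something like this exists, the behaviour requested above can be approximated with a small wrapper. This is only a sketch: index_with is a hypothetical helper (not part of anndata), and Cells is a toy stand-in used so the example runs without any dependencies.

```python
def index_with(obj, *keys):
    """Index obj, first calling any callable key with obj itself.

    Mirrors how pandas' .loc treats callables. Hypothetical helper,
    not part of anndata.
    """
    resolved = tuple(key(obj) if callable(key) else key for key in keys)
    # A single key is passed bare; several keys are passed as a tuple,
    # which is what obj[a, b] does.
    return obj[resolved[0] if len(resolved) == 1 else resolved]


class Cells:
    """Toy stand-in for an AnnData-like object; supports boolean masks only."""

    def __init__(self, cell_types):
        self.cell_types = list(cell_types)

    def __getitem__(self, mask):
        kept = [t for t, keep in zip(self.cell_types, mask) if keep]
        return Cells(kept)


cells_with_a_long_descriptive_name = Cells(["T cell", "B cell", "T cell"])
t_cells = index_with(
    cells_with_a_long_descriptive_name,
    lambda x: [t == "T cell" for t in x.cell_types],  # object named only once
)
print(t_cells.cell_types)
```

With a real AnnData object the same helper would presumably be called as index_with(adata, lambda x: x.obs["cell_type"] == "T cell", slice(None)), relying on the boolean-mask indexing already used in this thread.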
Why not simply
adata_t = adata[adata.obs.query("cell_type == 'T cell'").index]
?
Because I need to type adata twice in the same expression ;)
Great idea, pipelining does need to be possible.
I think this has been discussed before, and I would definitely be in favor of something in this direction. I had thought maybe via a .select method.
I'm not sure I love lambda as the recommended way to do something like this though. It's only situationally more concise than typing adata twice, and I think we could get more out of alternative approaches.
I've been wondering about doing something more like polars or datafusion. These would go great with path-based access too.
It's not about verbosity, it's about reusing an object in a pipeline without interrupting the pipeline and writing imperative code.
But yes, declarative code like polars is even better!
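
To make the difference from a lambda concrete, here is a toy sketch of the declarative style. It is not how polars is implemented; col and Predicate are hypothetical names invented for illustration. The point is that the filter is built as a description first and only later evaluated against some data.

```python
class Predicate:
    """A deferred comparison: column == value, evaluated later."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def evaluate(self, table):
        # table is any mapping from column name to a list of values.
        return [v == self.value for v in table[self.column]]


class col:
    """Toy column reference, loosely modelled on polars-style expressions."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Build a description of the filter instead of comparing right away.
        return Predicate(self.name, value)


is_b_cell = col("cell_type") == "B Cell"  # no data object is named here
obs = {"cell_type": ["B Cell", "T cell", "B Cell"]}
mask = is_b_cell.evaluate(obs)
print(mask)
```

Unlike a lambda, such an expression can be inspected before it runs (for example to see which columns it needs), which is one reason it might pair well with path-based access.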
Rough sketch of a potential API:
from typing import Optional, Union

# Idxer is a placeholder for an index-expression type; it is not defined in this sketch.
def select(
    self,
    identifiers: Union[str, list[str]] = "*",
    *,
    obs: Optional[Idxer] = None,
    var: Optional[Idxer] = None,
    copy: bool = False,
):
    """
    Return a new AnnData with selected elements at selected indices.

    By default, does not copy data unless necessary.

    Usage
    -----
    >>> adata.select(
    ...     ["obsm/X_pca", "obs/cell_type"],
    ...     obs=po.col("cell_type") == "B Cell",
    ... )
    AnnData object with n_obs × n_vars = 342 × 13714
        obs: "cell_type"
        obsm: "X_pca"
    """
    ...
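
One small piece of such a method could be turning the identifier paths into per-attribute keys. The helper below is a hypothetical sketch of that step only: group_paths is an invented name, and the "*" default is deliberately rejected because its semantics are not spelled out above.

```python
def group_paths(paths):
    """Group "attr/key" paths by attribute, keeping the order given."""
    if isinstance(paths, str):
        paths = [paths]
    grouped = {}
    for path in paths:
        attr, sep, key = path.partition("/")
        if not sep or not key:
            # Covers "*" and bare attribute names, which this sketch does not handle.
            raise ValueError(f"expected 'attr/key', got {path!r}")
        grouped.setdefault(attr, []).append(key)
    return grouped


selected = group_paths(["obsm/X_pca", "obs/cell_type"])
print(selected)
```

A select implementation could then walk this mapping and copy or view only the listed elements of each attribute.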