anndata
Lambda indexer for anndata
With pandas data frames, a pattern I use very often is:
df_with_very_long_descriptive_name.loc[lambda x: x["fruit"] == "banana", :]
I was wondering if it would be possible to have lambda-based indexers for anndata as well, e.g.
adata_t = adata[lambda x: x.obs["cell_type"] == "T cell", :]
The benefits of this approach are imo:
- no need to duplicate the object name (which becomes annoying if it is not just named adata)
- fewer copy-and-paste errors. I fairly often end up with something like adata_t[adata.obs["something"], :], because I forget to update the second variable name.
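To make the request concrete, here is a minimal sketch of how a container could resolve a callable index, in the same way pandas does for `.loc`. It uses a toy stand-in class, not AnnData itself: `MiniData`, its `obs` dictionary, and its `n_obs` property are hypothetical names chosen for illustration, and nothing here reflects how anndata is or would be implemented.

```python
from typing import Callable, Dict, List, Union


class MiniData:
    """Toy stand-in for AnnData (hypothetical): only an `obs` column mapping."""

    def __init__(self, obs: Dict[str, list]):
        self.obs = obs

    @property
    def n_obs(self) -> int:
        # Number of observations = length of any column (0 if there are none).
        return len(next(iter(self.obs.values()), []))

    def __getitem__(
        self, index: Union[Callable[["MiniData"], List[bool]], List[bool]]
    ) -> "MiniData":
        # A callable index is evaluated against *this* object first, so the
        # caller never has to repeat the variable name inside the brackets.
        if callable(index):
            index = index(self)
        mask = list(index)
        if len(mask) != self.n_obs:
            raise IndexError("boolean mask length must match the number of observations")
        return MiniData(
            {
                name: [value for value, keep in zip(values, mask) if keep]
                for name, values in self.obs.items()
            }
        )


data_with_very_long_descriptive_name = MiniData(
    obs={"cell_type": ["T cell", "B cell", "T cell"], "sample": ["a", "b", "c"]}
)

# The object name appears only once; the lambda receives the object as `x`.
t_cells = data_with_very_long_descriptive_name[
    lambda x: [cell_type == "T cell" for cell_type in x.obs["cell_type"]]
]
```

Because the lambda is called with the object being indexed, renaming the variable cannot leave a stale reference inside the brackets, which is the copy-and-paste error described above.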
Why not simply
adata_t = adata[adata.obs.query("cell_type == 'T cell'").index]
?
Because I need to type adata twice in the same expression ;)
Great idea, pipelining does need to be possible.
I think this has been discussed before, and I would definitely be in favor of something in this direction. I had thought maybe via a .select method.
I'm not sure I love lambda as the recommended way to do something like this though. It's only situationally more concise than typing adata twice, and I think we could get more out of alternative approaches.
I've been wondering about doing something more like polars or datafusion. These would go great with path based access too.
It’s not about verbosity, it’s about reusing an object in a pipeline without interrupting the pipeline to write imperative code.
But yes, declarative code like polars is even better!
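As a rough illustration of what "declarative" means here, the sketch below builds a deferred expression that is only evaluated when it is applied to a table, loosely in the spirit of polars expressions. Everything in it is an assumption made for illustration: `Expr`, `col`, and `select` are hypothetical toy names, the table is a plain dictionary of columns, and this is neither the polars API nor a proposed anndata one.

```python
from typing import Any, Callable, Dict, List

Table = Dict[str, List[Any]]


class Expr:
    """A deferred computation over a table of columns (hypothetical toy)."""

    def __init__(self, evaluate: Callable[[Table], List[Any]]):
        self.evaluate = evaluate

    def __eq__(self, other: Any) -> "Expr":  # builds a new expression, no comparison yet
        return Expr(lambda table: [value == other for value in self.evaluate(table)])

    def __and__(self, other: "Expr") -> "Expr":
        # Element-wise logical AND of two boolean expressions.
        return Expr(
            lambda table: [
                a and b for a, b in zip(self.evaluate(table), other.evaluate(table))
            ]
        )


def col(name: str) -> Expr:
    """Refer to a column by name without naming the table it lives in."""
    return Expr(lambda table: table[name])


def select(table: Table, where: Expr) -> Table:
    """Keep the rows for which the expression evaluates to True."""
    mask = where.evaluate(table)
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in table.items()
    }


obs = {"cell_type": ["T cell", "B cell", "T cell"], "batch": ["1", "1", "2"]}

# The expression never mentions `obs`; it is bound to the table only in `select`.
t_cells_batch_1 = select(obs, (col("cell_type") == "T cell") & (col("batch") == "1"))
```

Compared with a lambda, the expression is an inspectable object rather than opaque code, so a library could in principle validate or optimize it before evaluating it.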
Rough sketch of a potential API:
def select(
    self,
    identifiers: Union[str, list[str]] = "*",
    *,
    obs: Optional[Idxer] = None,
    var: Optional[Idxer] = None,
    copy: bool = False,
):
    """
    Return a new AnnData with selected elements at selected indices.

    By default, does not copy data unless necessary.

    Usage
    -----
    >>> adata.select(
    ...     ["obsm/X_pca", "obs/cell_type"],
    ...     obs=po.col("cell_type") == "B Cell",
    ... )
    AnnData object with n_obs × n_vars = 342 × 13714
        obs: "cell_type"
        obsm: "X_pca"
    """
    ...