
add a mask argument

Open ivirshup opened this issue 2 years ago • 0 comments

I think we should introduce a standardized “mask” argument to scanpy functions. This would be a boolean array (or a reference to a boolean array in obs/var) which masks out certain data entries.

This can be thought of as a generalization of how highly variable genes are handled. As an example:

sc.pp.pca(adata, use_highly_variable=True)

Would be equivalent to:

sc.pp.pca(adata, mask="highly_variable")
# or, passing the boolean array directly (highly_variable is a per-gene column, so it lives in var)
sc.pp.pca(adata, mask=adata.var["highly_variable"])

One of the big advantages of making this more widespread is that tasks which previously required using .raw or creating new anndata objects will be much easier.

Some uses for this change:

Plotting

A big one is plotting. Right now if you want to show gene expression for a subset of cells, you have to manually work with the Matplotlib Axes:

ax = sc.pl.umap(pbmc, show=False)
sc.pl.umap(
    pbmc[pbmc.obs["louvain"].isin(['CD4 T cells', 'B cells', 'CD8 T cells',])],
    color="LDHB",
    ax=ax,
)

If a user could provide a mask, this could be reduced to a single call, and it would make plotting more than one value possible:

sc.pl.umap(
    pbmc,
    color=['LDHB', 'LYZ', 'CD79A'],
    mask=pbmc.obs["louvain"].isin(['CD4 T cells', 'B cells', 'CD8 T cells']),
)
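
To make the idea concrete, here is a minimal sketch of one way a plotting function could apply such a mask internally: values for masked-out cells are replaced by NaN, and the plotting layer then decides how to draw them (for example in a neutral color). The helper name `mask_color_values` is hypothetical and not part of scanpy; this is an assumption about a possible implementation, not how scanpy does it.

```python
import numpy as np


def mask_color_values(values, mask):
    """Hypothetical helper: blank out color values for masked-out cells.

    values: per-cell values (e.g. expression of one gene), length n_obs
    mask:   boolean array, True for cells whose values should be shown
    """
    values = np.asarray(values, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if values.shape != mask.shape:
        raise ValueError("values and mask must have the same shape")
    # Keep values where the mask is True, use NaN everywhere else.
    return np.where(mask, values, np.nan)


expression = [1.0, 2.0, 3.0]
show = [True, False, True]
colors = mask_color_values(expression, show)
```

All cells keep their coordinates, so the full embedding is still drawn; only the coloring is restricted to the selected cells.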

Other uses

This has come up before in a few contexts:

  • Performing normalization on just some variables: https://github.com/scverse/scanpy/issues/2142#issuecomment-1046729522
  • Selecting a subset of variables for DE tests: https://github.com/scverse/scanpy/issues/1744
    • See also: https://github.com/scverse/scanpy/issues/748
  • Changing use_raw: https://github.com/scverse/scanpy/issues/1798#issuecomment-819998988

Implementation

I think this could fit quite well into the sc.get getter/validation functions (https://github.com/scverse/scanpy/issues/828#issuecomment-560072919).
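
As a rough illustration of what such a validation function might look like, here is a minimal sketch that resolves a mask argument (None, a column name, or a boolean array) into a checked boolean array. The name `resolve_mask` and its signature are hypothetical, not an existing scanpy function; `annotations` stands in for adata.obs or adata.var, and a plain dict is used here so the example runs without anndata.

```python
import numpy as np


def resolve_mask(mask, annotations, n):
    """Hypothetical helper: turn a `mask` argument into a boolean array.

    mask:        None, the name of a boolean column, or a boolean array-like
    annotations: stand-in for adata.obs / adata.var (supports `in` and `[]`)
    n:           expected length along the masked axis
    """
    if mask is None:
        # No mask given: keep every entry.
        return np.ones(n, dtype=bool)
    if isinstance(mask, str):
        # A string is treated as the name of a boolean column.
        if mask not in annotations:
            raise KeyError(f"mask column {mask!r} not found")
        mask = annotations[mask]
    mask = np.asarray(mask)
    if mask.dtype != bool:
        raise TypeError(f"mask must be boolean, got dtype {mask.dtype}")
    if mask.shape != (n,):
        raise ValueError(f"mask must have shape ({n},), got {mask.shape}")
    return mask


# A dict standing in for adata.var with three genes.
var = {"highly_variable": np.array([True, False, True])}
by_name = resolve_mask("highly_variable", var, 3)
by_array = resolve_mask([True, True, False], var, 3)
no_mask = resolve_mask(None, var, 3)
```

Centralizing the checks in one place would let every function that accepts a mask report the same errors for a missing column, a non-boolean array, or a length mismatch.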

ivirshup, Apr 13 '22 11:04