add `mask_obs`/`mask_var` arguments where appropriate

Open ivirshup opened this issue 3 years ago • 4 comments

I think we should introduce a standardized “mask” argument to scanpy functions. This would be a boolean array (or a reference to a boolean array in obs/var) that masks out certain data entries.

This can be thought of as a generalization of how highly variable genes are handled. As an example:

sc.pp.pca(adata, use_highly_variable=True)

Would be equivalent to:

sc.pp.pca(adata, mask="highly_variable")
# or
sc.pp.pca(adata, mask=adata.var["highly_variable"])

One of the big advantages of making this more widespread is that tasks which previously required using .raw or creating new anndata objects will be much easier.
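
A minimal sketch of how such a mask argument might be resolved, to make the "array or reference to a column" idea concrete. `resolve_mask` is a hypothetical helper, not an existing scanpy function, and a plain pandas DataFrame stands in for `adata.var`:

```python
import numpy as np
import pandas as pd


def resolve_mask(mask, annotations: pd.DataFrame) -> np.ndarray:
    """Turn a column name or a boolean array into a validated boolean mask.

    `annotations` stands in for `adata.var` (or `adata.obs`).
    """
    if isinstance(mask, str):
        if mask not in annotations.columns:
            raise KeyError(f"no column named {mask!r} in the annotations")
        mask = annotations[mask]
    mask = np.asarray(mask)
    if mask.dtype != bool:
        raise TypeError(f"mask must be boolean, got dtype {mask.dtype}")
    if mask.shape != (len(annotations),):
        raise ValueError(
            f"mask has shape {mask.shape}, expected ({len(annotations)},)"
        )
    return mask


# Toy stand-in for adata.var with a "highly_variable" column.
var = pd.DataFrame(
    {"highly_variable": [True, False, True, False]},
    index=["LDHB", "LYZ", "CD79A", "CD3D"],
)

# Both spellings from the example above resolve to the same mask.
by_name = resolve_mask("highly_variable", var)
by_array = resolve_mask(var["highly_variable"], var)
assert (by_name == by_array).all()
```

Validating dtype and length in one place would let every function that accepts a mask fail in the same way on bad input.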

Some uses for this change:

Plotting

A big one is plotting. Right now if you want to show gene expression for a subset of cells, you have to manually work with the Matplotlib Axes:

ax = sc.pl.umap(pbmc, show=False)
sc.pl.umap(
    pbmc[pbmc.obs["louvain"].isin(['CD4 T cells', 'B cells', 'CD8 T cells',])],
    color="LDHB",
    ax=ax,
)

If a user could provide a mask, this could be reduced, and would make plotting more than one value possible:

sc.pl.umap(
    pbmc,
    color=['LDHB', 'LYZ', 'CD79A'],
    mask=pbmc.obs["louvain"].isin(['CD4 T cells', 'B cells', 'CD8 T cells',]),
)
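
One way a plotting function could honor the mask, sketched with NumPy and pandas only: every cell keeps its position in the embedding, but masked-out cells get NaN as their color value so the backend can draw them in a neutral color. The data below is made up, and this is an assumption about a possible implementation, not how scanpy does it:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for pbmc.obs["louvain"] and one gene's expression (values are made up).
louvain = pd.Series(
    ["CD4 T cells", "B cells", "Monocytes", "CD8 T cells", "NK cells"]
)
ldhb = np.array([2.1, 0.4, 0.0, 1.7, 0.9])

# Boolean mask over cells, built the same way as in the call above.
mask = louvain.isin(["CD4 T cells", "B cells", "CD8 T cells"]).to_numpy()

# Masked-out cells keep their coordinates but lose their color value.
color_values = np.where(mask, ldhb, np.nan)
```

Because the mask is independent of the plotted variable, the same array can be reused for every entry in `color`.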

Other uses

This has come up before in a few contexts:

  • Performing normalization on just some variables: https://github.com/scverse/scanpy/issues/2142#issuecomment-1046729522
  • Selecting a subset of variables for DE tests: https://github.com/scverse/scanpy/issues/1744
    • See also: https://github.com/scverse/scanpy/issues/748
  • Changing use_raw: https://github.com/scverse/scanpy/issues/1798#issuecomment-819998988

Implementation

I think this could fit quite well into the sc.get getter/validation functions (https://github.com/scverse/scanpy/issues/828#issuecomment-560072919).

ivirshup, Apr 13 '22 11:04

@Intron7 said he had experience with this and it’s a really good way to do things fast with dask etc.

  • works well in dask
  • scale is a good example of how to do it
  • can’t be done inside 3rd-party code such as PCA
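
For the third-party case, the fallback is to subset, call the external routine, and write the result back into a full-size array. A rough sketch, with `np.linalg.svd` standing in for any external PCA implementation and zero loadings for masked-out genes chosen as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))  # cells x genes, random toy data
mask_var = np.array([True, False, True, True, False])
n_comps = 2

# The external routine only sees the subset, so a copy is unavoidable here.
X_sub = X[:, mask_var]
X_sub = X_sub - X_sub.mean(axis=0)

# np.linalg.svd stands in for any third-party PCA implementation.
_, _, vt = np.linalg.svd(X_sub, full_matrices=False)

# Reintegrate: loadings for masked-out genes are left at zero.
loadings = np.zeros((X.shape[1], n_comps))
loadings[mask_var] = vt[:n_comps].T
```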

flying-sheep, Jul 02 '24 12:07

@Intron7 still waiting for your comment here!

flying-sheep, Aug 27 '24 08:08

Yes, we already have a good mask for sparse scaling. Boolean arrays are very effective for indicating where computations should be performed, as they eliminate the need for copying and reintegration.
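
To illustrate the idea on a dense array (a sketch, not scanpy's actual implementation): NumPy ufuncs accept `where=` together with `out=`, so only the selected rows are written and nothing has to be subset and merged back. `scale_masked` is a hypothetical name:

```python
import numpy as np


def scale_masked(X: np.ndarray, mask_obs: np.ndarray) -> None:
    """Scale each column to zero mean / unit variance using only the rows in mask_obs.

    X must be a float array; it is modified in place and unselected rows are untouched.
    """
    rows = mask_obs[:, None]  # broadcast the row mask across all columns
    n = mask_obs.sum()
    mean = np.sum(X, axis=0, where=rows) / n
    # Sample variance (ddof=1) is a choice made for this sketch.
    # Note that (X - mean) ** 2 is still a full-size temporary.
    var = np.sum((X - mean) ** 2, axis=0, where=rows) / (n - 1)
    std = np.sqrt(var)
    std[std == 0] = 1.0  # avoid dividing constant columns by zero
    # `where=` makes the ufuncs write only into the selected rows of `out`.
    np.subtract(X, mean, out=X, where=rows)
    np.divide(X, std, out=X, where=rows)


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -5.0]])
mask_obs = np.array([True, True, True, False])
scale_masked(X, mask_obs)  # the last row is left exactly as it was
```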

One clear example is the tl.score_genes function: using boolean masks there for the nanmean is a lot more efficient, but less Pythonic.
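
Roughly what a boolean-mask nanmean can look like, as a sketch under my own assumptions rather than the code in tl.score_genes. `masked_nanmean` is a hypothetical name:

```python
import numpy as np


def masked_nanmean(X: np.ndarray, mask_var: np.ndarray) -> np.ndarray:
    """Per-cell mean over the genes selected by mask_var, ignoring NaNs."""
    # An entry counts only if its gene is selected and its value is not NaN.
    valid = mask_var[None, :] & ~np.isnan(X)
    totals = np.where(valid, X, 0.0).sum(axis=1)
    counts = valid.sum(axis=1)
    # Cells without any valid entry get NaN instead of a division by zero.
    out = np.full(X.shape[0], np.nan)
    return np.divide(totals, counts, out=out, where=counts > 0)


X = np.array(
    [
        [1.0, np.nan, 3.0, 100.0],
        [2.0, 4.0, np.nan, 100.0],
    ]
)
mask_var = np.array([True, True, True, False])

scores = masked_nanmean(X, mask_var)
# Same result as subsetting first, without building X[:, mask_var].
assert np.allclose(scores, np.nanmean(X[:, mask_var], axis=1))
```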

Intron7, Aug 27 '24 08:08

Where does a mask make sense:

  • where is a mask better than just subsetting and then calling the function on the subset?
  • where is a use case for doing it on a subset (e.g. probably none for qc, right?)
  • things that create an obsm/obsp entry would result in empty rows, but would there be a use case?

| Function | Should have mask | Has mask | Notes |
|---|---|---|---|
| pp.calculate_qc_metrics | N | N | |
| pp.filter_cells/pp.filter_genes | N | N | returns mask |
| pp.highly_variable_genes | N | N | returns mask |
| pp.log1p | Maybe | N | because scale has it? |
| pp.normalize_total | Maybe | N | because scale has it? |
| pp.regress_out | N | N | |
| pp.scale | Y | Y | |
| pp.subsample | Y | N | maybe as weighting[^1]; also add axis arg |
| pp.downsample_counts | N | N | |
| pp.recipe_* | N | N | |
| pp.combat | Y | N | |
| pp.scrublet/pp.scrublet_* | Maybe | N | maybe gene space |
| pp.neighbors | Maybe | N | creates obsp entry; in relation to subset, like .fit(X).transform(Y) |
| pp.pca | Y | Y | has mask_var |
| tl.tsne/tl.umap/tl.diffmap/tl.draw_graph | Maybe | N | creates obsm entry; maybe as data source instead of n_pcs |
| tl.embedding_density | N | N | |
| tl.leiden/tl.louvain | Y | Y | as restrict_to[^2] |
| tl.dendrogram | Y | N | probably useful to make dendrograms for multiple subsets? |
| tl.dpt | Maybe | N | Ask people? |
| tl.paga | N | N | |
| tl.ingest | Maybe | N | might be useful to use only part as reference? maybe not. |
| tl.rank_genes_groups | N | N | groups+reference fulfills that purpose |
| tl.filter_rank_genes_groups | N | N | creates a mask |
| tl.marker_gene_overlap | N | N | |
| tl.score_genes | N | N | uses it internally, so probably don’t expose but refactor to use it[^3] |
| tl.score_genes_cell_cycle | N | N | |
| get.aggregate | Y | Y | |

[^1]: column types: bool: subset; numeric: rel. weight per observation; cat: biased sampling

[^2]: unify restrict_to and mask?

[^3]: todo: refactor score_genes
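
A possible reading of footnote 1 for pp.subsample, sketched with a hypothetical `subsample_indices` helper: a boolean column restricts the candidates, a numeric column acts as relative sampling weights. The categorical case is left out because the footnote doesn't spell out what "biased sampling" should mean:

```python
import numpy as np
import pandas as pd


def subsample_indices(column: pd.Series, n: int, seed: int = 0) -> np.ndarray:
    """Pick n observation indices, interpreting `column` by its dtype."""
    rng = np.random.default_rng(seed)
    # Check bool first: pandas also reports bool columns as numeric.
    if pd.api.types.is_bool_dtype(column):
        candidates = np.flatnonzero(column.to_numpy())
        return np.sort(rng.choice(candidates, size=n, replace=False))
    if pd.api.types.is_numeric_dtype(column):
        weights = column.to_numpy(dtype=float)
        p = weights / weights.sum()  # relative weights -> probabilities
        return np.sort(rng.choice(len(column), size=n, replace=False, p=p))
    raise NotImplementedError("categorical columns are not covered by this sketch")


keep = pd.Series([True, False, True, True, False])
weights = pd.Series([0.0, 1.0, 0.0, 1.0, 1.0])

from_mask = subsample_indices(keep, n=2)        # drawn from rows 0, 2 and 3 only
from_weights = subsample_indices(weights, n=3)  # zero-weight rows are never drawn
```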

flying-sheep, Nov 12 '24 11:11