scirpy icon indicating copy to clipboard operation
scirpy copied to clipboard

Easy-access getter functions

Open grst opened this issue 5 years ago • 6 comments

With the implementation of the new datastructure (#327), it becomes rather tricky to get information from the airr rearrangment schema (e.g. what is the "c_call" of the primary "VJ" chain?).

Previously this was possible simply with

adata.obs["IR_VJ_1_c_call"]

With the new datastructure

adata[:, "c_call"].X

just yields an awkward array with a variable number of chains per cell. The information which chain is VJ/VDJ and primary or secondary is hidden away in adata.obsm.

This motivates the implementation of long-proposed easy-access getter/setter functions. At the very least to

  • retrieve AIRR rearrangement variables for a certain chain.

But possibly also with convenience functions, e.g.

  • to retrieve the most abundant categories (previously discussed in #51).

The latter is of less importance, but the interface needs to be designed jointly, therefore this is also a topic in this issue.


To get AIRR data, we need something like

ir.get(adata, "locus", "VJ_1") -> pd.Series

We need it regulary to get the top n of a category, e.g. v-gene or clonotype.

This can be achieved in a pandas onliner, if one knows how to do it...

# This is probably hacky, we might think about a better way, but we need the most abundant clonotypes

top_clonotypes = adata.obs.clonotype.value_counts()[:8].index.tolist() # A better way might be needed especailly to take normalization into account
top_vgenes = adata.obs.TRB_1_v_gene.value_counts()[:8].index.tolist()

It would be more user-friendly to have a convenience function to this, for instance:

ir.tl.top_n(col="clonotype", n=10)

Use it for plotting:

sc.pl.umap(adata, color="clonotype", groups=ir.tl.top_n("clonotype", 10))

To discuss

  • [ ] retrieve multiple columns at once / vectorization?
  • [ ] what about "extra" chains?
  • [x] plotting

grst avatar Sep 22 '20 09:09 grst

Suggestion:

Implement a "getter" class that supports method chaining.

putative API:

# get a series (n_obs, )
ir.get(adata).airr("locus", "VJ_1")

# get most frequent C gene
ir.get(adata).airr("c_call", "VJ_1").most_frequent(n=10)

# get most frequent clonotype
ir.get(adata).obs("clonotype").most_frequent(n=10)

# get top 3 clonotypes with the highest clonotype modularity score
ir.get(adata).obs("cc_aa_alignment").top_n(n=3, by="clonotype_modularity", aggr="unique")

grst avatar Aug 18 '22 14:08 grst

Issues:

  • the top_n doesn't generalize well. What if you want an airr variable sorted by clonotype modularity? Or a column from another mudata modality? In most cases you'd be down to plain pandas again anyway, or the complexity of those function would increase drastically.

  • most_frequent is quite easy to get with plain pandas: series.value_counts()[:n].index.tolist(). Maybe this is an over-abstraction?

  • should getting AIRR variables be simpler (as this is the most common use-case)?. From obs you can retrieve stuff with direct indexing...

ir.get(adata, "locus", "VJ_1")

grst avatar Aug 18 '22 14:08 grst

For the convenience functions

ir.get.most_frequent(adata.obs["clonotype"])

is more flexible, as it can be applied to any pandas series. Also one that was retrieved from the airr data, e.g

ir.get.most_frequent(ir.get.airr(adata, "c_call", "VJ_1")) # or
ir.get.airr(adata, "c_call", "VJ_1").pipe(ir.get.most_frequent, n=5)

grst avatar Aug 18 '22 14:08 grst

Should get be a module, or a function?

I think ir.get as a module is flexible and future-proof. There might be more functions added in the future. Even if there will be no additional functions

ir.get.airr()

is not too hard to type and it is pretty explicit about what it's doing.

grst avatar Aug 18 '22 14:08 grst

For easy plotting with scanpy functions, we could do one of the following.

  1. Nothing. The user is free to assign a variable of interest to a column
adata.obs["locus_vj1"] = ir.get.airr(adata, "locus", "VJ_1")
adata.obs = adata.obs.assign(locus_vj = lambda x: ir.get.airr(x, "locus", "VJ_1")
  1. Wrapper function that redirects to scanpy plotting functions
ir.pl.airr("umap", adata, "locus", "VJ_1")
ir.pl.airr(sc.pl.umap, adata, "locus", "VJ_1")
ir.pl.airr("umap", [("locus", "VJ_1"), ("locus", "VJ_2"), "some_other_var"])

# what is with other, non-embedding plotting functions? 
  1. Context manager
# the context manager temporarily adds columns and removes them again afterwards
with ir.get.anndata(adata, locus_vj=("locus", "vj_1")) as tmp_ad:
    sc.pl.umap(tmp_ad, color="locus_vj")

plot_dict = {
   "locus_vj1": ("locus", "vj_1"),
   "locus_vj2": ("locus", "vj_2"),
   "c_call_vj1": ("c_call", "vj_1"),
}
with ir.get.anndata(adata, **plot_dict) as tmp_ad:
   sc.pl.umap(tmp_ad, color=plot_dict.keys())

grst avatar Aug 19 '22 07:08 grst

I like options (1) and (3), and they are not mutually exclusive.

(2) seems like it's to inflexible and leads to too many special cases.

grst avatar Aug 19 '22 07:08 grst

The context manager and a get function are implemented in #356.

I decided against a top_n or most_frequent function as the same can be achieved quite easily with plain pandas/numpy.

grst avatar Mar 21 '23 09:03 grst