scirpy Easy-access getter functions

With the implementation of the new datastructure (#327), it becomes rather tricky to get information from the airr rearrangment schema (e.g. what is the "c_call" of the primary "VJ" chain?).

Previously this was possible simply with

adata.obs["IR_VJ_1_c_call"]

With the new datastructure

adata[:, "c_call"].X

just yields an awkward array with a variable number of chains per cell. The information which chain is VJ/VDJ and primary or secondary is hidden away in adata.obsm.

This motivates the implementation of long-proposed easy-access getter/setter functions. At the very least to

retrieve AIRR rearrangement variables for a certain chain.

But possibly also with convenience functions, e.g.

to retrieve the most abundant categories (previously discussed in #51).

The latter is of less importance, but the interface needs to be designed jointly, therefore this is also a topic in this issue.

To get AIRR data, we need something like

ir.get(adata, "locus", "VJ_1") -> pd.Series

We need it regulary to get the top n of a category, e.g. v-gene or clonotype.

This can be achieved in a pandas onliner, if one knows how to do it...

# This is probably hacky, we might think about a better way, but we need the most abundant clonotypes

top_clonotypes = adata.obs.clonotype.value_counts()[:8].index.tolist() # A better way might be needed especailly to take normalization into account
top_vgenes = adata.obs.TRB_1_v_gene.value_counts()[:8].index.tolist()

It would be more user-friendly to have a convenience function to this, for instance:

ir.tl.top_n(col="clonotype", n=10)

Use it for plotting:

sc.pl.umap(adata, color="clonotype", groups=ir.tl.top_n("clonotype", 10))

To discuss

[ ] retrieve multiple columns at once / vectorization?
[ ] what about "extra" chains?
[x] plotting

Sep 22 '20 09:09 grst

Suggestion:

Implement a "getter" class that supports method chaining.

putative API:

# get a series (n_obs, )
ir.get(adata).airr("locus", "VJ_1")

# get most frequent C gene
ir.get(adata).airr("c_call", "VJ_1").most_frequent(n=10)

# get most frequent clonotype
ir.get(adata).obs("clonotype").most_frequent(n=10)

# get top 3 clonotypes with the highest clonotype modularity score
ir.get(adata).obs("cc_aa_alignment").top_n(n=3, by="clonotype_modularity", aggr="unique")

Aug 18 '22 14:08 grst

Issues:

the top_n doesn't generalize well. What if you want an airr variable sorted by clonotype modularity? Or a column from another mudata modality? In most cases you'd be down to plain pandas again anyway, or the complexity of those function would increase drastically.
most_frequent is quite easy to get with plain pandas: series.value_counts()[:n].index.tolist(). Maybe this is an over-abstraction?
should getting AIRR variables be simpler (as this is the most common use-case)?. From obs you can retrieve stuff with direct indexing...

ir.get(adata, "locus", "VJ_1")

Aug 18 '22 14:08 grst

For the convenience functions

ir.get.most_frequent(adata.obs["clonotype"])

is more flexible, as it can be applied to any pandas series. Also one that was retrieved from the airr data, e.g

ir.get.most_frequent(ir.get.airr(adata, "c_call", "VJ_1")) # or
ir.get.airr(adata, "c_call", "VJ_1").pipe(ir.get.most_frequent, n=5)

Aug 18 '22 14:08 grst

Should get be a module, or a function?

I think ir.get as a module is flexible and future-proof. There might be more functions added in the future. Even if there will be no additional functions

ir.get.airr()

is not too hard to type and it is pretty explicit about what it's doing.

Aug 18 '22 14:08 grst

For easy plotting with scanpy functions, we could do one of the following.

Nothing. The user is free to assign a variable of interest to a column

adata.obs["locus_vj1"] = ir.get.airr(adata, "locus", "VJ_1")
adata.obs = adata.obs.assign(locus_vj = lambda x: ir.get.airr(x, "locus", "VJ_1")

Wrapper function that redirects to scanpy plotting functions

ir.pl.airr("umap", adata, "locus", "VJ_1")
ir.pl.airr(sc.pl.umap, adata, "locus", "VJ_1")
ir.pl.airr("umap", [("locus", "VJ_1"), ("locus", "VJ_2"), "some_other_var"])

# what is with other, non-embedding plotting functions?

Context manager

# the context manager temporarily adds columns and removes them again afterwards
with ir.get.anndata(adata, locus_vj=("locus", "vj_1")) as tmp_ad:
    sc.pl.umap(tmp_ad, color="locus_vj")

plot_dict = {
   "locus_vj1": ("locus", "vj_1"),
   "locus_vj2": ("locus", "vj_2"),
   "c_call_vj1": ("c_call", "vj_1"),
}
with ir.get.anndata(adata, **plot_dict) as tmp_ad:
   sc.pl.umap(tmp_ad, color=plot_dict.keys())

Aug 19 '22 07:08 grst

I like options (1) and (3), and they are not mutually exclusive.

(2) seems like it's to inflexible and leads to too many special cases.

Aug 19 '22 07:08 grst

The context manager and a get function are implemented in #356.

I decided against a top_n or most_frequent function as the same can be achieved quite easily with plain pandas/numpy.

Mar 21 '23 09:03 grst

scirpy scirpy copied to clipboard

Easy-access getter functions

scirpy
scirpy copied to clipboard