scirpy
scirpy copied to clipboard
Easy-access getter functions
With the implementation of the new datastructure (#327), it becomes rather tricky to get information from the airr rearrangment schema (e.g. what is the "c_call" of the primary "VJ" chain?).
Previously this was possible simply with
adata.obs["IR_VJ_1_c_call"]
With the new datastructure
adata[:, "c_call"].X
just yields an awkward array with a variable number of chains per cell. The information which
chain is VJ/VDJ and primary or secondary is hidden away in adata.obsm.
This motivates the implementation of long-proposed easy-access getter/setter functions. At the very least to
- retrieve AIRR rearrangement variables for a certain chain.
But possibly also with convenience functions, e.g.
- to retrieve the most abundant categories (previously discussed in #51).
The latter is of less importance, but the interface needs to be designed jointly, therefore this is also a topic in this issue.
To get AIRR data, we need something like
ir.get(adata, "locus", "VJ_1") -> pd.Series
We need it regulary to get the top n of a category, e.g. v-gene or clonotype.
This can be achieved in a pandas onliner, if one knows how to do it...
# This is probably hacky, we might think about a better way, but we need the most abundant clonotypes
top_clonotypes = adata.obs.clonotype.value_counts()[:8].index.tolist() # A better way might be needed especailly to take normalization into account
top_vgenes = adata.obs.TRB_1_v_gene.value_counts()[:8].index.tolist()
It would be more user-friendly to have a convenience function to this, for instance:
ir.tl.top_n(col="clonotype", n=10)
Use it for plotting:
sc.pl.umap(adata, color="clonotype", groups=ir.tl.top_n("clonotype", 10))
To discuss
- [ ] retrieve multiple columns at once / vectorization?
- [ ] what about "extra" chains?
- [x] plotting
Suggestion:
Implement a "getter" class that supports method chaining.
putative API:
# get a series (n_obs, )
ir.get(adata).airr("locus", "VJ_1")
# get most frequent C gene
ir.get(adata).airr("c_call", "VJ_1").most_frequent(n=10)
# get most frequent clonotype
ir.get(adata).obs("clonotype").most_frequent(n=10)
# get top 3 clonotypes with the highest clonotype modularity score
ir.get(adata).obs("cc_aa_alignment").top_n(n=3, by="clonotype_modularity", aggr="unique")
Issues:
-
the
top_ndoesn't generalize well. What if you want an airr variable sorted by clonotype modularity? Or a column from another mudata modality? In most cases you'd be down to plain pandas again anyway, or the complexity of those function would increase drastically. -
most_frequentis quite easy to get with plain pandas:series.value_counts()[:n].index.tolist(). Maybe this is an over-abstraction? -
should getting AIRR variables be simpler (as this is the most common use-case)?. From obs you can retrieve stuff with direct indexing...
ir.get(adata, "locus", "VJ_1")
For the convenience functions
ir.get.most_frequent(adata.obs["clonotype"])
is more flexible, as it can be applied to any pandas series. Also one that was retrieved from the airr data, e.g
ir.get.most_frequent(ir.get.airr(adata, "c_call", "VJ_1")) # or
ir.get.airr(adata, "c_call", "VJ_1").pipe(ir.get.most_frequent, n=5)
Should get be a module, or a function?
I think ir.get as a module is flexible and future-proof. There might be more functions added in the future.
Even if there will be no additional functions
ir.get.airr()
is not too hard to type and it is pretty explicit about what it's doing.
For easy plotting with scanpy functions, we could do one of the following.
- Nothing. The user is free to assign a variable of interest to a column
adata.obs["locus_vj1"] = ir.get.airr(adata, "locus", "VJ_1")
adata.obs = adata.obs.assign(locus_vj = lambda x: ir.get.airr(x, "locus", "VJ_1")
- Wrapper function that redirects to scanpy plotting functions
ir.pl.airr("umap", adata, "locus", "VJ_1")
ir.pl.airr(sc.pl.umap, adata, "locus", "VJ_1")
ir.pl.airr("umap", [("locus", "VJ_1"), ("locus", "VJ_2"), "some_other_var"])
# what is with other, non-embedding plotting functions?
- Context manager
# the context manager temporarily adds columns and removes them again afterwards
with ir.get.anndata(adata, locus_vj=("locus", "vj_1")) as tmp_ad:
sc.pl.umap(tmp_ad, color="locus_vj")
plot_dict = {
"locus_vj1": ("locus", "vj_1"),
"locus_vj2": ("locus", "vj_2"),
"c_call_vj1": ("c_call", "vj_1"),
}
with ir.get.anndata(adata, **plot_dict) as tmp_ad:
sc.pl.umap(tmp_ad, color=plot_dict.keys())
I like options (1) and (3), and they are not mutually exclusive.
(2) seems like it's to inflexible and leads to too many special cases.
The context manager and a get function are implemented in #356.
I decided against a top_n or most_frequent function as the same can be achieved quite easily with plain pandas/numpy.