scirpy icon indicating copy to clipboard operation
scirpy copied to clipboard

scverse datastructure for AIRR data

Open grst opened this issue 3 years ago • 41 comments

Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387.

The challenge with scAIRR data is that

  • 1 cell can have n chains. Up to four of them are biologically meaningful but there could be more for technical reasons.
  • Each chain has a lot of fields. See the AIRR rearrangement standard.

The current pragmatic solution is to store all fields in adata.obs.

  • All columns from the airr rearrangement schema are repeated four times
  • Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions.
  • The downside is that there can easily be 100+ columns in adata.obs. Also serializing excess chains is not really elegant.
  • The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc.

New options are

  • mudata. AIRR data could be saved as a separate modality. Even if we keep the current reprepsentation of a wide data frame, it would at least not clutter the rest of adata.obs.
  • awkward array support. Allows storing an arbitrary number of values per row. See https://github.com/theislab/anndata/pull/647

The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.

  • [x] what's the state of the AIRR single-cell schema? And what are its advantates over the rearrangement schema.

grst avatar Mar 09 '22 11:03 grst

Hi All, I am active in the AIRR Standards group, can try and answer questions. We are looking at integrating scirpy into the iReceptor Gateway (gateway.ireceptor.org) to do analysis on AIRR Cell data - so have a vested interest... 8-)

The AIRR Cell schema was just announce at the AIRR meeting this week (https://www.antibodysociety.org/the-airr-community/meetings/airr-community-meeting-vi-exploring-new-frontiers/), with a tagged (v1.4) release expected on github in the next few weeks. This is an "Experimental" schema, meaning that we think it is pretty stable, but will probably need some minor changes before it is finalized, probably in an AIRR Standard v2.0 release sometime in the "near" future.

bcorrie avatar May 20 '22 17:05 bcorrie

The Cell schema is intended to represent a Cell as an object. It is a light weight object, and contains a cell_id that is used to cross reference information in other collections such as Rearrangement and CellExpression. So if you have a cell_id from a Cell you can find all of the Rearrangements for the paired chains associated with the cell in an AIRR TSV file by searching for that same cell_id. Similarly you can find all of the CellExpression information by searching a CellExpression file for that same cell_id.

bcorrie avatar May 20 '22 17:05 bcorrie

Here are the current docs on the master branch, which will be tagged with a v1.4 release soon.

https://docs.airr-community.org/en/latest/datarep/cell.html

bcorrie avatar May 20 '22 17:05 bcorrie

One of the outstanding things we need to resolve in the AIRR Standard is what if any "efficient" file format should be supported. Currently the default standard for all AIRR objects is a JSON representation of the standard objects. For Rearrangement there is a defined TSV file. Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

We have an about to be release version of our AIRR repository (https://github.com/sfu-ireceptor/turnkey-service-php/tree/production-v4) that can store Cell and CellExpression data, but currently what you get out is a very big, verbose JSON file. It works, but...

bcorrie avatar May 20 '22 18:05 bcorrie

BTW, I think it would be interesting to decorate a cell's adata.obs with AIRR Repertoire metadata. That is, it would be quite easy (I think) to decorate the adata.obs with all sorts of subject/sample metadata for each cell such as subject.disease.disease_daignosis, subject.age_min, subject.sex, sample.tissue etc so that one can use these criteria for coloring, group, etc as well. It seems to me that these fields in adata.obs would be more valuable than the most of the Rearrangement fields. I could see wanting some Rearrangement fields, maybe VDJ calls, CDR3 length, locus for the paired chains, but it seems like most of the Rearrangement fields would not be very interesting from a slicing/coloring/visualization point of view.

bcorrie avatar May 20 '22 18:05 bcorrie

Hi @bcorrie,

thanks for reaching out, I'm definitely interested in working with the AIRR community to make this as interoperable as possible. Your description of the Cell schema is really helpful, before I wasn't sure it was to replace the Rearrangement schema, but now I understood it's built on top of it.

Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell objects that have a cell_id, a list of Rearrangement entries and arbitrary cell-level attributes that are added as columns to adata.obs. Once the Cell schema is final, I could simply adjust my AirrCell implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.

The internal representation is one side of the medal, the other is representing the AIRR data as part of either an AnnData or MuData container, which allows to link the data to other modalities, allows efficient file storage in h5ad/h5mu and enables the interaction with scanpy or other scverse tools. The current way is to dump everything in .obs which is not ideal.


Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

Well, how about AnnData? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?

Note that reading/writing gene expression formats would be out-of-scope for scirpy, this should be handled by scanpy or AnnData, or possibly by a future scverse IO package.

It's also worth noting here that the CZI is currently looking into standardizing Feature Observation Matrices (FOM) for single-cell data. Maybe it would be worth for the AIRR community and them getting in touch for talking about a AIRR-specific extension. Their public presence is still very sparse, but maybe @ivirshup can tell more about this initiative and make the contact.

Best, Gregor

grst avatar May 21 '22 14:05 grst

Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

Well, how about AnnData? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?

Yes, AnnData seems logical and along the lines of what I was thinking... Definitely do not want YAF (yet another format) 8-)

Like you, this is likely out of scope for the AIRR Community - at least in the sense that the obvious thing for us to do would be to have the AIRR python library (https://github.com/airr-community/airr-standards/tree/master/lang/python) use scanpy to read and write these formats.

The CZI FOM is interesting as well...

bcorrie avatar May 24 '22 23:05 bcorrie

Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell objects that have a cell_id, a list of Rearrangement entries and arbitrary cell-level attributes that are added as columns to adata.obs. Once the Cell schema is final, I could simply adjust my AirrCell implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.

Very nice - one would hope that these would be closely aligned, and hopefully we can now make sure they are completely aligned 8-)

bcorrie avatar May 24 '22 23:05 bcorrie

Draft proposal for an awkward-array-based data structure

I've been playing around with the draft implementation of awkward array support in AnnData (https://github.com/scverse/anndata/pull/647). This allows to store in .obsm (or in the future maybe also in .obs, see the comments there) ragged arrays of mixed types for each cell -- exactely what we need to represent the 1:n relationship of cells with chains.

An "awkward array" can be directly created from a list of AirrRearrangement dictionaries, e.g.

import awkward as ak
import scirpy as ir
airr_cells = ir.io.to_airr_cells(adata)
chains_as_awkarr = ak.Array([cell.chains for cell in airr_cells])
# an array of length 3000 (= number of cells), variable width (= number of chains) and 
# AIRR Rearrangement fields on the 3rd dimension
> chains_as_awkarr
<Array [[{c_call: 'TRBC2', ... v_cigar: None}]] type='3000 * var * {"c_call": op...'>

To get, e.g. all junction_aa sequences of the first chain of each cell, this awkward array can be sliced:

> chains_as_awkarr[:, 0, "junction_aa"]
<Array ['CASSLMRLAGDTQYF', ... 'CATEEYGNKLVF'] type='3000 * string'>

It is therefore straightforward and computationally efficient to retrieve individual elements.

In summary: We simply take AIRR compliant data, put it in an efficient data structure and store it an AnnData. Not a lot new to define from our side

Caveats and possible solutions

1. Primary and Secondary chains

The list of chains does not know the concept of primary and secondary chains. This can be solved by adding an index array:

# the index arrays point to the element in the list of chains
primary_vj = [0, 0, 1, 2, 0, 1, ...]
primary_vdj = [1, 3, 5, 3, 2, ...]

The awkward array can then be sliced

> chains_as_awkarr[:, primary_vj, "junction_aa"]

to retrieve e,g. the junction AA sequences for the primary VJ chains.

2. Getting values for plotting

The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.

If they are, here are some solutions to do so

  • implement a ir.get function to retrieve data from the awkward array (see also #184). The result can be re-assigned to adata.obs for plotting and/or grouping
  • re-implement all scanpy plotting functions as e.g. ir.pl.umap, where color is retrieved from the awkward array instead of .obs.

3. Where to store the data

The current draft implementation only supports awkward arrays in .obsm. IMO it would be completely fine to have an .obsm["airr"]. In the future, it may be possible to have a column .obs["airr"].

It was also previously suggested to use mudata as AIRR data can be considered a modality. Since we don't have any data for .X and .var, I don't see a clear advantage of using mudata, though.


I'm looking forward to feedback, especially from people that develop ecosystem packages for AIRR analysis (@zktuong, ..., ?)

grst avatar Jun 20 '22 18:06 grst

@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy and an AIRR representation. Need to hash that out in more detail, but roughly

  • Cell metadata -> adata.obs
  • Rearrangement chains -> adata.obsm["airr"] (as awkward array)
  • CellExpression -> adata.X or mudata (in case of CITE-seq)

grst avatar Jun 20 '22 18:06 grst

The CZI FOM is interesting as well...

@bcorrie from discussion with the SOMA team, the intent was to keep anndata and SOMA (formerly FOM) pretty consistent with each other. Ideally they are the same thing eventually.

Though I haven't heard much on SOMA development recently, so hopefully we are still on the same page here.

Where to store the data

@grst, I've recently heard people wanting to treat their re-arrangment data as a separate modality. E.g. have an AnnData with just re-arrangement info which could then be merged with other modalities. I think the main point here was to avoid having to arbitrarily associate AIRR data with another modality.

I think @bio-la and @crichgriffin could elaborate on this.

ivirshup avatar Jun 20 '22 18:06 ivirshup

I've recently heard people wanting to treat their re-arrangment data as a separate modality.

Wouldn't it anyway work with both anndata and mudata to use .obsm["airr"].

# anndata
ir.tl.something(adata)
# anndata, override obsm_key
ir.tl.something(adata, airr_key="ir")
# mudata
ir.tl.something(mdata.mod["AIRR"])

Or would you prefer something like

# mudata only, use `.mod["AIRR"]` by default
ir.tl.something(mdata)
# ... or override default modality
ir.tl.something(mdata, mod="myairr")

grst avatar Jun 21 '22 12:06 grst

I don't remember the details, but one of the issues was with subsetting. Wanting to filter AIRR obs, but not the RNA.

I think it would work with AnnData right now, I think the issue was more about want to use scirpy with a MuData where RNA and AIRR were in separate modalities? I think this sorta worked, but there were issues with mudata operations.

ivirshup avatar Jun 21 '22 14:06 ivirshup

In the context of multimodal data and using MuData, I like the idea of keeping the AIRR data separate from any specific modality. My current workflow is to store the scirpy anndata object as a separate modality in muon, where I also have rna and adt data, as separate modalities. But I agree that it doesn't make a lot of sense, since the scirpy anndata object is mostly empty.

I agree with @grst that storing airr data could potentially work in both AnnData and MuData by using .obsm["airr"] except (iirc) in AnnData .obsm row dimensions match .obs row dimensions, and I assume the same applies to MuData. What would you do in the instances of cells not having AIRR data? E.g. in PBMC data, monocytes would not have airr data, but you would want to keep them in your AnnData/MuData.

crichgriffin avatar Jun 21 '22 15:06 crichgriffin

What would you do in the instances of cells not having AIRR data?

fill the corresponding rows with NaNs? This is how scirpy also handles this currently. Although this is a good point in favor of mudata. It would also make ir.pp.merge_with_ir superfluous.

I think it should be easy to support both. Just need to think what to advertise as default in the tutorial.

grst avatar Jun 21 '22 16:06 grst

all of this sounds awesome. Just have some naive question about how well/fast does it deal with situation when there's >100k+ cells? would chains_as_awkarr take a while to retrieve the entries?

zktuong avatar Jun 24 '22 21:06 zktuong

That's a fair point, I need to try this out, but it should be very fast. Awkward array claims to be similarly scalable as numpy arrays.

grst avatar Jun 25 '22 11:06 grst

Hi All, sorry for the lack of input - both fighting "the big C" and travelling to meetings (I don't recommend trying to both at the same time) the last several weeks.

I am not an AnnData expert - yet - but my observations thus far.

We are experimenting with Conga and it annotates each AnnData cell with heavy and light chain VDJ/CDR3 in the .obs of the object. Other tools like CellTypist also populate the .obs with other data.

This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice. It seems to make sense to me that the community might encourage this???

bcorrie avatar Jul 02 '22 11:07 bcorrie

2. Getting values for plotting

The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.

Yes, it occurred to me that plotting might be challenging using obsm. I personally think this would be quite important.

I would anticipate using the obsm['airr'] to store not only rearrangement annotations, but also what we call repertoire annotations (by repertoire AIRR means study/subject/sample metadata). For example, it is quite easy for us to generate a pool of cell data from many subjects in a study, possibly across disease conditions (healthy, mild covid, severe covid). We would pool that data and create a single h5ad file. We would want the 'disease_state' and 'subject_id' to be in the AIRR obsm object so we could visualize cells based on these fields. Same for the AIRR 'tissue' field and a range of others potentially.

It seems like a bad idea to have this difficult.

bcorrie avatar Jul 02 '22 12:07 bcorrie

A quick example - this is from a combined Conga/CellTypist analysis, with the .obs annotated with Conga and Cell Typist fields. This is trivial to visualize VDJ annotations per cell next to CellTypist majortiy voting - at the same time - because the columns are added to .obs. If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...

image

'vb', 'jb', and 'va' are from Conga, while 'majority_voting' is from CellTypist.

I just threw these data sets together today to try and explore this data for our early experiments with CellTypist. So please be patient with the data oddities with TR call against projected B-cells. I have yet to validate anything from this image in terms of it making sense - and any and all errors are mine 8-)

The main point is that these experimentations are easy because the visualizations are easy.

bcorrie avatar Jul 02 '22 12:07 bcorrie

@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy and an AIRR representation. Need to hash that out in more detail, but roughly

  • Cell metadata -> adata.obs
  • Rearrangement chains -> adata.obsm["airr"] (as awkward array)
  • CellExpression -> adata.X or mudata (in case of CITE-seq)

This seems to make sense to me as a first pass, although I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.

One could have adata.obsm['airr_rearrangement'] and adata.obsm['airr_repertoire'] - but maybe too clunky? Another related link that cells potentially have is to "Receptors" - which are known B/T cell receptors that have a specific antigen/epitope specificity (think of the B and T cell specificity info in IEDB). Again, maybe worth capturing.

bcorrie avatar Jul 02 '22 13:07 bcorrie

For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

If I'm understanding the awkward array correctly, and I may not be, this would be the same as using the "record" array implementation to populate .X with an awkward array in a mudata object. A pandas DataFrame with multi-indexing seems like the most natural fit for working cellular Rearrangement data (eg, key on something like ['cell_id', 'locus', 'sequence_id']) and it looks like the conversion from awkward array to multi-indexed DataFrame is trivial. Unfortunately, my python is a bit rusty these days, so I could be misunderstanding.

PS: "We" is not the AIRR Standards WG in this case. I don't think we should have an official opinion on implementation.

javh avatar Jul 02 '22 23:07 javh

where to store the data (.obs vs. .obsm vs. .X)

The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

@javh this sounds pretty much like what I would imagine! Interesting to have the BumpyMatrix in .X. In combination with mudata this actually makes a lot of sense and would be consistent to other modalities.

I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.

Other tools like CellTypist also populate the .obs with other data. This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice.

@bcorrie, I'll try to summarize here a bit the (typical) purpose of the different AnnData slots.

  • .obs has been designed to hold all cell-level metadata. I would stick to that and add the repertoire fields (disease state, tissue, etc.) to .obs to make them readily available for plotting and other scanpy functions. So far, AnnData does not distinguish between columns added by different tools. I like the idea of having more structure in .obs, but that's something we would need to discuss on the AnnData side.

  • .obsm has been designed to hold data matrices aligned to .obs, e.g. UMAP or PCA. One could argue that Rearrangement data is not a data matrix either. An alternative would be to add an "awkward array column" to adata.obs:

    adata.obs["airr_rearrangement"] = ak.Array(...)
    

    Yet another option would be to put the awkward array in .X as @javh suggests.

    The thing here is that having awkward columns requires some additional effort on the AnnData side and it is unclear how long this would take (see https://github.com/scverse/anndata/pull/647). Awkward arrays in obsm already work, but the PR needs to be wrapped up. Awkward arrays in X are not implemented either, but proably easier to fix than the pandas extension.


accessing data for plotting

If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...

I agree that cell-type, patient info etc. must be directly accessible for plotting (which they would be if they are in .obs). One can argue how useful a UMAP plot with V/D/J genes really is, similar to other information from the rearrangement schema. I would envisage something like this:

# if one really wants to plot rearrangement fields, add them to adata.obs explicitly
adata.obs["my_v_gene_col"] = ir.get("v_call", receptor_arm="VJ", dual_ir="primary")
sc.pl.umap(adata, color="my_v_gene_col")

That would be easy enough IMO and makes it explicit which "chains" to choose from the possibly multiple chains that are available for each cell.

grst avatar Jul 05 '22 11:07 grst

@javh in R, did you use any specific package for immune receptor analysis?

grst avatar Jul 05 '22 12:07 grst

@grst, I use the Immcantation packages, because I'm biased like that. :)

Regarding cell metadata in .osb, that seems cleaner to me as well. You're going to want to be able to slice and copy cell metadata (1) across modalities (eg, cell types from the gene expression object, clonal clusters from rearrangement, etc) and (2) from the top level .obs if you go the mudata route. Same for .osbm data from other modalities (eg, look at clones on the UMAP from gene expression, correlate gene expression with mutation frequency / selection pressure from rearrangement, etc). "Look in .obs for every modality" seems simplest.

javh avatar Jul 05 '22 18:07 javh

@grst, I use the Immcantation packages, because I'm biased like that. :)

I'm asking because I was wondering if we can extend this effort beyond the scverse world and have a 1:1 mapping to a MultiAssayExperiment representation.

grst avatar Jul 06 '22 10:07 grst

@grst From the MuData perspective, it is more general than MultiAssayExperiment. So it should be possible to align with how the data is stored there!

gtca avatar Jul 06 '22 12:07 gtca

After the input from @javh and the discussion in https://github.com/scverse/anndata/pull/647, it seems cleanest to me to go the mudata route with AIRR data stored in .X. Prerequisite is that anndata will get awkward array support for .X with two aligned, fixed-length axes.

.X would then look something like this:

data sketch
  • axis 0 holds observations (=cells) (fixed-length)
  • axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)
  • axis 2 holds 0-n chains (variable-length)

Data could be retrieved as follows:

mdata.mod["AIRR"][:, "junction_aa"].X

grst avatar Jul 18 '22 12:07 grst

For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix

@javh, do you happen to have an example where this is implemented in R? Specifically the AIRR assay data + bumpy matrix part?

grst avatar Aug 29 '22 14:08 grst

We are working on exporting AIRR Cell/Expression data in an anndata file, with AIRR metadata features as annotations. So not the Rearrangement fields, but rather the Repertoire metadata fields - so we know which cells have which subject/disease/etc characteristics.

I am not an expert on anndata and mudata so need to wrap my head around what we are talking about.

In the following:

axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)

I think these fixed lengths are the same, correct? That is for every entry on axis 0 there is an entry on axis 1? Or that length(mdata.mod["AIRR"].obs) == length(mdata.mod["Cell"].obs)

If want all of the AIRR Data on dimension 1 I do the following since dimension 1 has a name AIRR:

mdata.mod["AIRR"].X

And I get the original observations (cells) on dimension 0 in a similar way (and it would have a name as well)?

mdata.mod["GEX"].X

Would you suggest that these modality names are standardized for the different modalities of data, or are the arbitrary and up to the user?

bcorrie avatar Aug 29 '22 20:08 bcorrie