airr-standards Extend Clone to single-cell context

Starting to think about this in the context of generating a lot of 10x VDJ data... it seems we will want to (eventually) have a way for Clones to contain cells (see https://github.com/airr-community/airr-standards/issues/273#issuecomment-568649516), instead of (or maybe in addition to) Rearrangements.

Just a marker for now, need to think more about what kind of representation would make sense...

Issues to be resolved:

[ ] How to represent multiple chains? Are they embedded in a single Clone object, do we have multiple Clone rows (which introduces other problems), do we create a separate CloneChain object, or something else?
[ ] What are the key relationships with other AIRR objects and how/where are the identifiers stored?

Jan 16 '20 21:01 scharch

Should the Clone definition also contain both chains? Right now it seems to support only one.

Jul 14 '20 15:07 schristley

@schristley, I think it will have to support germline_alignment (and all the related fields) as an array, of some sort, yes.

Jul 14 '20 16:07 scharch

In a separate call, @bussec and I discussed how to do this flexibly. It would be nice not to be limited to strictly two chains. It also is hard to come up with a terminology that covers both T and B cells. There was also the desire to be able to annotate non-productive chains. Using a dictionary or array object should allow multiple entries. Using a controlled vocabulary, we could use T and B cell specific terms to annotate/tag the chains. At the same time, we should make it easy to access the primary annotations directly.

Jul 15 '20 18:07 schristley

It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

Jul 15 '20 18:07 javh

It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

I heard a suggestion like "d-containing chain" and "not-d" but there's the concern it's not very robust. My question would be, do we have to have the same name? Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

Sure, tools would have to handle them specifically, but wouldn't they kinda have to do that anyways, like tools would want to know regardless if it was IGH versus TRB?

Jul 15 '20 19:07 schristley

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

@schristley I was just coming here to suggest essentially the same thing.

It'll still get complicated, though: if each chain is a dict with keys something like {id, type, is_productive}, then a Cell would be an array of those and the "members" of Clone ends up being an array of arrays of dicts. Does that seem workable?

At the same time, we should make it easy to access the primary annotations directly.

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

Jul 15 '20 19:07 scharch

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence before fetching data), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

https://doi.org/10.1016/j.cell.2019.05.007

Jul 15 '20 19:07 javh

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

Anyway, I think that having a type field would help with the parsing you are concerned about.

{
    cell: 'cell_id',
    type: 'b_cell',
    heavy_chain: ['sequence_id1'],
    light_chain: ['sequence_id2', 'sequence_id3']
}

But probably even better would be something like

{
    cell: 'cell_id',
    type: 'b_cell',
    chains: [
        { sequence: 'sequence_id1', type: 'heavy_chain', ... },
        { sequence: 'sequence_id2', type: 'light_chain', ... },
        { sequence: 'sequence_id3', type: 'light_chain', ... }
    ]
}
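
For illustration only, here is a minimal Python sketch of how a tool might read the second layout. The field names (`cell`, `type`, `chains`, `sequence`) come from the example above; the helper `chains_by_type` is hypothetical and not part of any AIRR schema or library.

```python
from collections import defaultdict

# Hypothetical Cell record following the layout sketched above.
cell = {
    "cell": "cell_id",
    "type": "b_cell",
    "chains": [
        {"sequence": "sequence_id1", "type": "heavy_chain"},
        {"sequence": "sequence_id2", "type": "light_chain"},
        {"sequence": "sequence_id3", "type": "light_chain"},
    ],
}


def chains_by_type(cell_record):
    """Group the sequence IDs of one cell by their chain type."""
    grouped = defaultdict(list)
    for chain in cell_record.get("chains", []):
        grouped[chain["type"]].append(chain["sequence"])
    return dict(grouped)


grouped = chains_by_type(cell)
```

With this shape a parser only has to look at the `type` of each chain, rather than probing for the presence of `heavy_chain`, `alpha_chain`, and so on.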

Jul 15 '20 19:07 scharch

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

I still need to think through the Cell-Clone relationship, but focussing purely on Clone right now, we could still have explicit fields name, but with generic names (chain_1, chain_2, primary_chain, secondary_chain, long_chain, short_chain). Actually, as a matter of fact, maybe keep the exact same Clone fields we have right now (v_call, j_call, etc.) but just add new fields for the second chain. And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA

This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.

Jul 15 '20 19:07 schristley

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

I don't know. Probably only if a need arises. Though, naively, it looks trivial to my eye. You use clone_id as the row key and exclude the sequences field. If you need the individual sequence level data, you'd then search the Rearrangement data by clone_id. Then it's just a clone summary table. But, that's without considering Cell.
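
A rough Python sketch of that naive conversion, assuming the Clone records are already loaded as dictionaries. The sample values and the helper `clones_to_tsv` are made up for illustration; only `clone_id` as the row key and the exclusion of `sequences` come from the description above.

```python
import csv
import io

# Hypothetical Clone records; the values are invented for this example.
clones = [
    {"clone_id": "c1", "v_call": "IGHV1-2", "j_call": "IGHJ4", "sequences": ["s1", "s2"]},
    {"clone_id": "c2", "v_call": "IGHV3-23", "j_call": "IGHJ6", "sequences": ["s3"]},
]


def clones_to_tsv(records, exclude=("sequences",)):
    """Write one row per clone, keyed by clone_id, dropping excluded fields."""
    columns = [key for key in records[0] if key not in exclude]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # silently skip the excluded array fields
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


tsv = clones_to_tsv(clones)
```

This only covers the single-chain Clone as it stands; it does not address how multiple chains or Cells would be flattened.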

Some sort of type field seems like it might be a solution. Though, you'd still have to do a check of some kind, but it would be a simpler check.

The way Clone is set up right now seems really geared towards IGH/TRB/TRD data only. Hrm.

Jul 15 '20 19:07 javh

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

I'm still thinking through this. A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells. Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?

Jul 15 '20 20:07 schristley

And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?

Sort of? Not the way it's currently set up with only one chain, but this should be correct under the extension models we are discussing.

Jul 16 '20 16:07 scharch

And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

An optional extended data structure like you suggested above for providing additional chains.

Jul 16 '20 16:07 schristley

A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

"better" only in a data structure sense. As a Cell belongs to one Clone, it could be represented with a single field clone_id, while a Clone containing many Cells would require an array of cell_ids.

Jul 16 '20 17:07 schristley

OK, we are currently implementing 10X data loading for rearrangements/clones/cells/expression.

We can currently load everything in principle and in practice, based on the current AIRR Spec.

The problem arises when you try to map a specific tool chain (e.g. 10X cellranger) to the spec, in particular one that generates all of the data types as part of a single processing run. That is when everything blows up.

I think this issue is the crux of the matter - and we appear to have been avoiding it since July 2020 8-)

In the 10X case you get:

- A single clone_id has multiple chains. I have seen two and three chains thus far for a single clone_id.
- Our current Clone object is focused on a single chain only.
- Pretty well all fields in the Clone object that describe the clone (VDJ calls, junction, alignment, sequences) need to be different for each of the chains in the clone (not just the VDJ calls as discussed above). I count 18 fields based on a quick count.

So we can't really load 10X data in a particularly logical or coherent fashion when you try to do all of Rearrangements/Clones/Cells in a single repository. I am pretty sure this would also mean that you couldn't represent said data in a set of files on disk using a Manifest to tie them together...

This seems like something that should be pretty high on the priority list if we really want to claim that we have a working Rearrangement/Clone/Cell spec 8-)

Feb 09 '22 18:02 bcorrie

So something like this

v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA

This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.

I am not sure this would work, given that I think there are on the order of 18 fields that would need different values for multiple chains...

It seems to me that a Clone object should be an array of N CloneChains (where N is small (1-3?) but flexible) with each of the 18 fields that describe the "inferred ancestor of the clone" in the CloneChain object???

I also wonder if we should drop the sequences array, since you can look up the sequences associated with a CloneChain using the clone_id in the Rearrangements

Feb 09 '22 18:02 bcorrie

Alternatively we could leave the Clone object as is, but treat it as a CloneChain object, and store multiple CloneChain objects with the same clone_id. You would then link multiple chains that are associated with the same Clone through the clone_id.

This is how we are going to load data for now, as this is really the only way to link multiple chains...
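
As a sketch of what that linking looks like on the reading side, assuming each stored record is a single-chain Clone: the records can be grouped by `clone_id` to recover the multi-chain clone. The `chain_type` field is borrowed from the YAML sketch earlier in this thread and is not confirmed as part of the schema; the values and the helper `group_by_clone` are hypothetical.

```python
from collections import defaultdict

# Hypothetical single-chain Clone ("CloneChain") records sharing clone_ids.
clone_chains = [
    {"clone_id": "clone1", "chain_type": "TRB", "v_call": "v_call_beta"},
    {"clone_id": "clone1", "chain_type": "TRA", "v_call": "v_call_alpha_1"},
    {"clone_id": "clone1", "chain_type": "TRA", "v_call": "v_call_alpha_2"},
    {"clone_id": "clone2", "chain_type": "TRB", "v_call": "v_call_beta"},
]


def group_by_clone(records):
    """Collect all chain records that carry the same clone_id."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["clone_id"]].append(record)
    return dict(grouped)


by_clone = group_by_clone(clone_chains)
chain_types = [chain["chain_type"] for chain in by_clone["clone1"]]
```

Nothing in this layout limits a clone to two chains, so the three-chain case (one TRB, two TRA) needs no special handling.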

Feb 09 '22 19:02 bcorrie

I don't have a great suggestion for this right now, but I think this intersects with how we might want to think about Receptor. There's a couple things going on in the current Clone schema - properties of specific observed sequences (sequences, junction, etc) and properties of the naive ancestor that are common to all the observed members of a clone (v_call, germline_alignment, etc). The latter seems like something we can separate out into an object and use for both Clone and Receptor and then nest under primary_chain and secondary_chain.

TRA is going to be a major problem, as it's pretty common to get more than one productive TRA transcript. Do we want Clone to support more than two chains? If so, do we add more chains or nest under the relevant chain somehow? If not, do we want to allow for multiple clone_id per sequence?

Feb 09 '22 19:02 javh

TRA is going to be a major problem, as it's pretty common to get more than one productive TRA transcript. Do we want Clone to support more than two chains? If so, do we add more chains or nest under the relevant chain somehow? If not, do we want to allow for multiple clone_id per sequence?

Yes, I have seen some data from 10X that have multiple TRAs (3 chains per clone) - that is where I started down this rabbit hole of trying to figure out how we should curate this. I think the array of CloneChain could be pretty flexible and maybe handle this...

Good point about separating out the info of the naive ancestor from the observed properties...

Feb 09 '22 20:02 bcorrie

I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements. Then each cell can hold (point to?) an arbitrary number of rearrangements as needed. The inferred_ancestor also becomes a cell. And all this is more biologically "correct," too.

I also wonder if we should drop the sequences array, since you can look up the sequences associated with a CloneChain using the clone_id in the Rearrangements

No, you can't, or at least you're not guaranteed to be able to. If your Clone of interest is from the original/primary analysis, it might work, but if it's from a secondary analysis or reanalysis, the cell's clone_id will forevermore point only to the first one.

Feb 10 '22 17:02 scharch

I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements.

I like this. I'm not sure how to implement it. We'll have to figure out what to do about the _count fields, especially umi_count. But, it also gets ahead of the issue of how to extend the Tree schema to paired VH:VL lineage reconstruction.

Feb 21 '22 18:02 javh

I think that umi_count can be left as is (will be null for this case) and the definition of clone_count can be expanded slightly to include the number of Cells in the Clone.

It seems to me that the bigger lift will be letting tools know what type of data they are looking at, but maybe we can get away with just a cell/chain flag?

Feb 22 '22 19:02 scharch

I think that umi_count can be left as is (will be null for this case) and the definition of clone_count can be expanded slightly to include the number of Cells in the Clone.

The clone_count description has been updated to also mention cells...

It seems to me that the bigger lift will be letting tools know what type of data they are looking at, but maybe we can get away with just a cell/chain flag?

If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?

Feb 22 '22 19:02 schristley

The clone_count description has been updated to also mention cells...

OK, I wasn't reading it like that, but you're right. @javh does this way of doing it satisfy you?

If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?

I guess, but I thought we've been trying to avoid that kind of two-step look up...

Feb 22 '22 19:02 scharch

I still think that the simplest solution is to allow Clone to contain Cells as an alternative to Rearrangements.

I like this. I'm not sure how to implement it. We'll have to figure out what to do about the _count fields, especially umi_count. But, it also gets ahead of the issue of how to extend the Tree schema to paired VH:VL lineage reconstruction.

I like it too. Our challenge is how to handle the identifiers. Right now Clone has a sequences field which holds all the rearrangement IDs. I have repertoires where there are thousands upon thousands of rearrangement records that make up a clone. Sticking such a huge array in Clone is kind of ridiculous... We are talking about a 1-N relationship, and it's always more efficient (from a data structure perspective) to store the link on the N side, i.e. the rearrangement table.

Now I suppose Clone could have a cells field which references the cell IDs, and you might make the argument that there will be fewer cell records... But honestly, I'm seeing single cell experiments that do upwards of 100K cells, so how long will that hold?

It's the same situation with a 1-N relationship between clone and cell, so it makes sense to put the clone_id inside of Cell instead of having a list of cell IDs in clone.
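
To make the data-structure argument concrete, here is a small Python sketch in which the link lives only on the Cell side and the list of cells per clone is rebuilt on demand. The record contents and the helper `cells_per_clone` are hypothetical.

```python
from collections import defaultdict

# Hypothetical Cell records carrying the link on the N side of the 1-N relation.
cells = [
    {"cell_id": "cellA", "clone_id": "clone1"},
    {"cell_id": "cellB", "clone_id": "clone1"},
    {"cell_id": "cellC", "clone_id": "clone2"},
]


def cells_per_clone(cell_records):
    """Rebuild the clone -> cell_ids mapping from the Cell records."""
    members = defaultdict(list)
    for cell in cell_records:
        members[cell["clone_id"]].append(cell["cell_id"])
    return dict(members)


members = cells_per_clone(cells)
cell_counts = {clone_id: len(ids) for clone_id, ids in members.items()}
```

The Clone record itself then stays the same size regardless of how many cells belong to it.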

Feb 22 '22 20:02 schristley

If you look at a rearrangement record for the clone, it will have a cell_id, that is one indication?

I guess, but I thought we've been trying to avoid that kind of two-step look up...

My perception is if you were working in a single cell context, your workflow might be like this:

1. Query the studies/repertoires of interest.
2. Query the cells based upon the repertoire IDs.
3. If clone data is wanted for cells, get using the clone_id in Cell.
4. If rearrangement data is wanted for cells, query rearrangements using cell_id.
5. If receptor data is wanted for cells, get using the receptor_id in Cell.

So you will be working through the Cell objects to get to other data.
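
A toy Python version of that workflow, using in-memory lists as stand-ins for repository queries. All records and helper names here are hypothetical, and the receptor lookup is left out for brevity; the only assumptions taken from the steps above are that Cell carries `clone_id` and that Rearrangement carries `cell_id`.

```python
# Hypothetical in-memory stand-ins for what a repository would return.
cells = [
    {"cell_id": "cellA", "repertoire_id": "rep1", "clone_id": "clone1"},
    {"cell_id": "cellB", "repertoire_id": "rep1", "clone_id": "clone1"},
    {"cell_id": "cellC", "repertoire_id": "rep2", "clone_id": "clone2"},
]
rearrangements = [
    {"sequence_id": "s1", "cell_id": "cellA"},
    {"sequence_id": "s2", "cell_id": "cellA"},
    {"sequence_id": "s3", "cell_id": "cellB"},
    {"sequence_id": "s4", "cell_id": "cellC"},
]
clones = {"clone1": {"clone_id": "clone1"}, "clone2": {"clone_id": "clone2"}}


def cells_for_repertoires(repertoire_ids):
    """Select the cells that belong to the repertoires of interest."""
    return [c for c in cells if c["repertoire_id"] in repertoire_ids]


def clone_for_cell(cell):
    """Fetch clone data through the clone_id stored in Cell."""
    return clones.get(cell["clone_id"])


def rearrangements_for_cell(cell):
    """Fetch rearrangement data by querying on cell_id."""
    return [r for r in rearrangements if r["cell_id"] == cell["cell_id"]]


selected = cells_for_repertoires({"rep1"})
seq_ids = {
    c["cell_id"]: [r["sequence_id"] for r in rearrangements_for_cell(c)]
    for c in selected
}
```

Every lookup starts from a Cell record, which is the point being made: in a single-cell context the tool already knows it is working with cells.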

Feb 22 '22 20:02 schristley

Sticking such a huge array in Clone is kind of ridiculous... We are talking about a 1-N relationship, and it's always more efficient (from a data structure perspective) to store the link on the N side, i.e. the rearrangement table.

I get it, but Cells can be members of multiple Clones and, more importantly, we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC. So if my Clone of interest was generated by some sort of post-publication meta/re-analysis, you have to (as far as I can see) put cell_id into Clone instead of vice versa...

Feb 22 '22 20:02 scharch

My perception is if you were working in a single cell context, your workflow might be like this:

Query the studies/repertoires of interest.

Query the cells based upon the repertoire IDs.

If clone data is wanted for cells, get using the clone_id in Cell.

If rearrangement data is wanted for cells, query rearrangements using cell_id.

If receptor data is wanted for cells, get using the receptor_id in Cell.

So you will be working through the Cell objects to get to other data.

I have to think more about this, but as a general matter you are right that I will know if I'm working in a Cell context or a Rearrangement context. Is that enough, though?

Feb 22 '22 20:02 scharch

I get it, but Cells can be members of multiple Clones

Hmm, that's challenging as that implies an N-N relationship, but this isn't the biology, right, as a cell only belongs to one clone? So the multiple Clones come from running different analyses?

and, more importantly, we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC. So if my Clone of interest was generated by some sort of post-publication meta/re-analysis, you have to (as far as I can see) put cell_id into Clone instead of vice versa...

I think this can be handled with data_processing_id, so Cell needs clone_id but also the data_processing_id that computed the clone. This doesn't handle all possibilities though; it essentially turns the N-N relationship back into 1-N, but if we truly need N-N regardless of data processing, then we don't have much choice.
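
One way to picture this, as a sketch only: treat each cell-to-clone assignment as scoped by the data processing run that produced it, so that within one run a cell still maps to at most one clone. The record layout, the run labels, and the helper `clone_of` are all hypothetical and not a proposal for the actual schema.

```python
# Hypothetical assignments: each pairs a cell with the clone it was given
# by one particular data processing run.
assignments = [
    {"cell_id": "cellA", "data_processing_id": "primary", "clone_id": "clone1"},
    {"cell_id": "cellA", "data_processing_id": "reanalysis", "clone_id": "clone9"},
    {"cell_id": "cellB", "data_processing_id": "primary", "clone_id": "clone1"},
]


def clone_of(cell_id, data_processing_id):
    """Return the clone of a cell within one run, or None if unassigned."""
    matches = {
        a["clone_id"]
        for a in assignments
        if a["cell_id"] == cell_id
        and a["data_processing_id"] == data_processing_id
    }
    if len(matches) > 1:
        # Within a single run the relationship is expected to be 1-N.
        raise ValueError("cell assigned to several clones in one run")
    return next(iter(matches), None)
```

Across runs the same cell can appear under different clones, which is the N-N case; fixing the run brings it back to 1-N.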

Feb 22 '22 20:02 schristley

we've set it up so that a Cell record is supposed to be more-or-less immutable in the ADC.

I wonder if you mean immutable or if you mean a singleton? Nothing in the ADC is really immutable, any of the records could be updated with additional identifiers if new data processing is performed, which is then loaded into the ADC.

But I think I understand your point, if you pull out Cells from ADC then do a re-analysis of clones, it makes sense that you are generating clone data de novo while the Cell data stays the same...

Feb 22 '22 20:02 schristley
