Add reversions/back-mutations as within-Auspice-computed branch label

Open corneliusroemer opened this issue 2 years ago • 6 comments

Context
Reference backfilling is a big problem in SARS-CoV-2 sequences. All the information one needs to identify reversions back to reference is included in the auspice.json. This would for example allow me to quickly check that a Nextclade reference tree doesn't contain any reversions.

Description
As a user, I would like nucleotide reversions (either only to reference, or to any previous state) to be highlightable on the tree, for example as a branch label, as we do with clades or sometimes Spike mutations.

Examples
UShER already implements this feature (they must do it in the backend), so there's clearly some interest in this feature beyond me. (image attached)

Possible solution
I could write a custom Python script that post-processes an auspice.json to add this as a branch annotation. But it's silly to do this with a script when it could be implemented within Auspice itself for all trees, for all users.
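For illustration, a minimal sketch of what such a post-processing script could look like. It assumes the dataset JSON layout quoted elsewhere in this thread (`branch_attrs.mutations.nuc` as a list of strings such as `C100T`, labels under `branch_attrs.labels`), and it assumes that the `from` base of the first mutation seen at a position on the way down from the root is the root (reference) state. The function name and the `reversions` label key are hypothetical, not existing augur or Auspice names.

```python
import re

# <from><1-based position><to>, e.g. "C100T"; "-" is a gap, "N" is unknown
MUT = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")


def label_root_reversions(tree, label_key="reversions"):
    """Add a branch label listing nucleotide reversions to the root state.

    Modifies `tree` (the "tree" entry of a dataset JSON) in place and returns it.
    """
    root_state = {}  # position -> base at the root, learned lazily
    stack = [tree]   # explicit stack: parents are always visited before children
    while stack:
        node = stack.pop()
        branch_attrs = node.get("branch_attrs", {})
        reversions = []
        for mut in branch_attrs.get("mutations", {}).get("nuc", []):
            match = MUT.match(mut)
            if match is None:
                continue  # skip anything that is not a simple substitution
            frm, pos, to = match.group(1), int(match.group(2)), match.group(3)
            # First mutation seen at this position: its `from` base is the root state
            root_state.setdefault(pos, frm)
            if to == root_state[pos] and to not in "N-":
                reversions.append(mut)
        if reversions:
            labels = node.setdefault("branch_attrs", {}).setdefault("labels", {})
            labels[label_key] = ", ".join(reversions)
        stack.extend(node.get("children", []))
    return tree
```

Usage would be along the lines of loading the file with `json.load`, calling `label_root_reversions(dataset["tree"])`, and writing the result back out with `json.dump`.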

corneliusroemer, Jan 17 '22 13:01

Thanks @corneliusroemer — I completely agree and think this feature will immensely help with interpreting trees, especially Omicron. I’m going to expand this issue slightly to encompass changes we've discussed regarding display of mutations more generally.

Current situation for branch labels

Branch labels must be defined within the dataset JSON, and we typically do this for clade and AA changes. Auspice only contains one piece of special behavior here: if the branch label key is aa then we selectively display the labels to avoid showing thousands of labels!

"branch_attrs": {
    "labels": {
        "aa": "ORF8: L84S",
        "clade": "19B",

Proposal for branch labels

Simplest (and most realistic short-term) would be a small augur script within nCoV. The better long-term solution would be to compute this within augur ancestral and augur translate and allow them to define branch labels which are subsequently exported. See https://github.com/nextstrain/augur/issues/720 for a proposal of how to define branch labels in node_data JSONs.

Current situation for mutation display

Currently dataset JSONs report mutations on a branch per-nucleotide and per-gene. This data typically comes from augur ancestral and augur translate, respectively, although for nCoV we are using nextclade for the AA changes. Whether Ns are included is influenced by parameters to those augur commands. The JSON structure looks like so:

"branch_attrs": {
    "mutations": {
        "nuc": [ "T1N", "T2N", ...],
        "S": ["T716I"]

The tooltips used in auspice behave as follows:

| tooltip | mutations shown | Ns | gaps (deletions) | insertions |
|---|---|---|---|---|
| branch / tip hover | subset listed, Ns ignored | count shown | treated as mut¹ | N/A² |
| branch shift+click | all shown, incl gaps | treated as muts | treated as mut¹ | N/A² |
| tip click | all listed w.r.t. root³ | click to copy list | treated as mut¹ | N/A² |

¹ No grouping is performed, e.g. if we have deletions of pos a, a+1, a+2 then we report three "mutations" (to -).
² We have no way for auspice to parse insertions!
³ Reversions are removed from mutations listed here.
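As a rough sketch of the grouping that footnote 1 says is currently missing, the following collapses mutations to `-` (or to `N`) at consecutive positions into runs. It assumes the `<FROM><POS><TO>` strings from the `nuc` list shown above; the function name `group_runs` is hypothetical and this is not how Auspice currently behaves.

```python
import re

# <from><1-based position><to>, e.g. "A10-" or "T1N"
MUT = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def group_runs(mutations, target="-"):
    """Return (start, end) position pairs for consecutive mutations to `target`."""
    positions = sorted(
        {
            int(m.group(2))
            for m in map(MUT.match, mutations)
            if m is not None and m.group(3) == target
        }
    )
    runs = []
    for pos in positions:
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos        # extend the current run by one position
        else:
            runs.append([pos, pos])  # start a new run
    return [tuple(run) for run in runs]


print(group_runs(["A10-", "C11-", "G12-", "T20-", "A30G"]))  # [(10, 12), (20, 20)]
```

The same helper with `target="N"` would collapse runs of Ns, so one visual element could stand in for each run.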

Proposed Display of Mutations

We'd like to be able to display mutations, deletions and insertions grouped by certain categories, however where to draw the boundaries isn't clear:

  1. Homoplasies
  2. Reversions to parent state. These may also be homoplasies!
  3. Reversions to root state (which we assume is the reference used for basecalling).
  4. Novel (mutation has only happened once and is not a reversion to the root).

(Detecting runs of deletions/insertions which are homoplasic isn't trivial, but it is if we consider them as a series of individual events, as we currently do.)
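One possible way to draw these boundaries, sketched in Python rather than in Auspice's own code. It treats every position independently (as the parenthetical above describes), calls a mutation a homoplasy if the same string occurs on more than one branch, and takes the `from` base of the first mutation on a path as the root state. Reading "reversion to parent state" as "reversion to any state held earlier on the path" is an interpretation, and the function names and category strings are hypothetical.

```python
import re
from collections import Counter

MUT = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")


def nuc_mutations(node):
    return node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])


def classify_mutations(tree):
    """Return {(node name, mutation): set of category names}."""
    # Pass 1: count the branches on which each mutation string occurs
    branch_counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        branch_counts.update(set(nuc_mutations(node)))
        stack.extend(node.get("children", []))

    # Pass 2: walk from the root, carrying the states seen so far at each position
    result = {}
    stack = [(tree, {})]
    while stack:
        node, inherited = stack.pop()
        # Copy so that sibling subtrees do not share path history
        history = {pos: list(states) for pos, states in inherited.items()}
        for mut in nuc_mutations(node):
            match = MUT.match(mut)
            if match is None:
                continue
            frm, pos, to = match.group(1), int(match.group(2)), match.group(3)
            seen = history.setdefault(pos, [frm])  # seen[0] is taken as the root state
            categories = set()
            if branch_counts[mut] > 1:
                categories.add("homoplasy")
            if to == seen[0]:
                categories.add("reversion_to_root")
            elif to in seen:
                categories.add("reversion_to_earlier_state")
            if not categories:
                categories.add("novel")
            seen.append(to)
            result[(node["name"], mut)] = categories
        for child in node.get("children", []):
            stack.append((child, history))
    return result
```

Returning a set per mutation keeps the overlap visible: a change can be both a homoplasy and a reversion, as noted in item 2.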

These could be computed within auspice itself, unless there is some reason to leverage nextclade for this?

Relatedly, we should definitely move towards the aesthetics employed by nextclade for displaying mutations with badges!

What about Insertions?

It'd be wise to consider how insertions could be provided here, but this may be worthy of a separate issue (and shouldn't hold up implementing the previous sections). VCF-like style would be <FROM><POS><TO> where <TO> is >1 characters and includes <FROM>. For instance, an insertion of TAG after base C at position 3 would be C3CTAG (this is example 5.2.2 from the VCF reference v4.2). However it's worth noting our deletion syntax doesn't follow VCF style. A style which follows our deletion syntax might be more along the lines of 3TAG. It's not clear to me how to reference subsequent changes in the insertion (e.g. if a later event modifies the inserted bases).
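To make the two notations concrete, here is a small sketch that converts the VCF-like form (`C3CTAG`) into the shorter position-plus-sequence form (`3TAG`). Neither form has been adopted; the function name and the restriction to the characters A, C, G, T and N are assumptions for illustration only.

```python
import re

# <FROM><POS><TO> where <TO> is longer than one character
VCF_LIKE_INSERTION = re.compile(r"^([ACGTN])(\d+)([ACGTN]{2,})$")


def vcf_like_to_compact(insertion):
    """Convert e.g. "C3CTAG" to "3TAG" by dropping the unchanged anchor base."""
    match = VCF_LIKE_INSERTION.match(insertion)
    if match is None:
        raise ValueError(f"not a VCF-like insertion: {insertion!r}")
    anchor, pos, alt = match.groups()
    if not alt.startswith(anchor):
        raise ValueError(f"<TO> must start with <FROM> in {insertion!r}")
    return f"{pos}{alt[len(anchor):]}"


print(vcf_like_to_compact("C3CTAG"))  # 3TAG
```

Going the other way needs the reference sequence, since the compact form does not carry the anchor base.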

jameshadfield, Jan 19 '22 04:01

> For instance, an insertion of TAG after base C at position 3 would be C3CTAG (this is example 5.2.2 from the VCF reference v4.2).

My 2c: please don't follow VCF off that cliff. :) I really, really wish VCF didn't include the base to the left of indels. It's distracting to include a base that does not change, it necessitated an additional special rule for insertions at the beginning of the sequence (the unchanging base to the right must be appended on the right, further ugh), and it complicates code that has to translate between VCF and other formats (for example requiring reference sequence input to convert to VCF when it would otherwise be unnecessary). The empty string is a perfectly valid <FROM> for a point insertion IMO. :) Some formats use "-" to avoid using the empty string. There are multiple better alternatives to VCF's base-to-the-left. </rant>

The rest of it sounds great! :)

AngieHinrichs, Jan 21 '22 00:01

I only worked with VCFs for a short while a few years ago but I'd second Angie here, it drove me nuts!

emmahodcroft, Jan 21 '22 08:01

In Nextclade, we use <position-before-insertion><inserted-sequence>, like list insertion conventions. The position to the left can be 0 (one-based indexing) or -1 (zero-based indexing) when the insertion precedes the reference.
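A sketch of how this convention behaves with one-based indexing, where position 0 means the insertion precedes the reference. This is an illustration in Python, not Nextclade's code, and the function name is hypothetical.

```python
import re

# <position-before-insertion><inserted-sequence>, one-based, e.g. "3TAG" or "0TT"
INSERTION = re.compile(r"^(\d+)([ACGTN]+)$")


def apply_insertion(reference, insertion):
    """Insert the sequence after the given one-based reference position."""
    match = INSERTION.match(insertion)
    if match is None:
        raise ValueError(f"cannot parse insertion: {insertion!r}")
    pos, inserted = int(match.group(1)), match.group(2)
    if pos > len(reference):
        raise ValueError(f"position {pos} is beyond the reference")
    # With one-based positions, the first `pos` bases stay to the left
    return reference[:pos] + inserted + reference[pos:]


print(apply_insertion("AACGT", "3TAG"))  # AACTAGGT
print(apply_insertion("AACGT", "0TT"))   # TTAACGT
```

This mirrors Python's own slice-insertion semantics, which is presumably what "list insertion conventions" refers to.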

rneher, Jan 21 '22 09:01

Update:

I have this working for the on-click info panel and just need to extend it to the on-hover panel as well. I think subsequent PRs can then:

  • collect runs of Ns / gaps into one visual element
  • implement the nice badges from Nextclade (I've got a proof of principle working here)
  • consider insertions. This has to start in augur I think.

(two images attached)

jameshadfield, Jan 26 '22 05:01

This looks super awesome and incredibly useful James!

emmahodcroft, Jan 26 '22 08:01