
Include Nextclade_pango as coloring / filter in Nextstrain profiles

Open trvrb opened this issue 2 years ago • 5 comments

Description of proposed changes

This updates both Nextstrain GISAID and Nextstrain open profiles to include Nextclade_pango as a coloring and filter in addition to the existing pango_lineage coloring and filter.

Nextclade_pango calls are run by the Nextstrain team directly on sequences from GenBank / GISAID. pango_lineage calls are taken verbatim from GenBank / GISAID.
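In practice this change amounts to adding a second Pango coloring and filter to the Auspice config used by each profile. A minimal Python sketch of that edit - the starting config, titles, and exact keys here are assumptions for illustration, not the actual profile diff:

```python
import json

# Hypothetical sketch: add a "Nextclade_pango" coloring and filter to an
# Auspice config alongside the existing "pango_lineage". The titles and the
# starting config below are assumptions, not the real profile files.
config = {
    "colorings": [
        {"key": "pango_lineage", "title": "Pango lineage (GISAID/GenBank)", "type": "categorical"},
    ],
    "filters": ["pango_lineage"],
}

def add_nextclade_pango(cfg):
    cfg["colorings"].append(
        {"key": "Nextclade_pango", "title": "Pango lineage (Nextclade)", "type": "categorical"}
    )
    cfg["filters"].append("Nextclade_pango")
    return cfg

add_nextclade_pango(config)
print(json.dumps(config, indent=2))
```

Keeping both keys side by side means users can toggle between the two call sets rather than one silently replacing the other.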

There is worry that having multiple versions of Pango calls will create user confusion, but Pango calls are inherently messy, so surfacing the fact that different methods give different results may not be a bad thing.

Testing

I've done limited testing with this so far. Will spin out a trial build to test with.

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • [ ] Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • [ ] Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • [ ] After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • [ ] Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

trvrb avatar Apr 14 '22 17:04 trvrb

Trial build running on GISAID data at https://github.com/nextstrain/ncov/actions/runs/2168780200. This will deploy to:

  • https://nextstrain.org/staging/ncov/gisaid/trial/nextclade-pango/global
  • https://nextstrain.org/staging/ncov/gisaid/trial/nextclade-pango/africa etc...

Trial build running on GenBank data at https://github.com/nextstrain/ncov/actions/runs/2168781757. This will deploy to:

  • https://nextstrain.org/staging/ncov/open/trial/nextclade-pango/global
  • https://nextstrain.org/staging/ncov/open/trial/nextclade-pango/africa etc...

trvrb avatar Apr 14 '22 17:04 trvrb

I've now done a number of anecdotal comparisons between the pango_lineage calls from GISAID (which should be the latest pangoLEARN) and Nextclade_pango. From this anecdotal investigation, it honestly feels like a wash.

In all the figures below I've colored the tree by Nextclade_pango calls and labeled tips with their pango_lineage calls.


pangoLEARN better

[Screenshot: BA.2.9]

One clear BA.2.9 virus called incorrectly as BA.2 by Nextclade.

[Screenshot: BA.2.10]

One clear BA.2.10 virus called incorrectly as BA.2 by Nextclade.


Nextclade better

[Screenshot: AY.39]

Two clear AY.39 viruses called as B.1.617.2 by pangoLEARN. Unclear what's going on with one virus called as AY.42 by pangoLEARN and AY.39.1 by Nextclade.

[Screenshot: BA.4 / BA.5]

BA.4 and BA.5 viruses properly called by Nextclade but called as BA.2 by pangoLEARN.


Wash

[Screenshot: BA.1.17]

One clear BA.1.17.2 virus called incorrectly as BA.1.17 by Nextclade. There's also a clade of BA.1.17 nested within BA.1.17.2 where it's unclear what the right call is.


@corneliusroemer: I think we should do a more systematic comparison here before adopting this. Again, anecdotally this feels like a wash and I'm worried about the user overhead of having two versions of Pango calls.

trvrb avatar Apr 14 '22 20:04 trvrb

Thinking slightly more broadly, making "correct" calls here (by either pangoLEARN or Nextclade) should mostly be about somehow calling a sequence correctly despite the sequence being shoddy. If I look at individual branches that have discrepancies between pangoLEARN and Nextclade, I see homoplasic mutations. pangoLEARN will have effectively weighted some mutations more strongly and Nextclade will have effectively weighted other mutations more strongly.

Feels like you almost need an error model to properly do this. Some sites will be less reliable than others.
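The error-model idea could be sketched as weighting each defining mutation by how reliable its site is, so homoplasic or error-prone positions count for less. Everything below - the lineage definitions, the weights, the helper names - is invented for illustration and is not a real model:

```python
# Illustrative only: score candidate lineages by a weighted match of their
# defining mutations, down-weighting unreliable (homoplasic/error-prone) sites.
# The lineage definitions and site weights below are made up for this sketch.
DEFINING = {
    "BA.2.9": {"ORF3a:78Y"},
    "BA.2.10": {"T16342C"},
}
# Low weight = site often wrong or homoplasic; high weight = reliable.
SITE_WEIGHT = {"ORF3a:78Y": 1.0, "T16342C": 0.9}

def score(lineage, observed_mutations):
    """Positive when reliable defining mutations are present, negative when missing."""
    defining = DEFINING[lineage]
    present = sum(SITE_WEIGHT.get(m, 0.5) for m in defining & observed_mutations)
    absent = sum(SITE_WEIGHT.get(m, 0.5) for m in defining - observed_mutations)
    return present - absent

def best_call(observed_mutations):
    """Pick the candidate lineage with the highest weighted score."""
    return max(DEFINING, key=lambda lin: score(lin, observed_mutations))
```

A real version would need per-site error rates estimated from data, which is exactly the part that's hard.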

trvrb avatar Apr 14 '22 20:04 trvrb

Hmm... but the older pangoLEARN calls by GenBank are clearly an issue relative to Nextclade calls. For example, take a look at:

Screen Shot 2022-04-14 at 1 24 51 PM

All these BA.1.15 and BA.1.15.1 viruses are being called as BA.1 by the older version of pangoLEARN.

Screen Shot 2022-04-14 at 1 27 32 PM

Also these BA.2.10 and BA.2.10.1 viruses are being called as BA.2 by the older version of pangoLEARN.

trvrb avatar Apr 14 '22 20:04 trvrb

That's a great way to compare, with labels! It would be great if I could check out the specific samples to figure out whether the tree placement may be wrong - that's possible, but not that likely. I guess I'll just check the same trees you looked at.

Let me try a careful comparison as I'm somewhat surprised by your findings. But grateful, too, since it's always good to have someone try to find problems!

Start with global build.

BA.2.9 Your first example. To find pangoLEARN false positives, I filter to pangoLEARN=BA.2.9 and color by Nextclade. Quite some disagreement. But it turns out this is entirely because pL (short for pangoLEARN) thinks position 22792 defines BA.2.9 - it doesn't; it's objectively ORF3a:78Y that's defining. Color by that mutation and you see how off pL is. pL has loads of false positives, on the order of twice as many calls as there should be.


Ok, the other way round: filter by Nextclade, color by pL. Identical, so no BA.2.9 seems to be missed by either pL or NC.

Result: Nextclade is 100% sensitive and 100% specific; pangoLEARN is 100% sensitive but only 50% specific.
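The numbers quoted in this thread can be computed with a few lines once a ground truth is fixed (here, carrying the defining mutation stands in for truly belonging to the lineage). Note that "specificity" as used in this thread is really precision/PPV - the fraction of a caller's positive calls that are true. A sketch with toy data:

```python
# Sketch: sensitivity and "specificity" of a lineage caller against a
# mutation-defined ground truth (e.g. "has ORF3a:78Y" standing in for
# "is truly BA.2.9"). "Specificity" here means precision/PPV, matching
# how the term is used in this discussion.
def sens_spec(calls, truth, lineage):
    tp = sum(1 for c, t in zip(calls, truth) if c == lineage and t)
    fn = sum(1 for c, t in zip(calls, truth) if c != lineage and t)
    fp = sum(1 for c, t in zip(calls, truth) if c == lineage and not t)
    return tp / (tp + fn), tp / (tp + fp)

# Toy data: 10 truly-BA.2.9 sequences among 100. pL calls 20 as BA.2.9
# (all 10 real ones plus 10 extras); Nextclade calls exactly the 10 real ones.
truth = [True] * 10 + [False] * 90
pl_calls = ["BA.2.9"] * 20 + ["BA.2"] * 80
nc_calls = ["BA.2.9"] * 10 + ["BA.2"] * 90

print(sens_spec(pl_calls, truth, "BA.2.9"))  # (1.0, 0.5)
print(sens_spec(nc_calls, truth, "BA.2.9"))  # (1.0, 1.0)
```

Catching everything while calling twice as many as exist is exactly the "100% sensitive, 50% specific" pattern described above.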

How does this square with your findings? Did you check both ways? I think pL is notorious for a high percentage of false positives - it doesn't quite understand what is defining and will pick random mutations that aren't in fact defining: overlearning.

The one Nextclade outlier may be a reversion of the defining mutation. pL and UShER have the advantage of picking up on unique diversity that shows something is in a lineage despite the main mutation having reverted. But that only makes Nextclade a bit less than 100% sensitive in practice - maybe 95%.

However, pangoLEARN can be as bad as 50% accurate, see above. UShER has the problem that it may miss important clades and sometimes assigns sequences with reversions to the wrong clade. It can go both ways and depends a lot on how correct the tree is. One would have to do a proper random sample to measure.

OK, I'll do one more where pL seems better per above.

BA.2.10 Defining mutation: T16342C (see pango issue or Nextclade reference tree)

Nextclade calls 38 of the 42 BA.2 sequences carrying T16342C as BA.2.10; the missing 4 are probably BA.2.10.1s. It doesn't call anything as BA.2.10 that lacks the defining mutation. Again, sensitivity 95-100%, specificity 100%.

pL however, as usual, catches twice too many: half don't cluster together and don't have the defining mutation. It does catch all the real BA.2.10s again, though.

So again, pL also has 95-100% sensitivity, but only 50% specificity.


That's the pattern I see: Nextclade misses very few and doesn't overcall.

pangoLEARN is sensitive - maybe 99% as opposed to Nextclade's 98% - but the specificity is awful, possibly as bad as 50%.

What pangoLEARN overcalls as BA.2.9 it undercalls as BA.2, so if we were to look at BA.2 it would be the other way round: pangoLEARN the less sensitive one, even though a few tens of misassignments don't matter among hundreds of BA.2.

Does this summary make sense to you @trvrb? This matches the gut feeling I had before. Happy to make it more quantitative.

Is there anything in this evaluation I'm missing from your perspective?

Does it make sense how I define ground truth? To me it seems obvious, since I know the issues etc. It could get hairy when there are many mutations and one reversion should be accepted, etc. But in the cases we looked at it's quite clear. Reversions are, after all, not that common, and if bad sequences are miscalled, that's more acceptable than good sequences being miscalled.

So my conclusions are a bit different, I wouldn't call it a wash or undecided - but maybe I'm missing something.

> Thinking slightly more broadly, making "correct" calls here (by either pangoLEARN or Nextclade) should mostly be about somehow calling a sequence correctly despite the sequence being shoddy.

Only assuming the caller isn't overlearnt, which pango is, having a specificity of 50%. I think you may have missed that with your way of checking.

I agree that this is challenging in principle. But just getting from 99% to 99.5% - getting the shoddy sequences right - is hard, and we're not at all there yet; pL struggles at an earlier step. For UShER and Nextclade your assessment is correct. One can quantify it, though, by sampling lineages on which there's pairwise disagreement - that way you're not wasting time on agreement.
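Sampling only the pairwise disagreements could look like the sketch below; the metadata column names (`pango_lineage`, `Nextclade_pango`) are assumptions about the build's metadata, not a confirmed schema:

```python
import random

# Sketch of the suggested evaluation: draw a random sample from only those
# sequences where the two callers disagree, so manual review isn't spent on
# the (much larger) set where they trivially agree.
def sample_disagreements(records, n=20, seed=0):
    disagree = [r for r in records if r["pango_lineage"] != r["Nextclade_pango"]]
    rng = random.Random(seed)
    return rng.sample(disagree, min(n, len(disagree)))
```

Each sampled record would then be adjudicated by hand (defining mutations, tree placement) to estimate which caller is right how often.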

> If I look at individual branches that have discrepancies between pangoLEARN and Nextclade, I see homoplasic mutations. pangoLEARN will have effectively weighted some mutations more strongly and Nextclade will have effectively weighted other mutations more strongly.

Yes, but pangoLEARN weighs the wrong mutations heavily much more often than Nextclade does. Nextclade, by virtue of using synthetic reference sequences, uses only the defining mutations. Yes, we're throwing away some information - but the benefit is that we're using the most valuable information, the mutations that should definitely be there. pL, however, is a black box: it can change behaviour from one training run to the next, it's not clear why it does what it does, it's not stable, etc.

> Feels like you almost need an error model to properly do this. Some sites will be less reliable than others.

This is not really where I see the biggest difference here. Maybe it would be in a comparison with UShER.

corneliusroemer avatar Apr 15 '22 01:04 corneliusroemer