Tilt sampling towards focal region

Open trvrb opened this issue 3 years ago • 10 comments

Existing subsampling aimed at a roughly 1.5:1 ratio of regional focal samples to global contextual samples. This commit tilts the subsampling to a 3:1 ratio of regional focal samples to global contextual samples. The 2:1 ratio of late to early samples remains unchanged.

Existing

Desired subsampling counts

  • global: 1600
  • region: 2500
  • total: 4100
  • ratio of region to global: 1.56X

This PR

Desired subsampling counts

  • global: 1050
  • region: 3150
  • total: 4200
  • ratio of region to global: 3X

Late vs early remains at 2X

Relative to existing:

  • Region goes from 2500 to 3150 (1.26X)
  • Global goes from 1600 to 1050 (0.66X)
  • Early goes from 1400 to 1400 (1X)
  • Late goes from 2700 to 2800 (1.04X)
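
As a quick sanity check on the counts above, here is a minimal Python sketch that recomputes the region-to-global ratio and the fold changes. The dictionary layout and the function names are made up for illustration and are not part of the ncov workflow or its config.

```python
# Desired subsampling counts quoted in this PR description.
# The dict layout is hypothetical; it is not how the ncov workflow stores them.
existing = {"global": 1600, "region": 2500, "early": 1400, "late": 2700}
proposed = {"global": 1050, "region": 3150, "early": 1400, "late": 2800}


def region_to_global(counts):
    """Ratio of regional focal samples to global contextual samples."""
    return counts["region"] / counts["global"]


def fold_change(before, after):
    """Per-category fold change of `after` relative to `before`."""
    return {key: after[key] / before[key] for key in before}


changes = fold_change(existing, proposed)

print(f"existing region:global = {region_to_global(existing):.2f}")
print(f"proposed region:global = {region_to_global(proposed):.2f}")
for key, value in changes.items():
    print(f"{key}: {value:.2f}X")
```

Swapping in other candidate counts (for example a 2:1 or 2.5:1 split) only requires editing the `proposed` dictionary.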

Example workflow outputs

(This used @tsibley's new automatic branch AWS batch deploys. So cool.)

  • Africa: https://nextstrain.org/staging/trial/focal-sampling/ncov/africa
  • Asia: https://nextstrain.org/staging/trial/focal-sampling/ncov/asia
  • Europe: https://nextstrain.org/staging/trial/focal-sampling/ncov/europe
  • North America: https://nextstrain.org/staging/trial/focal-sampling/ncov/north-america
  • Oceania: https://nextstrain.org/staging/trial/focal-sampling/ncov/oceania
  • South America: https://nextstrain.org/staging/trial/focal-sampling/ncov/south-america

I think this is an improvement. The overall tree structure seems okay and the geographic inferences seem comparable to the existing code. However, for looking at emerging clades of interest, this seems preferable. For example:

  • B.1.526 in the US
    • Existing: https://nextstrain.org/ncov/north-america?f_country=USA&f_pangolin_lineage=B.1.526 with 2 tips
    • PR: https://nextstrain.org/staging/trial/focal-sampling/ncov/north-america?f_country=USA&f_pangolin_lineage=B.1.526 with 7 tips
  • B.1.1.7 in Europe
    • Existing: https://nextstrain.org/ncov/europe?c=country&f_pangolin_lineage=B.1.1.7&f_region=Europe with 270 tips
    • PR: https://nextstrain.org/staging/trial/focal-sampling/ncov/europe?c=country&f_pangolin_lineage=B.1.1.7&f_region=Europe with 366 tips

This isn't a big push, but I think it's in the right direction. Basically this comes from me wanting more resolution to clades / lineages in particular regions and going from ~2500 to ~3100 sequences in a region is decently helpful for this without causing much harm in terms of missing context.

If you think this change is too extreme we could dial back to 2:1 or 2.5:1 ratio of region to global from the PR 3:1 ratio.

trvrb avatar Feb 27 '21 06:02 trvrb

I agree, this is better.

rneher avatar Feb 27 '21 12:02 rneher

I agree, I think this is a good idea! :)

emmahodcroft avatar Feb 27 '21 15:02 emmahodcroft

  • B.1.526 in the US
    • Existing: https://nextstrain.org/ncov/north-america?f_country=USA&f_pangolin_lineage=B.1.526 with 2 tips
    • PR: https://nextstrain.org/staging/trial/focal-sampling/ncov/north-america?f_country=USA&f_pangolin_lineage=B.1.526 with 7 tips

B.1.526 is getting more tips there -- but it's also getting placed in 20A not 20C, which seems like a better place for it.

In the staging build's tree, somehow ORF3a:Q57H (G25563T) is getting pulled closer to the root than 20A ORF1b:314L (14408T), and then on the way to 20B there's a back-mutation ORF3a:H57Q (T25563G). On that big back-mutation polytomy, there's a C1059T (which would normally make 20C) -> 21575T -> another ORF3a:Q57H (G25563T) -> S:D253G (A22320G) -> ... the B.1.526 samples. [hand-waving...] not having enough background causes funny things to happen with the tree.

Someone else suggested "defining a backbone like nextclade does" as referenced in #564 -- that might help to at least make the major clade-defining mutations happen in the right order, and hopefully would result in sequences being placed in the right clades.

AngieHinrichs avatar Feb 28 '21 05:02 AngieHinrichs

Good catch Angie. Perhaps we should keep the 'early' sampling ratios more like they used to be - I imagine it might be bias in the earlier samples that's causing this, so including more global sequences alongside regional ones might ensure this comes out more stably.

We could alternatively also include a few representative sequences from each clade (I actually do this for CoVariants runs) - but as the major groups are still there it seems we'd likely need more sequences than just a couple. Perhaps we should try adjusting the ratios a little and running a few times, before merging?

emmahodcroft avatar Feb 28 '21 16:02 emmahodcroft

Thanks for the catch here @AngieHinrichs. For seasonal flu we curate a list of these "reference" sequences that are preferentially included. We've chosen these reference sequences to help keep clade structure behaving as it should (in addition to adding viruses that have lots of HI data).

I'm a bit dubious about adding "reference" viruses to include.txt for ncov as it would help this situation of wanting a correctly structured large-scale phylogeny, but there are uses of the ncov workflow for a smaller number of tips that target a particular outbreak, etc... where reference viruses would be a distraction.

I do worry that as more time passes the "early" sampling will get more and more diffuse and we'll run into these sorts of situations anyway. However, for the time being, I'll adjust sampling numbers in this PR to try to have a quick fix.

trvrb avatar Feb 28 '21 22:02 trvrb

I think we could include a 'backbone' set of sequences just for the nextstrain-profile build (so it hopefully shouldn't impact more focal builds). I think for me the main question would be: can we find a set of sequences that reliably helps classify even small clusters like the one here, or is this going to need to be more finely curated every now and then (which will be hard to sustain)?

FWIW I went through a few older runs and it does seem like we've historically been putting this lineage in 20C - so it doesn't necessarily just jump around in general:

  • https://nextstrain.org/ncov/north-america/2021-02-25?f_country=USA&f_pangolin_lineage=B.1.526
  • https://nextstrain.org/ncov/north-america/2021-02-24?f_country=USA&f_pangolin_lineage=B.1.526
  • https://nextstrain.org/ncov/north-america/2021-02-23?f_country=USA&f_pangolin_lineage=B.1.526
  • https://nextstrain.org/ncov/north-america/2021-02-15?f_country=USA&f_pangolin_lineage=B.1.526
  • https://nextstrain.org/ncov/north-america/2021-02-18?f_country=USA&f_pangolin_lineage=B.1.526

emmahodcroft avatar Mar 01 '21 09:03 emmahodcroft

So, I rebuilt these trial datasets:

  • Africa: https://nextstrain.org/staging/trial/focal-sampling/ncov/africa
  • Asia: https://nextstrain.org/staging/trial/focal-sampling/ncov/asia
  • Europe: https://nextstrain.org/staging/trial/focal-sampling/ncov/europe
  • North America: https://nextstrain.org/staging/trial/focal-sampling/ncov/north-america
  • Oceania: https://nextstrain.org/staging/trial/focal-sampling/ncov/oceania
  • South America: https://nextstrain.org/staging/trial/focal-sampling/ncov/south-america

with an updated sampling scheme of:

  • global early: 700
  • region early: 700
  • global late: 700
  • region late: 2100

to keep a total of 4200 and to give a 2:1 ratio of late to early and a 2:1 ratio of region to global.
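
For anyone who wants to check candidate schemes before triggering a rebuild, the sketch below sums the four buckets and derives both ratios. It assumes the scheme is written as a plain dictionary keyed by (scope, period); that representation and the helper name `marginal` are illustrative only, not the workflow's actual config format.

```python
# The four target counts listed above, keyed by (scope, period).
# This structure is a hypothetical stand-in for the real subsampling config.
scheme = {
    ("global", "early"): 700,
    ("region", "early"): 700,
    ("global", "late"): 700,
    ("region", "late"): 2100,
}


def marginal(counts, axis, label):
    """Sum every bucket whose key has `label` at position `axis`."""
    return sum(n for key, n in counts.items() if key[axis] == label)


total = sum(scheme.values())
region = marginal(scheme, 0, "region")
global_ = marginal(scheme, 0, "global")
early = marginal(scheme, 1, "early")
late = marginal(scheme, 1, "late")

print(f"total: {total}")
print(f"late:early = {late / early:g}:1")
print(f"region:global = {region / global_:g}:1")
```

Because the two ratios are computed from the same four buckets, changing a single bucket shows immediately how both ratios move together.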

Here, the intent was to enrich for region late (going from existing 1700 to PR 2400), and I accomplished this mainly by moving global late into region late.

See what you think. I can trigger another rebuild and give it a different name to have another go-round.

trvrb avatar Mar 01 '21 17:03 trvrb

Thank you Trevor and Emma! I took a quick look at the updated builds in divergence mode, and the clades look like they're falling out in the right order [one small exception, Asia build, 20G and some 20C-ish friends are coming out of their own little branch from 20A with {1059, 25563} lumped together, dunno why it's separate from the 1059 -> 25563 branch for 20C; but that's all I noticed.] Definitely an improvement.

AngieHinrichs avatar Mar 01 '21 22:03 AngieHinrichs

Thanks @AngieHinrichs! And here is the second run with updated parameters:

  • Africa: https://nextstrain.org/staging/trial/focal-sampling-2/ncov/africa
  • Asia: https://nextstrain.org/staging/trial/focal-sampling-2/ncov/asia
  • Europe: https://nextstrain.org/staging/trial/focal-sampling-2/ncov/europe
  • North America: https://nextstrain.org/staging/trial/focal-sampling-2/ncov/north-america
  • Oceania: https://nextstrain.org/staging/trial/focal-sampling-2/ncov/oceania
  • South America: https://nextstrain.org/staging/trial/focal-sampling-2/ncov/south-america

trvrb avatar Mar 01 '21 23:03 trvrb

Thanks! Asia looks good there but Africa got a little mixed up. See how ORF3a:Q57H (25563) compares in the two focal-sampling builds:

https://nextstrain.org/staging/trial/focal-sampling/ncov/africa?gt=ORF3a.57H&m=div

https://nextstrain.org/staging/trial/focal-sampling-2/ncov/africa?gt=ORF3a.57H&m=div

In the first, there is one big branch where ORF3a:Q57H (25563) leads to some 20A, then 20C (with ORF1a:T265I / 1059), and 20G and 20H emerge from 20C as expected.

In the second, down in 19A, somehow ORF3a:Q57H (25563) is getting pulled closer to the root than ORF1b:P314L (14408 which usually goes with 241, 3037 and 23403) and 20G appears to come from 19A. Up towards the top, 20H pops out of 20A -- there is no 20C in focal-sampling-2.

I lost track -- is there a difference in sampling schemes between focal-sampling and focal-sampling-2 or are they just different runs with the same settings?

AngieHinrichs avatar Mar 02 '21 04:03 AngieHinrichs