
WIP: Prototype a build showing an overview of all Pangolin lineages

Open huddlej opened this issue 3 years ago • 2 comments

Description of proposed changes

Adds a Snakemake profile with a single build that subsamples an existing filtered alignment of GISAID data to one sample per Pangolin lineage. The resulting tree shows the general relationships among these lineages.
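
The subsampling idea can be sketched in plain Python. This is only an illustration of picking one sample per lineage, not the workflow's actual implementation; the `one_per_lineage` helper, the `pango_lineage` column name, and the strain names are assumptions made for the example.

```python
import csv
import io
import random


def one_per_lineage(rows, lineage_field="pango_lineage", seed=0):
    """Pick one random record per lineage, skipping records with no lineage call."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_lineage = {}
    for row in rows:
        lineage = (row.get(lineage_field) or "").strip()
        if lineage:
            by_lineage.setdefault(lineage, []).append(row)
    return {lineage: rng.choice(group) for lineage, group in sorted(by_lineage.items())}


# Toy metadata table; the strain names are made up for illustration.
metadata_tsv = (
    "strain\tpango_lineage\n"
    "sample_A\tB.1.1.7\n"
    "sample_B\tB.1.1.7\n"
    "sample_C\tB.1.351\n"
    "sample_D\t\n"
)
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
picked = one_per_lineage(rows)

for lineage, row in picked.items():
    print(lineage, row["strain"])
```

Each lineage contributes exactly one tip, so the tree size grows with the number of lineages rather than the number of sequences.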

This is a proof-of-concept for an approach to a general view of all circulating lineages that could link out to lineage-specific builds.

Testing

An example of this build is also available on the staging server.

huddlej • Mar 19 '21 23:03

Thanks John! Doing a simple --group_by pango_lineage with --seq_per_group 1 was a very clever approach to the design problem. Here, I've just cleaned a few things up to:

  1. Provide color ordering that gives alphabetically adjacent lineages similar colors
  2. Use pango_lineage rather than pangolin_lineage to match with metadata field
  3. Provide PANGO-specific clades.tsv to help structure visualization
  4. Provide PANGO-specific Auspice config to default to coloring by PANGO lineage
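
As a rough illustration of item 1, one way to give alphabetically adjacent lineages similar colors is to sort the lineage names and step evenly through a hue range. This is a minimal sketch under that assumption, not the color scheme used in the build; the `lineage_colors` helper and its saturation and value settings are hypothetical.

```python
import colorsys


def lineage_colors(lineages, saturation=0.6, value=0.9):
    """Map each lineage to a hex color so that names adjacent in sorted order get nearby hues."""
    ordered = sorted(set(lineages))
    n = len(ordered)
    colors = {}
    for i, lineage in enumerate(ordered):
        # Sweep hue over [0, 0.75] (red to violet) so the first and last
        # lineages do not wrap around to the same color.
        hue = 0.75 * i / max(n - 1, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors[lineage] = "#{:02X}{:02X}{:02X}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


colors = lineage_colors(["B.1.351", "A.23.1", "B.1.1.7", "B.1.525"])
for lineage, hex_color in colors.items():
    print(lineage, hex_color)
```

Plain string sorting places `B.1.10` before `B.1.2`; a sort key that compares the numeric parts of the name as integers would keep sublineages in numeric order.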

The resulting output can be seen at: https://nextstrain.org/staging/ncov/lineages?tl=pango_lineage

If we decided to take this forward, I'd see a couple different things:

  1. Perhaps draw sequences from https://github.com/cov-lineages/pangoLEARN/blob/master/pangoLEARN/data/lineages.metadata.csv which represent the "gold standard" training data and should be less prone to mis-calls than random sequences.
  2. I do think it would be visually useful to distinguish between common lineages and rare lineages. After all, at the moment, just 3 lineages comprise ~60% of globally circulating viruses. I've thought to do this by assigning a special (just like how num_date is special) attribute to each tip for count. Auspice could then use count to size tip circle radii, a bit similar to how the map uses collapsed tip count to size deme circles.

I think that (2) would be a first step towards semantic zoom. Here, I'm thinking of an "overview" tree that scaffolds out genetic diversity (in this case scaffolded based on PANGO lineage) and shows tips as circles of various sizes. Clicking on a tip would bring up the normal tip info panel, but we could include a URL link to something like https://nextstrain.org/ncov/lineages/B.1.525 that would give a normal ~4000 tip Nextstrain tree of just B.1.525 samples. There might be more sophisticated ways to implement semantic zoom, but this would at least give some steps in this direction.
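
The count-based sizing and per-lineage links described above could look something like the following. This is a hedged sketch only: the `tip_attributes` helper, the radius bounds, and the square-root scaling are assumptions for illustration, and the `/ncov/lineages/<lineage>` URL pattern is the hypothetical one mentioned in this comment, not an existing Auspice or nextstrain.org feature.

```python
import math
from collections import Counter


def tip_attributes(lineage_calls, min_radius=3.0, max_radius=20.0,
                   base_url="https://nextstrain.org/ncov/lineages/"):
    """Compute a count, a circle radius, and a link for each lineage's representative tip."""
    counts = Counter(lineage_calls)
    if not counts:
        return {}
    largest = max(counts.values())
    attrs = {}
    for lineage, count in counts.items():
        # Scale with sqrt(count) so circle area, rather than radius,
        # grows roughly in proportion to the number of sequences.
        scale = math.sqrt(count / largest)
        attrs[lineage] = {
            "count": count,
            "radius": min_radius + (max_radius - min_radius) * scale,
            "url": base_url + lineage,  # hypothetical lineage-specific build
        }
    return attrs


# Toy lineage calls, one entry per sequence in the full dataset.
calls = ["B.1.1.7"] * 16 + ["B.1.525"] * 4 + ["A.23.1"]
attrs = tip_attributes(calls)
for lineage, info in attrs.items():
    print(lineage, info["count"], round(info["radius"], 1), info["url"])
```

Counts would be taken over the full filtered dataset before subsampling, since the overview tree itself keeps only one tip per lineage.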

trvrb • Mar 20 '21 15:03

This is pretty cool! Thanks for doing this John, it's very interesting.

I like Trevor's additions, and was also thinking the same thing as his point 1: we know some of the pango lineage samples sometimes end up in pretty random places in our trees, so it would be good if we could avoid inadvertently sampling one of those as a 'representative'.

I also like Trevor's ideas in 2 both for sizing the tips and in linking to more dedicated builds, perhaps at least for a few of the largest lineages.

emmahodcroft • Mar 22 '21 18:03

Superseded by the Nextclade tree.

rneher • Apr 07 '23 13:04