ncov
WIP: Prototype a build showing an overview of all Pangolin lineages
Description of proposed changes
Adds a Snakemake profile with a single build that subsamples an existing filtered alignment of GISAID data to one sample per Pangolin lineage. The resulting tree shows the general relationships among these lineages.
This is a proof-of-concept for an approach to a general view of all circulating lineages that could link out to lineage-specific builds.
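As a rough illustration of the subsampling idea (not the actual Snakemake rule or profile in this PR), here is a minimal Python sketch that keeps one randomly chosen record per lineage. The strain names and the `one_per_lineage` helper are hypothetical; only the `pango_lineage` column name comes from this thread.

```python
import csv
import io
import random


def one_per_lineage(rows, field="pango_lineage", seed=0):
    """Pick one record per lineage; rows with an empty lineage are skipped."""
    rng = random.Random(seed)  # fixed seed so the pick is reproducible
    groups = {}
    for row in rows:
        lineage = row.get(field, "")
        if not lineage:
            continue
        groups.setdefault(lineage, []).append(row)
    # One representative per lineage, chosen at random within each group.
    return {lineage: rng.choice(members) for lineage, members in sorted(groups.items())}


# Hypothetical metadata, standing in for the real metadata TSV.
example = io.StringIO(
    "strain\tpango_lineage\n"
    "s1\tB.1.1.7\n"
    "s2\tB.1.1.7\n"
    "s3\tB.1.525\n"
    "s4\t\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
picked = one_per_lineage(rows)
for lineage, row in picked.items():
    print(lineage, row["strain"])
```

A random pick is the simplest choice here; any other rule for choosing the representative would slot into the same place.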
Testing
An example of this build is also available on the staging server.
Thanks John! Doing a simple `--group_by pango_lineage` with `--seq_per_group 1` was a very clever approach to the design problem. Here, I've just cleaned a few things up to:
- Provide color ordering that gives alphabetically adjacent lineages similar colors
- Use `pango_lineage` rather than `pangolin_lineage` to match with metadata field
- Provide PANGO-specific `clades.tsv` to help structure visualization
- Provide PANGO-specific Auspice config to default to coloring by PANGO lineage
The resulting output can be seen at: https://nextstrain.org/staging/ncov/lineages?tl=pango_lineage
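For anyone curious how the color ordering mentioned above could work, here is a minimal sketch. It is not the script used for this build. It assumes lineages are sorted with numeric components compared as numbers (so `B.1.2` comes before `B.1.10`) and then spread along a hue ramp; the helper names, the hue range and the saturation/value settings are all illustrative.

```python
import colorsys


def lineage_sort_key(lineage):
    """Sort key that compares numeric parts as integers, e.g. B.1.2 < B.1.10."""
    key = []
    for part in lineage.split("."):
        if part.isdigit():
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return key


def assign_colors(lineages):
    """Map each lineage to a hex color so neighbors in sort order get similar hues."""
    ordered = sorted(set(lineages), key=lineage_sort_key)
    n = len(ordered)
    colors = {}
    for i, lineage in enumerate(ordered):
        # Spread hues over part of the color wheel; the 0.75 cutoff is arbitrary.
        hue = 0.75 * i / max(n - 1, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.65, 0.85)
        colors[lineage] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


colors = assign_colors(["B.1.10", "B.1.2", "A", "B"])
for lineage, hex_color in colors.items():
    print(f"pango_lineage\t{lineage}\t{hex_color}")
```

The printed rows are only a guess at a tab-separated layout for a color file; the exact format the build expects should be checked against the ncov defaults.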
If we decided to take this forward, I'd see a couple different things:
- Perhaps draw sequences from https://github.com/cov-lineages/pangoLEARN/blob/master/pangoLEARN/data/lineages.metadata.csv which represent the "gold standard" training data and should be less prone to mis-calls than random sequences.
- I do think it would be visually useful to distinguish between common lineages and rare lineages. After all, at the moment, just 3 lineages comprise ~60% of globally circulating viruses. I've thought to do this by assigning a special attribute (just like how `num_date` is special) to each tip for `count`. Auspice could then use `count` to size tip circle radii, a bit similar to how the map uses collapsed tip count to size deme circles.
I think that (2) would be a first step towards semantic zoom. Here, I'm thinking of an "overview" tree that scaffolds out genetic diversity (in this case scaffolded based on PANGO lineage) and shows tips as circles of various sizes. Clicking on a tip would bring up the normal tip info panel, but we could include a URL link to something like https://nextstrain.org/ncov/lineages/B.1.525 that would give a normal ~4000 tip Nextstrain tree of just B.1.525 samples. There might be more sophisticated ways to implement semantic zoom, but this would at least give some steps in this direction.
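To make the `count` idea a bit more concrete, here is a hedged sketch of how per-lineage counts could be turned into tip radii and link-out URLs. Nothing here is an existing Auspice feature: the `tip_annotations` helper, the radius bounds and the square-root scaling are assumptions, and the URL pattern simply follows the hypothetical example above.

```python
import math
from collections import Counter


def tip_annotations(lineage_calls, min_radius=3.0, max_radius=20.0,
                    base_url="https://nextstrain.org/ncov/lineages/"):
    """Build a per-lineage dict with a count, a tip radius and a link-out URL."""
    counts = Counter(lineage_calls)
    if not counts:
        return {}
    largest = max(counts.values())
    annotations = {}
    for lineage, count in counts.items():
        # Square-root scaling so circle area, not radius, tracks the count.
        scale = math.sqrt(count / largest)
        annotations[lineage] = {
            "count": count,
            "radius": min_radius + (max_radius - min_radius) * scale,
            "url": base_url + lineage,  # hypothetical lineage-specific build
        }
    return annotations


# Hypothetical lineage calls, one entry per sequence in the full dataset.
calls = ["B.1.1.7"] * 4 + ["B.1.525"]
ann = tip_annotations(calls)
for lineage, info in ann.items():
    print(lineage, info["count"], info["radius"], info["url"])
```

Square-root scaling is one option; a log scale might work better if a few lineages dominate as strongly as they do now.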
This is pretty cool! Thanks for doing this John, it's very interesting.
I like Trevor's additions, and was also thinking the same thing as his point 1 - we know some of the pango lineage samples sometimes end up in pretty random places in our trees, so it would be good if we could avoid inadvertently sampling one of those as a 'representative'.
I also like Trevor's ideas in 2 both for sizing the tips and in linking to more dedicated builds, perhaps at least for a few of the largest lineages.
Superseded by the Nextclade tree.