Heatmap coloring [Feature request]
As an alternative to subsampling the genomes shown at each node, I would like the option to color each node according to its relative density (the number of genomes in that node).
Would it be possible to add a "heatmap" color option that scales the color by the number of genomes (density) in a node relative to the overall number of genomes?
I'm not quite sure what you mean here -- are you asking for a visualisation where internal nodes (branches) are coloured according to the proportion of descendants they have? Currently each terminal node in the visualisation represents one genome.
Hi James, thanks for looking into my request!
I can see it was a little unclear on reading it again. Not quite the number of descendants (which I assume would also include the genomes in all child nodes), but rather the exact number of genomes annotated to each specific node (excluding any parents and descendants). Basically, I would like to be able to color nodes by their weight, where the weight is the number of genomes annotated to the node. Something similar to the image in the link below, but where green nodes represent many genomes in an internal node and red nodes represent few genomes.
For example:
- (max) genomes: dark green
- 100 genomes: light green
- 10 genomes: light red
- 1 genome (or min): dark red

Feel free to choose any color range; hue or transparency may be an even better option! There is a rough sketch of the mapping I have in mind after the link below.
https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0147475.g002
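This is a minimal sketch in Python of the kind of count-to-color mapping I mean (the log scaling and the exact color values are only my assumption of how it could work; nothing here exists in Auspice):

```python
import math

def count_to_color(count, max_count):
    """Map a per-node genome count onto a red-to-green gradient.

    Counts are log-scaled so that a single genome maps to red and the most
    heavily populated node maps to green; everything else falls in between.
    """
    if max_count <= 1:
        frac = 1.0
    else:
        frac = math.log10(max(count, 1)) / math.log10(max_count)
    red = int(255 * (1 - frac))
    green = int(255 * frac)
    return "#{:02x}{:02x}00".format(red, green)

# e.g. with a maximum of 100 genomes in any one node:
for count in (1, 10, 100):
    print(count, count_to_color(count, 100))
```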
I still don't quite understand. In a phylogenetic tree like the one displayed in auspice, nodes are either terminal (tips), in which case they represent a single sampled genome, or internal, in which case they represent an inferred (unsampled) genome.
I may not be understanding either, but is the idea that we represent the total number of identical genomes? So on an internal node, pre-subsampling, we'd count how many genomes are identical to that internal node and colour by this? However, we don't strictly sample based on genetic identity alone (we still end up with identical genomes in the tree), so I don't know how these would be counted. And again, I may be misinterpreting entirely...
I suppose it is easier if we look at the same tree (as you have quite a few options). I'm looking at the unrooted tree with divergence branch lengths. In this tree I wonder how many genomes end up in the main 4-6 nodes compared to the leaf nodes. First of all, I think that could help differentiate genomes containing sequencing errors (mutations that appear only very few times) from actually established mutations. Looking at this graph over time, perhaps one could see a shift in which mutations are the most common to spread, or whether there seems to be a stock from which a few mutations may develop in individuals, for example.
Thanks again, and thanks for the platform; it is a very informative tool!
This example explains exactly what I am looking for. The "count" in this case would represent the number of assemblies in each internal node. Again, this would make little sense for the time-adjusted data; it is easier to imagine in the unrooted tree, where each genome is assigned to a node defined by which SNPs it has, and all other genomes with the same set of SNPs are assigned to the same node. I've put a rough sketch of what I mean after the link below.
https://stackoverflow.com/questions/45430451/how-to-color-the-density-of-dots-in-scatter-plot-using-r
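A minimal sketch in plain Python, where the snps_per_genome input is entirely made up for illustration:

```python
from collections import defaultdict

# Hypothetical input: genome name -> the set of SNPs it carries relative
# to the reference (names and SNPs invented for this example).
snps_per_genome = {
    "genomeA": frozenset({"C241T", "A23403G"}),
    "genomeB": frozenset({"C241T", "A23403G"}),
    "genomeC": frozenset({"C241T"}),
}

# Genomes carrying exactly the same SNP set end up on the same "node".
genomes_per_node = defaultdict(list)
for genome, snps in snps_per_genome.items():
    genomes_per_node[snps].append(genome)

# The size of each group is the weight I would like to color by.
for snps, genomes in genomes_per_node.items():
    print(sorted(snps), len(genomes))
```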
If you can come up with an algorithm to assign these density values across the tree, then Auspice will be able to display them similarly to your example. But it's not clear to me what these density values would mean in a phylogenetic context, or how they would be constructed.
I can imagine a script which iterates over all sequences not in the current build (i.e. those excluded via subsampling) and associates each one with the closest node (or set of nodes) in the tree via Hamming distance. I'm not sure if this is what you're after, but it might be an interesting experiment.
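A minimal sketch of what such a script could look like (pure Python; the input dictionaries and the node_attrs layout mentioned at the end are assumptions, not part of the current ncov workflow):

```python
from collections import Counter

def hamming(seq_a, seq_b):
    """Count differing positions between two aligned sequences, ignoring Ns."""
    return sum(a != b for a, b in zip(seq_a, seq_b) if a != "N" and b != "N")

def assign_excluded_genomes(node_sequences, excluded_sequences):
    """Attach each subsampled-out genome to its closest node in the tree.

    node_sequences:     node name -> aligned (reconstructed or observed) sequence
    excluded_sequences: genome name -> aligned sequence
    Returns a Counter mapping node name -> number of excluded genomes attached.
    """
    counts = Counter()
    for name, seq in excluded_sequences.items():
        closest = min(node_sequences, key=lambda node: hamming(seq, node_sequences[node]))
        counts[closest] += 1
    return counts

# Toy example:
counts = assign_excluded_genomes(
    {"nodeA": "ACGT", "nodeB": "ACGA"},
    {"genome1": "ACGA", "genome2": "ACGA", "genome3": "ACGT"},
)
print(counts)  # Counter({'nodeB': 2, 'nodeA': 1})

# These counts could then be exported as a continuous coloring in the dataset
# JSON (e.g. a node_attrs entry like {"genome_density": {"value": 2}}) so that
# Auspice colours the tree by them, much like your scatter-plot example.
```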
I'd view this issue as an aspect of the "semantic zoom" that we've discussed in Auspice, where you could collapse multiple nodes into nodes with larger radii. I do think it's important to consider semantic zoom as a strategy to work with larger data sets.
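For illustration, here is a rough sketch of that kind of collapsing using Bio.Phylo (the divergence threshold, the radius rule and the toy Newick string are arbitrary; none of this exists in Auspice yet):

```python
from io import StringIO
from Bio import Phylo

def collapse_shallow_clades(clade, max_depth=0.001):
    """Recursively collapse clades whose deepest tip is within max_depth of the
    clade root, returning (collapsed clade, number of tips absorbed). The tip
    count could then set the radius of the node drawn in place of the clade."""
    if clade.is_terminal():
        return []
    deepest = max(clade.distance(tip) for tip in clade.get_terminals())
    if deepest <= max_depth:
        n_tips = clade.count_terminals()
        clade.clades = []  # drop the subtree, keep a single representative node
        clade.name = "collapsed_{}_tips".format(n_tips)
        return [(clade, n_tips)]
    collapsed = []
    for child in clade.clades:
        collapsed.extend(collapse_shallow_clades(child, max_depth))
    return collapsed

tree = Phylo.read(StringIO("((A:0.0001,B:0.0002):0.01,(C:0.0001,D:0.0001):0.02);"), "newick")
for clade, n_tips in collapse_shallow_clades(tree.root):
    print(clade.name, "-> radius proportional to", n_tips)
```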