ncov
Different tree topologies when running the same data
Hi, I wonder if someone can help me think through what is going on. I'm trying a (not-too) custom build where I'm forcing inclusion of several (good) sequences. These should all be related. However, in most (but not all) builds of this tree, a subset of these sequences does not cluster with the main group. Instead, they get put into a "non-clade" blob with long branches at the bottom of the tree.
After encountering the weird clustering, I futzed around with it, and after remaking the fasta and metadata file one more time, I got a build like I expected. However, after I copied the entire ncov directory, removed the contents of results/ and auspice/ and ran the same pipeline again, these sequences again get punted to the "non-clade" area. Seems maybe I got lucky with the tree construction that one time?
Anyone have thoughts on this?
Hi @donutbrew -- could you compare the alignments in each of the builds to ensure they're the same? We've found that they can differ a bit, and those differences would then be reflected in the tree.
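One way to make that comparison independent of record order and line wrapping is to compare the alignments by sequence name. This is only a minimal sketch: `read_fasta` and `compare_alignments` are hypothetical helper names, not part of augur, and the file path in the comment is a placeholder.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def compare_alignments(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, List[str]]:
    """Report names unique to each alignment and names whose aligned sequence differs."""
    shared = set(first) & set(second)
    return {
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        "different_sequence": sorted(n for n in shared if first[n] != second[n]),
    }


# Toy in-memory example. For real files (placeholder path), something like:
#   with open("build1/results/aligned.fasta") as handle:
#       first = read_fasta(handle)
build1 = read_fasta(">s1\nACGT\n>s2\nAC-T\n".splitlines())
build2 = read_fasta(">s2\nACGT\n>s3\nACGT\n".splitlines())
report = compare_alignments(build1, build2)
print(report)
```

If all three lists come back empty, the two alignments contain the same sequences, even if a plain diff would flag ordering differences.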
Assuming the alignments are the same, the next step would be to compare the IQ-TREE output (you could get away with using tanglegrams in auspice, but change the metric to divergence). There's some level of stochasticity in topology resolution but I wouldn't expect this much from an identical alignment.
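Besides eyeballing tanglegrams, the topological difference between two runs can be put into a number with the Robinson-Foulds distance (the count of bipartitions found in one tree but not the other). The sketch below is a rough, standard-library-only version that assumes plain Newick strings with unquoted names and identical tip sets; all function names are made up for illustration, and a maintained library such as Biopython or DendroPy would be the safer choice for real work.

```python
import re


def _clades(newick):
    """Return (set of all tip names, list of tip sets, one per internal node)."""
    stack = [[]]          # one list of tip names per currently open parenthesis
    clades, tips, prev = [], set(), None
    for tok in re.findall(r"[(),;]|[^(),;]+", newick):
        if not tok.strip():
            continue      # ignore whitespace between tokens
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            stack[-1].extend(done)
        elif tok not in ",;" and prev != ")":
            # A tip label, e.g. "A:0.1". Labels right after ")" are
            # support values or internal names and are skipped.
            name = tok.rsplit(":", 1)[0].strip()
            stack[-1].append(name)
            tips.add(name)
        prev = tok
    return tips, clades


def splits(newick):
    """Return (tips, non-trivial bipartitions), each stored as the side without a reference tip."""
    tips, clades = _clades(newick)
    ref = min(tips)
    result = set()
    for clade in clades:
        side = frozenset(tips - clade) if ref in clade else clade
        if 2 <= len(side) <= len(tips) - 2:
            result.add(side)
    return tips, result


def rf_distance(newick_a, newick_b):
    """Robinson-Foulds distance between two trees on the same tips."""
    tips_a, splits_a = splits(newick_a)
    tips_b, splits_b = splits(newick_b)
    if tips_a != tips_b:
        raise ValueError("trees must contain the same tips")
    return len(splits_a ^ splits_b)


run_a = "((A:0.1,B:0.1)90:0.05,(C:0.1,D:0.1)70:0.05,E:0.2);"
run_b = "((A:0.1,C:0.1)40:0.05,(B:0.1,D:0.1)35:0.05,E:0.2);"
print(rf_distance(run_a, run_b))
```

A distance of 0 means the two runs agree on every internal branch; larger values mean more branches are resolved differently.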
You bet.
This is a build using 811 sequences. The results/aligned.fasta files are identical (checked with diff), but all the files in results/global differ. That includes the subsampled_alignment and sample-global fasta files, although the differences in chosen sequences look reasonable. The nwk files differ quite a bit at the three outliers. The files in data/ are identical.
In the meantime, I'm constructing a tree with a larger set of sequences to see if that helps.
Here's the tanglegram again, in divergence:
Ok great, we can rule out mafft issues then. The subsampled alignments will differ because of the stochastic subsampling approach, but no re-alignment is done. This does add an extra possibility -- namely that different subsets of the same alignment each result in a different (stable) maximum being found by IQ-TREE. We know there are homoplasic mutations, and I can imagine that the resulting topology depends on which sequences are included.
I think the next step is to rerun the pipeline from the augur tree stage a number of times, i.e. hold the subsampled alignment constant.
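A rough sketch of how such replicate runs could be set up is below. It only builds the command lines; the alignment path, the output directory, and the helper name `augur_tree_commands` are assumptions for illustration, and the exact inputs should be taken from your own Snakemake rules.

```python
import shlex
from pathlib import Path


def augur_tree_commands(alignment, out_dir, n_runs):
    """Build one `augur tree` command per replicate, all on the same alignment."""
    commands = []
    for i in range(1, n_runs + 1):
        # Each replicate writes to its own file so the runs can be compared later.
        output = Path(out_dir) / f"tree_raw_run{i}.nwk"
        commands.append([
            "augur", "tree",
            "--alignment", str(alignment),
            "--output", output.as_posix(),
            "--method", "iqtree",
        ])
    return commands


# Assumed (placeholder) paths; adjust to the files in your build.
commands = augur_tree_commands(
    "results/global/subsampled_alignment.fasta", "replicates", 3
)
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run a replicate, one at a time:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Because every replicate uses the same alignment, any difference between the resulting trees comes from the tree builder rather than from subsampling.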
Good suggestions. I reran the tree and refine steps on the alignment several times, and the trees come out differently every time--about half of the runs create that outgroup. It isn't super apparent in the initial tree; the non-clade blob really appears only after refinement. It does depend on the initial tree, though.
After that, I reran the tree step again, passing the bootstrap parameter to iqtree. The bootstrap values are just low (most branches <50/100). I guess this isn't surprising based on the high similarity of most sequences. Has this been a concern for you guys?
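To summarise how low the support is across a whole tree, the support values can be pulled out of the Newick string and counted against a threshold. This is a minimal sketch that assumes support is written as a plain number directly after each closing parenthesis; combined labels such as "alrt/ufboot" are not handled, and both function names are hypothetical.

```python
import re


def support_values(newick):
    """Extract numeric internal-node labels (assumed to be support values)."""
    return [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)(?=[:,);])", newick)]


def fraction_below(newick, threshold=50.0):
    """Fraction of labelled internal branches with support under the threshold."""
    values = support_values(newick)
    if not values:
        return None  # no support labels found in this tree
    return sum(v < threshold for v in values) / len(values)


tree = "((A:0.1,B:0.1)30:0.1,(C:0.1,D:0.1)90:0.1,E:0.2);"
print(support_values(tree))
print(fraction_below(tree))
```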
Yes -- bootstraps will be low with so little diversity. We're currently looking into the different topologies and ways to improve this... (cc @rneher)
IQ-TREE will spit out different tree topologies on different runs. We have been experimenting a bit with different settings but haven't found a good solution yet. In addition, there can be some stochasticity through the refine step. With more homoplasies and maybe recombination, this problem might be getting worse...
@rneher Could you give an update on the progress of this? I tried bootstrapping with iqtree, raxml, and fasttree. Either it takes a very long time, or the refined tree is broken. For example, below is the broken iqtree tree when bootstrapping:
Leaving this open. We don't currently attempt to handle bootstrap values. I have noticed stochastic errors by IQ-TREE where some branches aren't resolved as they should be in particular daily builds. We should be thinking more broadly about how we're handling uncertainty.