gubbins icon indicating copy to clipboard operation
gubbins copied to clipboard

Gubbins tree scale?

Open cizydorczyk opened this issue 4 years ago • 10 comments

Hello,

I'm having some trouble understanding the tree scale that is output in the newick formatted tree by Gubbins. Specifically, branch lengths in this tree are on the order of 10s for me, yet in trees produced by other software (iq-tree, etc.), branch lengths are always on the order of decimal numbers.

Are the branch lengths reported by Gubbins in # of SNPs? What exactly is being output here? Ultimately I want to compare trees between Gubbins and other software, but this is difficult if the branch lengths are reported in different units.

Any clarification is much appreciated.

Thank you, Conrad

cizydorczyk avatar Jan 22 '20 19:01 cizydorczyk

Hi Conrad,

I am wondering the exact same thing. Did you ever find out the answer?

Thanks,

Megan

mearls999 avatar Feb 28 '20 11:02 mearls999

Hi Megan,

No, I never was able to figure this out.

I use a workaround nowadays, where if I run Gubbins (instead of ClonalFrameML, which I prefer as it doesn't produce such a weird output tree), I use the script found here (https://github.com/kwongj/maskrc-svg) to mask recombinant regions from my original alignment, and re-build the tree myself using IQ-Tree.

The tree from IQ-Tree is then the one I use for any downstream analyses.

Hope that helps, Conrad

cizydorczyk avatar Feb 29 '20 22:02 cizydorczyk

RE: Maximum-Likelihood trees from Iqtree, raxml, fasttree etc The scale is not SNPs, it is substitutions per site. These are (subtly) different. ie. A->T->A is two subs but not a SNP

The inflated scale is probably due to only giving iqtree the variant sites, not the whole genome: https://bitsandbugs.org/2019/11/06/two-easy-ways-to-run-iq-tree-with-the-correct-number-of-constant-sites/

tseemann avatar Feb 29 '20 23:02 tseemann

I have had this problem when feeding Gubbins/IQ-Tree pseudo-whole genome alignments from Snippy; I'm not sure what's going on but I also have not given it as much investigation as I probably should!

cizydorczyk avatar Mar 01 '20 00:03 cizydorczyk

@cizydorczyk if you have an issue post it to https://github.com/tseemann/snippy/issues/new

tseemann avatar Mar 01 '20 20:03 tseemann

Thanks a million, Conrad!

mearls999 avatar Mar 04 '20 19:03 mearls999

Maybe this was answered and I just missed it but what is the scale Gubbins trees are outputted in? They aren't SNPs per site? Is there an option to as it to put the branch lengths in SNPs per site?

peflanag avatar Jun 15 '20 15:06 peflanag

I wanted to know the answer to this question too! As there was no answer here, I did some digging and found that if you take the Gubbins output file which contains the alignment of polymorphic sites ('.filtered_polymorphic_sites.fasta') and infer your own tree with RAxML, the tree that you get (which has branch lengths in subs per site in the input alignment, because that's what RAxML does) is essentially identical to the tree output by Gubbins but the Gubbins tree is scaled by a factor of the alignment length (ie the numebr of polymorphic sites). So, I am pretty sure the Gubbins output tree is scaled by alignment length to give you branch lengths in substitutions, rather than substitutions per site. (Note that the branch lengths are in substitutions, which is not exactly the same as a discrete SNP count as the branch length estimation accounts for the substitution model, reversals, missing data etc.) This means that when you analyse Gubbins trees with e.g. BactDating to get dated trees and estimate mutation rates, the mutation rate is expressed as substitutions per genome per unit time, not subs per site per unit time, so is comparable across different input trees made from alignments of different lengths.

It would be great if Nick or someone could confirm though...

katholt avatar Feb 16 '22 15:02 katholt

Thanks for the question, sorry for the slow response to the original issue - there are three different types of branch length in the output files, so this is certainly something worth clarifying:

  • [prefix].final_tree.tre (and [prefix].tre files from each iteration): these branch lengths are in substitutions per genome, as you suggest @katholt - the relevant code is in tree_scaling.c, where the branch lengths (in substitutions per site) are multiplied by the number of recombination-filtered SNPs
  • [prefix].joint.tre: this is an intermediate file, usually deleted, in which the branch lengths are the integer total number of base substitutions (SNPs and point mutations) reconstructed by the joint ancestral reconstruction algorithm
  • RAxML/IQtree/FastTree/RapidNJ intermediate files - these retain branch lengths in the usual substitutions per site units

Following BactDating, it should be the case that the molecular clock is in substitutions per genome per unit time. I hope that's helpful, I'll add this information to the manual.

nickjcroucher avatar Feb 16 '22 21:02 nickjcroucher

Updated manual in 1777808af1dc9e7fe82402bd3ca45928ca82814b.

nickjcroucher avatar Feb 16 '22 21:02 nickjcroucher