Problem w/ branch len estimate with closely related leaves

Open phiweger opened this issue 4 years ago • 12 comments

In the TreeTime .nexus output I get a huge negative branch length, followed by another large one for the corresponding leaves:

...
((7ab5cd3e-524b-4f3e-9951-04d783bcef78:28113.25709,5abe8078-fdb6-4e90-9075-314bc4238f48:28113.15326)NODE_0000024:-28112.91947,(dd518f5c-d48a-464d-bff3-4becb51ae5d5:0.00000,0e2e43b1-45fc-4a32-b584-d4db7b91e86b:0.00000)
...

Is this a bug or some numerical instability? How could I avoid this?

Thanks a lot!

phiweger avatar Sep 07 '20 12:09 phiweger

Further testing gives me the impression that (1) this does not always occur given the same input and (2) it only occurs when I add the --confidence flag to treetime.

phiweger avatar Sep 08 '20 10:09 phiweger

Is this run on a tree with four leaves with some identical dates/branch lengths? Then it is likely a numerical instability when trying to invert a singular matrix.
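To illustrate the general effect (this is a toy sketch, not TreeTime's actual computation): inverting a matrix that is almost singular produces very large entries of opposite sign, similar in flavor to a large positive branch length paired with a large negative one. The helper `invert_2x2` and the value of `eps` are made up for this example.

```python
def invert_2x2(a, b, c, d):
    """Invert the matrix [[a, b], [c, d]] with the closed-form formula."""
    det = a * d - b * c
    if det == 0.0:
        raise ZeroDivisionError("matrix is exactly singular")
    return (d / det, -b / det, -c / det, a / det)


# Two nearly identical rows: the determinant is tiny but not zero.
eps = 1e-5  # hypothetical small difference between two rows
inv = invert_2x2(1.0, 1.0, 1.0, 1.0 + eps)

# The entries are on the order of 1/eps and alternate in sign.
print(inv)
```

The smaller `eps` gets, the larger the entries become, so tiny rounding differences between runs could plausibly change the result a lot.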

rneher avatar Sep 10 '20 13:09 rneher

Yes, this is a larger tree (20+ leaves) but 3 of them are identical in their SNV alignment, but the dates are different. Is there a way around this instability, besides manually clipping the corresponding branch values to 0? The dates should help resolve polytomies, right?
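For reference, a minimal sketch of the manual workaround I mean, working directly on the Newick string with a regular expression. The function names and the shortened leaf labels (A, B, C, D) are made up for this example, and it assumes the tree string carries no bracketed annotations. Note that clipping the internal branch alone does not fix the two large leaf branches that compensate for it, so the flagging part is probably more useful than the clipping part.

```python
import re

# An optional node label, then ":" and a negative number.
NEG_BRANCH = re.compile(r"([^(),:;\s]*):(-\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def find_negative_branches(newick):
    """Return (label, length) pairs for every negative branch length."""
    return [(label, float(length)) for label, length in NEG_BRANCH.findall(newick)]


def clip_negative_branches(newick):
    """Replace every negative branch length with 0.00000."""
    return NEG_BRANCH.sub(lambda m: f"{m.group(1)}:0.00000", newick)


# Shortened version of the snippet from the first post (hypothetical labels).
snippet = "((A:28113.25709,B:28113.15326)NODE_0000024:-28112.91947,(C:0.00000,D:0.00000))"

flagged = find_negative_branches(snippet)
clipped = clip_negative_branches(snippet)
print(flagged)
print(clipped)
```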

phiweger avatar Sep 10 '20 20:09 phiweger

Could you send me these data? I can't quite explain why this might happen, and it would be good to fix.

rneher avatar Sep 22 '20 19:09 rneher

which data do you need? the alignment, dates, undated tree -- anything else?

phiweger avatar Sep 23 '20 13:09 phiweger

yes, those are what I would need.

rneher avatar Oct 07 '20 11:10 rneher

I think I might be having a similar error (if not I can open a new issue). When estimating date confidences using the marginal likelihood, some nodes will sporadically have very large intervals:

image

Rather than having intervals in the range of 100s of years, these nodes have confidence intervals of +100,000 years. These large intervals are somewhat random, in that rerunning the analyses moves them around. Any thoughts on why this might be occurring and if there's a solution?

ktmeaton avatar Apr 02 '21 17:04 ktmeaton

yes, this looks like there is a problem. My hunch is that there is some numerical accuracy problem.

rneher avatar Apr 11 '21 19:04 rneher

I was thinking numerical accuracy too. This is a large phylogeny with many small branches (1e-8). Would there be any value in rescaling the branch lengths beforehand (e.g., multiplying them all by 1e4)?

ktmeaton avatar Apr 13 '21 14:04 ktmeaton

I suppose this is a large genome? Does this use a SNP-only alignment? Or a VCF file? TreeTime carries around an internal scale that is one_mutation = 1/L (L being the length of the genome). One could just try to trick it into assuming the genome is shorter. But I am not sure I understand your application well enough.
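As a rough back-of-the-envelope sketch of what that scale means (the genome length below is a hypothetical value, and the function names are just for illustration, not part of TreeTime):

```python
def one_mutation(genome_length):
    """Branch length (substitutions per site) that corresponds to one mutation."""
    return 1.0 / genome_length


def branch_length_in_mutations(branch_length, genome_length):
    """Convert a branch length in substitutions per site to expected mutations."""
    return branch_length * genome_length


L = 5_000_000   # hypothetical genome length, not taken from the data above
branch = 1e-8   # the small branch length mentioned in the previous comment

scale = one_mutation(L)
n_mut = branch_length_in_mutations(branch, L)
print(scale, n_mut)
```

With these assumed numbers, a branch of 1e-8 is a small fraction of a single mutation, which is the regime where rounding could start to matter.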

rneher avatar Apr 14 '21 09:04 rneher

I am having the same (or a similar) issue on a SARS-CoV-2 dataset with roughly 5000 sequences using the flags below; however, it also occurs without the covariation or branch-length-mode flags:

--tree ml_clean.nwk --dates clean_metadata.tsv --aln aln_clean.fasta --clock-filter 4 --reroot EPI_ISL_402125 --covariation --coalescent skyline --clock-rate 0.001 --clock-std-dev 0.0005 --branch-length-mode joint --confidence --keep-polytomies

I'm using a full alignment. The problem is random and rerunning on the same dataset can generate reasonable confidence intervals, but it happens often enough that it is an issue. Using TreeTime v. 0.80 on Python v3.9. I've attached the treetime output as well as the ML tree and a list of accession numbers (can't share alignment because GISAID data).

for_github.zip

m-a-martin avatar Jun 22 '21 12:06 m-a-martin

Sorry, just started to pick this up again. All the numbers in the dates.tsv file look sensible, and these should be the same as in the graph -- with the exception of those labeled as problematic branches, which are masked in the dates.tsv but not in the graph. My hunch is that these long bars are essentially undefined confidence intervals of branches that don't follow the clock closely enough for us to rely on this estimation. I'll add a line to exclude these from the graph.

rneher avatar Sep 28 '21 14:09 rneher