
Bootstrapping Mash

Open kloetzl opened this issue 5 years ago • 0 comments

Alignment-free methods are commonly criticized for lacking support for bootstrapping. Indeed, so far there have been few papers on computing support values without an MSA (1, 2). However, I think that Mash has the potential to implement true bootstrapping for more confident estimations. Quoting your 2016 paper:

Because S(A∪B) is a random sample of A∪B, the fraction of elements in S(A∪B) that are shared by both S(A) and S(B) is an unbiased estimate of J(A,B).

If one chose a different random sample, one would get a different but hopefully similar estimate. These bootstrapped distances would lead to bootstrapped distance matrices and, in turn, bootstrapped phylogenies. Given a number of them, one could compute the consensus tree and support values for each branch.
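To make the last step concrete, here is a minimal sketch of how support values could be counted once the replicate trees exist. It assumes each tree has already been reduced to its set of clades (sets of taxon names); the function names and this tree representation are hypothetical and not part of Mash. A clade and its complement describe the same branch of an unrooted tree, so both are normalized to one canonical side before counting.

```python
from collections import Counter

def canonical_split(clade, taxa):
    """Return the side of the bipartition that excludes the anchor taxon."""
    clade = frozenset(clade)
    anchor = min(taxa)  # arbitrary but fixed reference taxon
    return frozenset(taxa) - clade if anchor in clade else clade

def branch_support(reference_clades, replicate_trees, taxa):
    """Fraction of replicate trees that contain each reference branch."""
    counts = Counter()
    for tree in replicate_trees:
        # Deduplicate within a tree so each branch counts once per replicate.
        for split in {canonical_split(c, taxa) for c in tree}:
            counts[split] += 1
    n = len(replicate_trees)
    return {
        frozenset(c): counts[canonical_split(c, taxa)] / n
        for c in reference_clades
    }
```

For example, with taxa A to E, a replicate that lists the clade {C, D, E} counts towards the branch {A, B}, since both describe the same split.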

Since the sample depends mainly on the hash values, getting a different sample should be as easy as using a different seed value in the hash function (an old attempt of mine). Unfortunately, the seed parameter of MurmurHash does not contribute enough to the hash value to yield a completely new sample. One could instead switch to SipHash, which would not only be slower but would also bring a string of further consequences.
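A rough sketch of the idea, not Mash's actual code: each seed defines a fresh hash function, hence a fresh bottom-s sketch and a fresh Jaccard estimate. As an assumption for illustration, the seed is fed into BLAKE2b's `salt` parameter from Python's `hashlib` as a stand-in for a well-mixed seeded hash; Mash itself uses MurmurHash, and all function names here are hypothetical. The distance formula is the one from the Mash paper, D = -1/k · ln(2j / (1 + j)).

```python
import hashlib
import math

def kmers(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer, seed):
    """64-bit hash of a k-mer; the seed goes in as the BLAKE2b salt (stand-in)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

def sketch(kmer_set, s, seed):
    """Bottom-s sketch: the s smallest hash values under this seed."""
    return set(sorted(hash_kmer(x, seed) for x in kmer_set)[:s])

def jaccard_estimate(sk_a, sk_b, s):
    """Fraction of S(A∪B) that is present in both S(A) and S(B)."""
    merged = sorted(sk_a | sk_b)[:s]
    shared = sum(1 for h in merged if h in sk_a and h in sk_b)
    return shared / len(merged)

def mash_distance(j, k):
    """Mash distance from a Jaccard estimate; j == 0 is mapped to 1.0 here."""
    return 1.0 if j == 0 else -math.log(2 * j / (1 + j)) / k

def replicate_estimates(seq_a, seq_b, k=8, s=200, seeds=range(20)):
    """One Jaccard estimate per seed, i.e. one per resampled sketch."""
    ka, kb = kmers(seq_a, k), kmers(seq_b, k)
    return [
        jaccard_estimate(sketch(ka, s, seed), sketch(kb, s, seed), s)
        for seed in seeds
    ]
```

Applying `mash_distance` to each replicate estimate for every pair of genomes would give one distance matrix per seed. Whether such re-seeded estimates behave like a proper bootstrap is exactly the open question of this issue.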

So, yeah; I think bootstrapping would be a cool feature that could provide a big benefit (support values) to resulting phylogenies.

kloetzl, Mar 20 '19 11:03