Mash
Mash dist is binary!!!
Hi!
I have a big problem with the latest Mash version (2.2). When I calculate the distance matrix between 2 genomes, it returns a binary matrix (p-value is either 0 or 1). I tried with a fake genome (10k bp), and it works (I get float values). But with real bacterial genomes, it returns a binary matrix.
I attach the files I used for these tests. Here are the results I get with mash dist genome1.txt genome2.txt, and so on:
genome1 and genome2: I get a distance of 0.0379382
genome3 and genome4: I get a distance of 1, whereas I should have 0.295981
Why do you say "whereas I should have 0.295981"?
Perhaps try FastANI? https://github.com/ParBLiSS/FastANI
% seqkit stat genome?.txt
file         format  type  num_seqs    sum_len    min_len    avg_len    max_len
genome1.txt  FASTA   DNA          1      5,747      5,747      5,747      5,747
genome2.txt  FASTA   DNA          4     10,722          6    2,680.5      5,608
genome3.txt  FASTA   DNA          1  1,587,120  1,587,120  1,587,120  1,587,120
genome4.txt  FASTA   DNA         78  2,997,537        536     38,430    666,660
% mash triangle genome?.txt
4
genome1.txt
genome2.txt 0.0379382
genome3.txt 1 1
genome4.txt 1 1 1
Max p-value: 1
Indeed, I forgot to mention that. 0.295981 is the distance I obtained before updating Mash to the latest version. So maybe the "real distance" is not exactly that, but it should be close.
So, when you run mash triangle (or mash sketch for the whole matrix), you get the same result as I do: a distance between 2 small genomes, but 1 if at least one of the genomes is "big". With my previous version, I got distances for all pairs. Do you know why it doesn't work anymore?
Thanks!
Which version gave you the result you wanted? Are you using 2.2 or 2.2.1 for the "bad" result?
I was using version 1.1.1 to get the results with "float numbers".
I updated to version 2.2 (the latest release), with which I get those binary distances.
FYI 2.2.2 is the latest tag, but not a 'release' per se. Nothing related to your issue has changed, though. But it is faster at getting the possibly wrong answer :) https://github.com/marbl/Mash/tags
OK. So there is no plan to fix this bug?
@aperrin I am not the author of this package. I'm just a global GitHub citizen. Ask @ondovb. I'm not even convinced it's a bug. The v1 behaviour may be the bug. Your genomes are totally different in size. They should not have a good Mash distance, in my opinion!
Ok! But I tried with 2 "big" genomes of the same size, and it does not work either. Whereas with 2 "small" genomes of the same size it works. Anyway, thanks a lot for your help!
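For context on how a distance of exactly 1 can arise: the Mash distance is computed from the Jaccard estimate j of the two sketches as D = -(1/k) ln(2j / (1 + j)), and a pair of sketches with no hashes in common is reported as 1. Below is a minimal Python sketch of that formula. The k-mer size of 21 and sketch size of 1000 are assumed defaults and should be checked against the options actually used, and `mash_distance` is a hypothetical helper written for illustration, not part of Mash.

```python
import math


def mash_distance(shared_hashes: int, sketch_size: int = 1000, kmer_size: int = 21) -> float:
    """Mash distance from the number of hashes shared between two sketches.

    sketch_size=1000 and kmer_size=21 are assumed defaults, not values
    confirmed in this thread.
    """
    if sketch_size <= 0 or kmer_size <= 0:
        raise ValueError("sketch_size and kmer_size must be positive")
    if not 0 <= shared_hashes <= sketch_size:
        raise ValueError("shared_hashes must be between 0 and sketch_size")

    jaccard = shared_hashes / sketch_size
    if jaccard == 0:
        # No shared hashes: the distance saturates at 1.
        return 1.0
    # D = -(1/k) * ln(2j / (1 + j)); clamp to avoid returning -0.0 when j == 1.
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / kmer_size)


if __name__ == "__main__":
    for shared in (0, 1, 10, 100, 1000):
        print(f"{shared:>4}/1000 shared hashes -> distance {mash_distance(shared):.6f}")
```

Under these assumptions, a single shared hash out of 1000 gives about 0.29598, which matches the 0.295981 reported for genome3 and genome4 with the older version. That pair would therefore sit right at the limit of what a 1000-hash sketch can resolve, where losing the one shared hash turns the distance into 1. If that is what is happening here, a larger sketch size (`-s`) or a smaller k-mer size (`-k`) may give intermediate distances again, though this does not by itself explain the difference between versions.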