
Mash dist is binary!!!

Open aperrin opened this issue 4 years ago • 8 comments

Hi!

I have a big problem with the latest Mash version (2.2). When I compute the distance matrix between two genomes, it returns a binary matrix (the p-value is either 0 or 1). I tried with a fake genome (10 kbp) and it works (I get float values), but with real bacterial genomes it returns a binary matrix.

I have attached the files I used for these tests. Here are the results I get with mash dist genome1.txt genome2.txt and so on:

genome1 and genome2: I get a distance of 0.0379382
genome3 and genome4: I get a distance of 1, whereas I should have 0.295981

genome1.txt genome2.txt genome3.txt genome4.txt

aperrin avatar Oct 08 '19 09:10 aperrin

Why do you say "whereas I should have 0.295981" ? Perhaps try fastANI? https://github.com/ParBLiSS/FastANI

% seqkit stat genome?.txt
file         format  type  num_seqs    sum_len    min_len    avg_len    max_len
genome1.txt  FASTA   DNA          1      5,747      5,747      5,747      5,747
genome2.txt  FASTA   DNA          4     10,722          6    2,680.5      5,608
genome3.txt  FASTA   DNA          1  1,587,120  1,587,120  1,587,120  1,587,120
genome4.txt  FASTA   DNA         78  2,997,537        536     38,430    666,660

% mash triangle genome?.txt
	4
genome1.txt
genome2.txt	0.0379382
genome3.txt	1	1
genome4.txt	1	1	1
Max p-value: 1

tseemann avatar Oct 15 '19 06:10 tseemann

Indeed, I forgot to mention this. 0.295981 is the distance I obtained before updating Mash to the latest version. So the "real distance" may not be exactly that, but it should be close.

So, when you try with mash triangle (or mash sketch for the whole matrix), you get the same result as me: a distance between two small genomes, but 1 if at least one of the genomes is "big". With my previous version, I got distances for all pairs. Do you know why it doesn't work anymore?

Thanks!

aperrin avatar Oct 16 '19 06:10 aperrin

What version did you get the result you wanted on? Are you using 2.2 or 2.2.1 for the "bad" result?

tseemann avatar Oct 16 '19 21:10 tseemann

I was using version 1.1.1 to get the results with "float numbers"

I updated to version 2.2 (the latest release), with which I get those binary distances.

aperrin avatar Oct 28 '19 10:10 aperrin

FYI 2.2.2 is the latest tag, but not a 'release' per se. Nothing related to your issue has changed though. But it is faster at getting the possibly wrong answer :) https://github.com/marbl/Mash/tags .

tseemann avatar Oct 29 '19 03:10 tseemann

OK. So there are no plans to fix this bug?

aperrin avatar Oct 29 '19 10:10 aperrin

@aperrin I am not the author of this package. I'm just a global GitHub citizen. Ask @ondovb. I'm not even convinced it's a bug. The v1 behaviour may be the bug. Your genomes are totally different in size. They should not have a good mash distance in my opinion!

tseemann avatar Oct 29 '19 22:10 tseemann

OK! But I tried with two "big" genomes of the same size, and it does not work either, whereas with two "small" genomes of the same size it works. Anyway, thanks a lot for your help!

aperrin avatar Nov 04 '19 08:11 aperrin
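
For context on the numbers in this thread, here is a minimal sketch of how a Mash distance follows from the number of hashes two sketches share, using the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard estimate. It assumes the default settings (k-mer size 21, sketch size 1000), which the thread does not state explicitly; the function name `mash_distance` is made up for this illustration and is not part of Mash.

```python
import math


def mash_distance(shared: int, sketch_size: int = 1000, k: int = 21) -> float:
    """Mash distance from the number of shared sketch hashes.

    Assumes default Mash parameters (k=21, sketch size 1000) unless
    overridden. Hypothetical helper, not a Mash API.
    """
    if not 0 <= shared <= sketch_size:
        raise ValueError("shared must be between 0 and sketch_size")
    if shared == 0:
        # No shared hashes: the log is undefined, and Mash reports distance 1.
        return 1.0
    j = shared / sketch_size  # Jaccard estimate
    # max() only normalises the -0.0 that arises when j == 1.
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    # Largest distance below 1 that a 1000-hash sketch can express:
    print(round(mash_distance(1), 6))  # 0.295981
    # No shared hashes at all:
    print(mash_distance(0))            # 1.0
```

Under these assumptions, 0.295981 is exactly the value for 1 shared hash out of 1000, so the old result for genome3 and genome4 was already at the resolution limit of the sketch; sharing 0 hashes instead of 1 flips the reported distance to 1. The value 0.0379382 for genome1 and genome2 is consistent with 291 shared hashes. A larger sketch (the `-s` option of `mash sketch`) lowers this floor and may resolve such distant pairs, though that was not tested in this thread.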