dashing2
Differences in mash distances between dashing2 and dashing1/mash
I am puzzled by the (for us) large differences in mash distances between dashing1 and dashing2 for some genome pairs. In the example below, dashing1 and mash give values around 0.25, which reflects that these genomes are not related.
However, dashing2 reports a much lower distance of 0.11, which strongly suggests a relationship that is not there.
Is there a way to reproduce dashing1 results with dashing2?
Below are the steps to reproduce with the genomes from NCBI:
genomeA https://www.ncbi.nlm.nih.gov/nuccore/KC139520.1?report=fasta
genomeB https://www.ncbi.nlm.nih.gov/nuccore/MZ375344.1?report=fasta
dashing sketch -k 15 -S16 -p 1 -F list_of_genomes
dashing cmp -p 1 -M -W -k15 -S16 -F list_of_genomes -Q list_of_genomes -T -O result.tab -o result.labels
cat result.tab
genomeA.fasta -0 0.252614
genomeB.fasta 0.252614 -0
dashing2 sketch -p 1 -k 15 -S 65536 -F list_of_genomes
dashing2 cmp -p 1 --cmpout result_D2.out -F list_of_genomes -Q list_of_genomes --mash-distance
cat result_D2.out
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:32;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 0.115187205
genomeB.fasta 0.115187205 -0
Hi,
Thanks both for the bug report and making it easy to reproduce. We'll get this looked into and patched up in the next day or so.
Best,
Daniel
Hi - I've found the issue.
In the last line for Dashing2, you need to specify the k-mer length and sketch size again. Dashing2 was defaulting to 31, while Dashing1 was using k = 15.
> dashing2 cmp -p 1 --cmpout /dev/stdout -F list_of_genomes -Q list_of_genomes --mash-distance -k 15
#Calling Dashing2 version v2.1.11-4-gd23a with command '/Users/dnb13/Desktop/code/dashing2/dashing2 cmp -p 1 --cmpout /dev/stdout -F list_of_genomes -Q list_of_genomes --mash-distance -k 15'
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources KC139520.fasta MZ375344.fasta KC139520.fasta MZ375344.fasta
KC139520.fasta -0 0.26303053
MZ375344.fasta 0.26303053 -0
(base) Daniels-MacBook-Pro:dashing2 dnb13$ dashing cmp -p 1 -k15 -S16 -F list_of_genomes -Q list_of_genomes -T -O /dev/stdout -o result.labels -M
Dashing version: v1.0-5-g2afa
KC139520.fasta -0 0.252614
MZ375344.fasta 0.252614 -0
So, to make the above work, you would need to modify the last Dashing2 call to make sure it's using the same -S parameter and the same -k parameter.
Could you give that a try?
Thanks,
Daniel
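For context on why a k mismatch alone can shift the result this much: in the standard Mash definition, the distance is D = -(1/k) * ln(2J / (1 + J)), where J is the Jaccard index, so k enters directly as a scaling factor. The following is a minimal sketch of that formula, assuming Dashing2's --mash-distance follows the same definition (this thread does not confirm it). The helper name mash_dist and the Jaccard value are illustrative only; they are not taken from the runs above.

```shell
#!/bin/sh
# mash_dist J K: Mash distance for Jaccard index J and k-mer length K,
# using D = -(1/k) * ln(2J / (1 + J)).
# Hypothetical helper for illustration; not part of Dashing or Mash.
mash_dist() {
  awk -v j="$1" -v k="$2" 'BEGIN { printf "%.6f\n", -log(2 * j / (1 + j)) / k }'
}

# Same Jaccard index, two k values: 15 (used with Dashing1 above) and
# 32 (the k shown in the Dashing2Options header of the first Dashing2 run).
# J = 1/3 is chosen only because 2J / (1 + J) is then 0.5.
mash_dist 0.3333333333 15
mash_dist 0.3333333333 32
```

In practice the Jaccard index itself also changes with k, so this shows only one part of the effect: distances computed with different k values are not directly comparable.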
Unfortunately, it did not work. When I pass only "-k 15" in the second Dashing2 call, I get sensible values, but I can see that it created new sketches with the default sketch size of 1024.
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 0.26303053
genomeB.fasta 0.26303053 -0
If I use both parameters in the second call ("-k 15 -S 65536"), I get inf values:
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:65536;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 inf
genomeB.fasta inf -0
Hi again,
It's been a long time, but I was finally able to track that down. That was a bug introduced when trying to add M1 support properly.
I've patched it up in https://github.com/dnbaker/dashing2/pull/72, and updated binaries are at https://github.com/dnbaker/dashing2-binaries/tree/main/linux/v2.1.14.
Happy to reopen as needed, but I'm closing for now as it was a lot of work to track down.
Thanks again!
Daniel
Thank you for looking into this. Unfortunately, the toy example with "-k 15 -S 65536" still returns inf values.
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:65536;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 inf
genomeB.fasta inf -0
Testing different sketch sizes, inf values seem to show up once the sketch size exceeds 4630.
-S4630 works
dashing2 cmp -p 1 -F list_of_genomes -Q list_of_genomes --mash-distance -k 15 -S 4630
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:4630;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 0.26334497
genomeB.fasta 0.26334497 -0
-S4631 does not work anymore
dashing2 cmp -p 1 -F list_of_genomes -Q list_of_genomes --mash-distance -k 15 -S 4631
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:15;parsebyfile;trimchr;sketchsize:4631;sketchtype:onepermsetsketch;Fastx;canon
#Sources genomeA.fasta genomeB.fasta genomeA.fasta genomeB.fasta
genomeA.fasta -0 inf
genomeB.fasta inf -0
It happens both with the binary and the compiled version. /T
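The manual search for the threshold above could be automated with a small bisection. This is only a sketch: the function names gives_inf and first_inf_size are hypothetical, the dashing2 flags are copied from the commands in this thread, and it assumes that every sketch size above the threshold fails, which has not been established here.

```shell
#!/bin/sh
# gives_inf S: succeed if the distance matrix for sketch size S contains "inf".
# Flags are the ones used earlier in this thread; '#' header lines are skipped.
gives_inf() {
  dashing2 cmp -p 1 -F list_of_genomes -Q list_of_genomes \
    --mash-distance -k 15 -S "$1" --cmpout /dev/stdout \
    | grep -v '^#' | grep -qw 'inf'
}

# first_inf_size GOOD BAD: bisect between a sketch size known to work (GOOD)
# and one known to give inf (BAD); prints the smallest failing size found.
# Assumption: failures are monotonic in sketch size.
first_inf_size() {
  lo=$1
  hi=$2
  while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if gives_inf "$mid"; then hi=$mid; else lo=$mid; fi
  done
  echo "$hi"
}

# Usage (runs dashing2 about 16 times for this range):
#   first_inf_size 1024 65536
```

The threshold probably depends on the input genomes, so the number found this way applies only to this pair.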
Thanks for checking! It seems that for large sketches/small inputs, the default one-permutation setsketch approach is running into empty registers and something is going wrong when accounting for it.
You can add --full-setsketch to the comparison for now to use a slower sketching method that won't have this issue. I'll think about how to handle this issue in the long term afterwards.
Best,
Daniel
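One way to apply this workaround only when it is needed is to check the output for inf and rerun with --full-setsketch. This is a minimal sketch under assumptions: run_cmp and cmp_with_fallback are hypothetical wrapper names, the flags are those quoted in this thread, and whether --full-setsketch removes the inf values for a given input rests on the suggestion above rather than on a run shown here.

```shell
#!/bin/sh
# run_cmp [extra flags...]: the cmp call from this thread, with any extra
# flags appended. Hypothetical wrapper; adjust the file names, -k and -S as needed.
run_cmp() {
  dashing2 cmp -p 1 -F list_of_genomes -Q list_of_genomes \
    --mash-distance -k 15 -S 65536 --cmpout /dev/stdout "$@"
}

# cmp_with_fallback: run with the default sketch type first; if any distance
# in the matrix (lines not starting with '#') is "inf", rerun the comparison
# with --full-setsketch and print that result instead.
cmp_with_fallback() {
  out=$(run_cmp)
  if printf '%s\n' "$out" | grep -v '^#' | grep -qw 'inf'; then
    echo "inf distances found; retrying with --full-setsketch" >&2
    out=$(run_cmp --full-setsketch)
  fi
  printf '%s\n' "$out"
}

# Usage:
#   cmp_with_fallback > result_D2.out
```

Since the full setsketch is described as slower, falling back only on inf keeps the default path for inputs where it works.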