sourmash
sourmash copied to clipboard
A potential bug? `search` not reporting matches w/identical md5s
Hi @ctb,
I built a test database where I duplicated a genome but given two different names 48246288 and 482462:
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/e1215ff0aa1e9c23122889d108385737.sig.gz,e1215ff0aa1e9c23122889d108385737,e1215ff0,31,DNA,0,100,79324,1,48246288,../all_genomes/482462.fna
signatures/e1215ff0aa1e9c23122889d108385737.sig.gz_0,e1215ff0aa1e9c23122889d108385737,e1215ff0,31,DNA,0,100,79324,1,482462,../all_genomes/482462.fna
When I ran
sourmash search signatures/e1215ff0aa1e9c23122889d108385737.sig.gz_0 test_k31_scale100.zip --ignore-abundance --containment --estimate-ani-ci -t 0 -n 0 -o test.csv
to find the most similar genomes for 482462, I expect it would return both 48246288 and 482462but only 48246288:
$ cat test.csv | cut -d',' -f 4
name
48246288
2877941
3003260
1336794
2511995
2653858
749
716541
1100841
However, when I ran
sourmash search signatures/e1215ff0aa1e9c23122889d108385737.sig.gz test_k31_scale100.zip --ignore-abundance --containment --estimate-ani-ci -t 0 -n 0 -o test1.csv
for 48246288 , it also returns 48246288 only:
$ cat test1.csv | cut -d',' -f 4
name
48246288
2877941
3003260
1336794
2511995
2653858
749
716541
1100841
Is it a bug? Or it is normal because these two genomes are the same and haves the same MD5 hash, so only one is reported?
I'll look into it! It should be reporting both matches. Thank you for providing so much info!
Also found in #3284 by @agombolay!
This is caused by the following code, which intentionally removes duplicate sketches from consideration:
https://github.com/sourmash-bio/sourmash/blob/bc2297050b6db10b144916700b4550ec50b26a8f/src/sourmash/search.py#L685-L691
As I wrote in #3284,
...
searchis collapsing identical sketches and only reporting one result per distinct sketch while multisearch is reporting one result per matching database entry. The latter behavior is IMO better because it reports all matching names, while the former behavior will let you know that there's at least one match, but not report all of the matches.
Note that the new multisearch in the branchwater plugin is much faster and returns the correct set of results.
This behavior exists throughout search - also in the cos similarity code:
https://github.com/sourmash-bio/sourmash/blob/bc2297050b6db10b144916700b4550ec50b26a8f/src/sourmash/search.py#L735-L739
If we want to fix it, the quickest fix would be to supply an additional flag to search that says not to collapse identical sketches.
An alternative (or addition) would be to follow through on some parts of the refactoring suggested in #2002
Thoughts? My sense is that multisearch in the branchwater plugin provides an adequate solution to the search problem, but it's easy to add the flag if that is of interest!