sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

A potential bug? `search` not reporting matches w/identical md5s

Open chunyuma opened this issue 2 years ago • 2 comments
trafficstars

Hi @ctb,

I built a test database where I duplicated a genome but given two different names 48246288 and 482462:

internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/e1215ff0aa1e9c23122889d108385737.sig.gz,e1215ff0aa1e9c23122889d108385737,e1215ff0,31,DNA,0,100,79324,1,48246288,../all_genomes/482462.fna
signatures/e1215ff0aa1e9c23122889d108385737.sig.gz_0,e1215ff0aa1e9c23122889d108385737,e1215ff0,31,DNA,0,100,79324,1,482462,../all_genomes/482462.fna

When I ran

sourmash search signatures/e1215ff0aa1e9c23122889d108385737.sig.gz_0 test_k31_scale100.zip --ignore-abundance --containment --estimate-ani-ci -t 0 -n 0 -o test.csv

to find the most similar genomes for 482462, I expect it would return both 48246288 and 482462but only 48246288:

$ cat test.csv | cut -d',' -f 4
name
48246288
2877941
3003260
1336794
2511995
2653858
749
716541
1100841

However, when I ran

sourmash search signatures/e1215ff0aa1e9c23122889d108385737.sig.gz test_k31_scale100.zip --ignore-abundance --containment --estimate-ani-ci -t 0 -n 0 -o test1.csv 

for 48246288 , it also returns 48246288 only:

$  cat test1.csv | cut -d',' -f 4
name
48246288
2877941
3003260
1336794
2511995
2653858
749
716541
1100841

Is it a bug? Or it is normal because these two genomes are the same and haves the same MD5 hash, so only one is reported?

chunyuma avatar Oct 09 '23 19:10 chunyuma

I'll look into it! It should be reporting both matches. Thank you for providing so much info!

ctb avatar Oct 10 '23 11:10 ctb

Also found in #3284 by @agombolay!

This is caused by the following code, which intentionally removes duplicate sketches from consideration:

https://github.com/sourmash-bio/sourmash/blob/bc2297050b6db10b144916700b4550ec50b26a8f/src/sourmash/search.py#L685-L691

As I wrote in #3284,

... search is collapsing identical sketches and only reporting one result per distinct sketch while multisearch is reporting one result per matching database entry. The latter behavior is IMO better because it reports all matching names, while the former behavior will let you know that there's at least one match, but not report all of the matches.

Note that the new multisearch in the branchwater plugin is much faster and returns the correct set of results.


This behavior exists throughout search - also in the cos similarity code:

https://github.com/sourmash-bio/sourmash/blob/bc2297050b6db10b144916700b4550ec50b26a8f/src/sourmash/search.py#L735-L739


If we want to fix it, the quickest fix would be to supply an additional flag to search that says not to collapse identical sketches.

An alternative (or addition) would be to follow through on some parts of the refactoring suggested in #2002

Thoughts? My sense is that multisearch in the branchwater plugin provides an adequate solution to the search problem, but it's easy to add the flag if that is of interest!

ctb avatar Aug 16 '24 13:08 ctb