Searching a database for any duplicate genomes

Open SAMtoBAM opened this issue 1 year ago • 3 comments

Hi there

I have received genomes from numerous sources. Some were previously public and some were not, but I don't know which. So I have a set of genomes and want to see whether any of them are identical to genomes in the larger, complete public set.

First, do you think sourmash would be a suitable and fast option to determine this? Second, would there be appropriately stringent k-mer and scaled options for building the signature databases?

Thanks a lot

SAMtoBAM, Mar 21 '24 19:03

> First, do you think sourmash would be a suitable and fast option to determine this?

Yes, I think so. Using sourmash you could find genomes that are 99.9% identical (or so) to things that are in a database.

> Second, would there be appropriately stringent k-mer and scaled options for building the signature databases?

The default options (scaled=1000, k=31) should be good for identifying candidates up to about 99.9% ANI similarity, and you would be able to use our public databases (GTDB or NCBI) with those parameters. You might need to think about how to investigate further after you find near-identical matches, though; if you want to identify only perfect matches, you should do post-processing of the sourmash results.
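One way to sketch that post-processing step in Python is shown below. It assumes the search results were saved as a CSV with `similarity` and `name` columns; that column layout is an assumption here, so check the header of your own output file first. The function name `exact_sketch_matches` is hypothetical.

```python
import csv


def exact_sketch_matches(csv_lines, threshold=1.0):
    """Return names of matches whose reported similarity is >= threshold.

    csv_lines: any iterable of CSV text lines (an open file works).
    Assumes 'similarity' and 'name' columns exist (an assumption about
    the results file, not something guaranteed by this sketch).
    """
    reader = csv.DictReader(csv_lines)
    return [row["name"] for row in reader if float(row["similarity"]) >= threshold]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python filter_matches.py results.csv
    with open(sys.argv[1], newline="") as fh:
        for name in exact_sketch_matches(fh):
            print(name)
```

A similarity of 1.0 here only says the two sketches agree at the chosen k and scaled values, so these are candidates to confirm by comparing the sequences directly.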

Let me know if that doesn't make sense or you have more questions ;)

ctb, Mar 21 '24 19:03

Thanks for the quick response. Would increasing the k-mer size to 51 or above help, considering I just want identical matches? I would create my own smaller signature database for the public genomes, so I could modify the k-mer size there too.

SAMtoBAM, Mar 21 '24 19:03

oh, yes! then k=51, and/or lower scaled values (scaled=100, for example), would ensure perfect identity.

If only exact matches are needed, you can compare the md5sum of the signatures directly to find matches, without needing to do the search. There are two ways to get the md5 values: run sourmash sig describe <sketchfiles>, or sketch everything to a zip file with sourmash sketch dna -p k=51,scaled=100 -o out.zip *.fa and then run sourmash sig manifest out.zip -o out.mf.csv. Either way, you'll find md5 entries that will be the same if two sketches are the same.
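As a rough sketch of the md5 comparison, the Python below groups manifest rows by md5 and reports any md5 shared by more than one sketch. It assumes the manifest CSV has `md5` and `name` columns and may start with a `#` comment line; verify that against your own out.mf.csv. The function name `find_identical_sketches` is hypothetical.

```python
import csv
from collections import defaultdict


def find_identical_sketches(manifest_lines):
    """Group sketch names by md5 and keep only md5s seen more than once.

    manifest_lines: any iterable of text lines from the manifest CSV.
    Assumes 'md5' and 'name' columns (an assumption about the layout).
    """
    # Drop leading comment lines (e.g. a version line starting with '#').
    data_lines = (line for line in manifest_lines if not line.startswith("#"))
    by_md5 = defaultdict(list)
    for row in csv.DictReader(data_lines):
        by_md5[row["md5"]].append(row["name"])
    return {md5: names for md5, names in by_md5.items() if len(names) > 1}


if __name__ == "__main__":
    import sys

    # Usage: python find_dups.py out.mf.csv
    with open(sys.argv[1], newline="") as fh:
        for md5, names in find_identical_sketches(fh).items():
            print(md5, *names, sep="\t")
```

To compare your own genomes against the public set, sketch both sets with the same parameters and put them in one manifest, or run this on each manifest and intersect the md5 values.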

HTH!

ctb, Mar 22 '24 03:03