sourmash
Access to full GenBank eukaryotic database
Hi @ctb,
Thank you for developing sourmash! I’m working on contamination screening for de novo assemblies and exploring various approaches. The k=51, scaled=10,000 database you provided (#3504) performs impressively: 7,000 scaffold signatures screened against all eukaryotes in ~20 minutes with the multisearch script from branchwater.
However, I’m concerned that k=51 with a high scaled value may miss some assembly contigs that are not in GenBank and could lack exact k-mer matches. I’m currently building signatures for all complete and chromosome-level GenBank assemblies, but I’d like to explore the full GenBank eukaryotic database.
Could you share:
- The size of the full k=21/31/51, scaled=1,000 databases?
- Is access to these databases possible?
- Any recommendations for contamination screening of de novo genomes? I’m considering lowering k (e.g., 21 or lower) and scaled (e.g., 1000 or 100) values for higher resolution.
Thank you for your time and insights!
hi @zilov apologies for the long delay in responding -
The size of the full k=21/31/51, scaled=1000 databases is 206 GB. Here's the directory listing:
```
-r--r--r-- 1 ctbrown datalabgrp  48864550095 Jan 20 13:38 bilateria-minus-vertebrates.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp  15365831090 Jan 21 05:53 eukaryotes-additional.sig.zip
-r--r--r-- 1 ctbrown datalabgrp   1620001977 Jan 18 08:37 eukaryotes-other.sig.zip
-r--r--r-- 1 ctbrown datalabgrp   4874065730 Jan 19 07:57 fungi.sig.zip
-r--r--r-- 1 ctbrown datalabgrp   2317105128 Jan 18 09:50 metazoa-minus-bilateria.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp   7427923974 Jan 21 07:23 plants.k21.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp   9447404252 Jan 21 07:24 plants.k31.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp  11479154613 Jan 21 07:25 plants.k51.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp 119940397180 Jan 21 07:04 vertebrates.sig.zip
```
Note that only the plants are broken out into different k-mer sizes.
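As a quick sanity check on the total, the byte counts in the listing above can be summed directly. This is only a small arithmetic sketch in Python; the sizes are copied from the listing and nothing here touches sourmash itself.

```python
# Byte sizes copied from the directory listing above.
sizes = {
    "bilateria-minus-vertebrates.sig.zip": 48864550095,
    "eukaryotes-additional.sig.zip": 15365831090,
    "eukaryotes-other.sig.zip": 1620001977,
    "fungi.sig.zip": 4874065730,
    "metazoa-minus-bilateria.sig.zip": 2317105128,
    "plants.k21.sig.zip": 7427923974,
    "plants.k31.sig.zip": 9447404252,
    "plants.k51.sig.zip": 11479154613,
    "vertebrates.sig.zip": 119940397180,
}

total_bytes = sum(sizes.values())
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes} bytes = {total_gb:.1f} GB = {total_gib:.1f} GiB")

# Largest files first, to see where the space goes.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size / 2**30:8.1f} GiB  {name}")
```

The 206 figure corresponds to binary units (GiB); in decimal units the same files come to roughly 221 GB, with the vertebrates file accounting for more than half of the total.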
I can give you direct download access - drop me a note at [email protected]. (I'd prefer not to make them the default download because they're so darn big!)
> Any recommendations for contamination screening of de novo genomes? I’m considering lowering k (e.g., 21 or lower) and scaled (e.g., 1000 or 100) values for higher resolution.
So here is where things get interesting 😭 . If you use smaller k sizes, you end up getting a lot of false positives. It seems that what is happening is that 21-mer and 31-mer space is saturated by the sheer size of these genomes (e.g. mistletoe is 100 Gbp!), so you get non-specific k-mer matches. This is based on a number of different experiences, including human decontam + mapping, where we see matches in human WGS that don't map.
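To get a feel for the scale involved, here is a back-of-envelope sketch in Python. It assumes a uniform random-sequence model, which is a simplification: real genomes are repetitive and share conserved sequence, so this only bounds chance collisions and does not capture everything behind the observations above. The function names are made up for illustration.

```python
# Back-of-envelope only: how much of k-mer space could one genome occupy?
# Assumes every k-mer in the genome is distinct (an upper bound) and ignores
# repeats and shared sequence, which this simple model does not capture.

GENOME_BP = 100e9  # ~100 Gbp, the mistletoe figure mentioned above


def canonical_kmer_space(k: int) -> float:
    """Distinct strand-collapsed k-mers for odd k (no reverse-complement palindromes)."""
    return 4**k / 2


def max_fraction_occupied(genome_bp: float, k: int) -> float:
    """Upper bound on the fraction of canonical k-mer space one genome can fill."""
    return min(1.0, genome_bp / canonical_kmer_space(k))


for k in (21, 31, 51):
    space = canonical_kmer_space(k)
    frac = max_fraction_occupied(GENOME_BP, k)
    print(f"k={k}: space={space:.2e} k-mers, one 100 Gbp genome fills <= {frac:.2e}")
```

Under these assumptions, a single 100 Gbp genome can fill up to a few percent of canonical 21-mer space, so chance hits at k=21 are plausible before the rest of the database is even counted. The same model gives vanishingly small fractions at k=31 and k=51, so it neither explains nor rules out the non-specific matches seen at k=31.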
The flip side of this is that if you use k=51 with bacteria, you don't "catch" as much, because k=51 is strain-specific there - basically bacteria seem to evolve faster.
So, we built this approach, https://github.com/ctb/2025-sourmash-subtract-alt-sketch, where you analyze euks at k=51, then subtract the euk matches at k=21 and do bacterial matches.
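A toy sketch of that idea, using plain Python sets rather than sourmash or the code in the linked repository. Everything here is an assumption made for illustration: the sequences are invented, the helper names are hypothetical, small k values stand in for k=51 and k=21, and the threshold is arbitrary.

```python
# Toy illustration only: exact k-mer sets instead of scaled MinHash sketches.
# K_LARGE / K_SMALL are stand-ins for k=51 / k=21; all sequences are made up.

K_LARGE, K_SMALL = 9, 5
MIN_CONTAINMENT = 0.1  # arbitrary threshold for calling a euk match


def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence (forward strand only, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: set[str], ref: set[str]) -> float:
    """Fraction of query k-mers that are also present in ref."""
    return len(query & ref) / len(query) if query else 0.0


euk_refs = {"euk_A": "ACGTTGCAAGGCTTAACCGGTACGATCGA"}
bac_refs = {"bac_B": "TTTGGGCCCAAATTTCGCGCATATGGAAA"}

# A mixed sample: part eukaryotic, part bacterial.
sample = euk_refs["euk_A"][2:22] + bac_refs["bac_B"][3:25]

# Step 1: find eukaryotic matches at the large k-mer size.
sample_large = kmers(sample, K_LARGE)
matched_euks = [
    name for name, seq in euk_refs.items()
    if containment(sample_large, kmers(seq, K_LARGE)) >= MIN_CONTAINMENT
]

# Step 2: subtract the matched eukaryotes' k-mers at the small k-mer size.
remainder = kmers(sample, K_SMALL)
for name in matched_euks:
    remainder -= kmers(euk_refs[name], K_SMALL)

# Step 3: search what is left against bacteria at the small k-mer size.
bac_hits = {
    name: containment(remainder, kmers(seq, K_SMALL))
    for name, seq in bac_refs.items()
}

print("euk matches:", matched_euks)
print("bacterial containment after subtraction:", bac_hits)
```

The point of the ordering is that the eukaryotic call is made only at the larger, more specific k-mer size, while the bacterial search at the smaller size runs on a sample that no longer contains the k-mers of the matched eukaryotes.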
Anyway, tl;dr I don't recommend using k=21 or k=31 with euk databases.
Happy to chat more!
Another workflow dealing with the k=31/k=51 issue, for properly calculating f_unique_weighted: https://github.com/ctb/2025-sourmash-euk-gtdb-tax