Access to full GenBank eukaryotic database

Open zilov opened this issue 7 months ago • 1 comments

Hi @ctb,

Thank you for developing sourmash! I’m working on contamination screening for de novo assemblies and exploring various approaches. The k=51, scaled=10,000 database you provided (#3504) performs impressively: 7,000 scaffold signatures screened against all eukaryotes in ~20 minutes with the multisearch script from branchwater.

However, I’m concerned that k=51 with a high scaled value may miss some assembly contigs that are not in GenBank and so could lack exact k-mer matches. I’m currently building signatures for all complete and chromosome-level GenBank assemblies, but I’d like to explore the full GenBank eukaryotic database.
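To put rough numbers on that concern: a scaled sketch keeps about 1 in `scaled` of a sequence's k-mers, so short contigs end up with very few hashes at scaled=10,000. The helper below is only a back-of-envelope sketch; its name and the contig lengths are made up for illustration, and it ignores duplicate k-mers.

```python
def expected_sketch_hashes(seq_len: int, ksize: int, scaled: int) -> float:
    """Approximate number of hashes a scaled sketch keeps for one sequence.

    Assumes every k-mer is distinct and that roughly 1/scaled of them are retained.
    """
    n_kmers = max(seq_len - ksize + 1, 0)  # k-mers in a sequence of this length
    return n_kmers / scaled


if __name__ == "__main__":
    # Hypothetical contig lengths, chosen only for illustration.
    for seq_len in (10_000, 50_000, 1_000_000):
        for scaled in (10_000, 1_000, 100):
            n = expected_sketch_hashes(seq_len, ksize=51, scaled=scaled)
            print(f"contig={seq_len:>9,} bp  scaled={scaled:>6,}  ~{n:,.1f} hashes")
```

Under these assumptions, a 50 kb contig at k=51 and scaled=10,000 would carry only about five hashes, so a single missed k-mer changes the picture noticeably.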

Could you share:

  • The size of the full k=21/31/51, scaled=1,000 databases?
  • Is access to these databases possible?
  • Any recommendations for contamination screening of de-novo genomes? I’m considering lowering k (e.g., 21 or lower) and scaled (e.g. 1000 or 100) values for higher resolution.

Thank you for your time and insights!

zilov, Apr 14 '25 13:04

Hi @zilov, apologies for the long delay in responding.

The size of the full k=21/31/51, scaled=1000 databases is 206 GB. Here's the directory listing:

-r--r--r-- 1 ctbrown datalabgrp  48864550095 Jan 20 13:38 bilateria-minus-vertebrates.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp  15365831090 Jan 21 05:53 eukaryotes-additional.sig.zip
-r--r--r-- 1 ctbrown datalabgrp   1620001977 Jan 18 08:37 eukaryotes-other.sig.zip
-r--r--r-- 1 ctbrown datalabgrp   4874065730 Jan 19 07:57 fungi.sig.zip
-r--r--r-- 1 ctbrown datalabgrp   2317105128 Jan 18 09:50 metazoa-minus-bilateria.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp   7427923974 Jan 21 07:23 plants.k21.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp   9447404252 Jan 21 07:24 plants.k31.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp  11479154613 Jan 21 07:25 plants.k51.sig.zip
-r--r--r-- 1 ctbrown ctbrowngrp 119940397180 Jan 21 07:04 vertebrates.sig.zip

Note that only the plants are broken out into different k-mer sizes.
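As a quick sanity check on the total, the byte counts in the listing can simply be summed. The snippet below copies the sizes from the listing; the variable names are just for illustration. The sum comes to about 221 GB in decimal units, or about 206 GiB, so the figure quoted above appears to be in binary units.

```python
# File sizes in bytes, copied from the directory listing above.
sizes = {
    "bilateria-minus-vertebrates.sig.zip": 48864550095,
    "eukaryotes-additional.sig.zip": 15365831090,
    "eukaryotes-other.sig.zip": 1620001977,
    "fungi.sig.zip": 4874065730,
    "metazoa-minus-bilateria.sig.zip": 2317105128,
    "plants.k21.sig.zip": 7427923974,
    "plants.k31.sig.zip": 9447404252,
    "plants.k51.sig.zip": 11479154613,
    "vertebrates.sig.zip": 119940397180,
}

total_bytes = sum(sizes.values())
print(f"total: {total_bytes:,} bytes")         # total: 221,336,434,039 bytes
print(f"  = {total_bytes / 1e9:.1f} GB")       #   = 221.3 GB
print(f"  = {total_bytes / 2**30:.1f} GiB")    #   = 206.1 GiB
```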

I can give you direct download access - drop me a note at [email protected]. (I'd prefer not to make them the default download because they're so darn big!)

> Any recommendations for contamination screening of de-novo genomes? I’m considering lowering k (e.g., 21 or lower) and scaled (e.g. 1000 or 100) values for higher resolution.

So here is where things get interesting 😭. If you use smaller k sizes, you end up getting a lot of false positives. It seems that 21-mer and 31-mer space is saturated by the sheer size of these genomes (e.g. mistletoe is 100 Gbp!), so you get non-specific k-mer matches. This is based on a number of different experiences, including human decontam + mapping, where we find things in human WGS that don't map.
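A back-of-envelope way to see why k matters so much here, under a deliberately naive assumption that database k-mers are drawn uniformly at random: the chance that an unrelated k-mer hits the database by accident is roughly 1 - exp(-N / S), where N is the number of distinct database k-mers and S = 4^k / 2 is the number of canonical k-mers for odd k. The function name and the round figure of 1e11 k-mers (on the scale of the 100 Gbp genome mentioned above) are assumptions for illustration, not measurements.

```python
import math


def chance_match_probability(n_db_kmers: float, ksize: int) -> float:
    """Probability that a random k-mer is present in a database by chance.

    Toy model: database k-mers are uniform random draws from canonical k-mer
    space (4**k / 2 for odd k). Real genomes are far from uniform.
    """
    space = 4 ** ksize / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-n_db_kmers / space)


if __name__ == "__main__":
    n_db = 1e11  # hypothetical: distinct k-mers on the scale of a 100 Gbp genome
    for k in (21, 31, 51):
        print(f"k={k}: p(chance match) ~ {chance_match_probability(n_db, k):.3g}")
```

Real genomes are repetitive and share conserved and low-complexity sequence, so this model likely understates non-specific matching, particularly at k=31. It is only meant to show how quickly the available k-mer space grows with k.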

The flip side of this is that if you use k=51 with bacteria, you don't "catch" as much, because k=51 is strain specific there - basically bacteria seem to evolve faster.

So, we built this approach, https://github.com/ctb/2025-sourmash-subtract-alt-sketch, where you analyze euks at k=51, then subtract the euk matches at k=21 and do the bacterial matches.
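The general idea can be illustrated with a toy example on plain Python sets of k-mers. This is not the code from that repository: the function names, the containment threshold, and the random toy sequences are all made up for illustration, and it skips hashing, scaled subsampling, and reverse complements.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq (no hashing, no reverse complement)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, ref):
    """Fraction of query k-mers that are also in ref."""
    return len(query & ref) / len(query) if query else 0.0


def screen(sample, euk_refs, bact_refs, k_euk=51, k_bact=21, threshold=0.1):
    # Step 1: find eukaryotic matches at the larger, more specific k.
    sample_euk = kmers(sample, k_euk)
    euk_hits = [name for name, seq in euk_refs.items()
                if containment(sample_euk, kmers(seq, k_euk)) >= threshold]

    # Step 2: subtract the matched eukaryotes' k-mers at the smaller k.
    remaining = kmers(sample, k_bact)
    for name in euk_hits:
        remaining -= kmers(euk_refs[name], k_bact)

    # Step 3: search what is left against bacteria at the smaller k.
    bact_hits = {name: containment(remaining, kmers(seq, k_bact))
                 for name, seq in bact_refs.items()}
    return euk_hits, bact_hits


rng = random.Random(0)  # fixed seed so the toy sequences are reproducible


def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


toy_euk, toy_bact = random_dna(500), random_dna(300)
sample = toy_euk[100:400] + toy_bact[50:250]  # a "contaminated" toy sample
euk_hits, bact_hits = screen(sample, {"toy_euk": toy_euk}, {"toy_bact": toy_bact})
print(euk_hits, bact_hits)
```

The point of the ordering is that the eukaryotic decision is made at k=51, where matches are specific, while the bacterial search still runs at the smaller k, where it is more sensitive.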

Anyway, tl;dr I don't recommend using k=21 or k=31 with euk databases.

Happy to chat more!

ctb, Jun 22 '25 17:06

Another workflow dealing with the k=31/k=51 issue, for properly calculating f_unique_weighted: https://github.com/ctb/2025-sourmash-euk-gtdb-tax

ctb, Sep 20 '25 16:09