
Question: Building & Using Large Custom Ganon Databases with Merged Taxonomies

Open ohickl opened this issue 11 months ago • 1 comment

Hi Vitor,

I'm planning to build a large custom database similar to what I asked about in #227 and have questions regarding both the taxonomy structure and managing the database size.

Goal 1: Merged Taxonomy Database

My primary objective is to build a Ganon database (or set of databases) that effectively combines:

Non-prokaryotic sequences (e.g., Eukaryotes, Viruses) classified using standard NCBI taxonomy.
Prokaryotic sequences (e.g., Bacteria, Archaea) classified using the latest GTDB taxonomy.

The conceptual goal is similar to workflows described for tools like FlexTaxD (FlexTaxD Walkthrough: Merge NCBI and GTDB), where prokaryotic branches of NCBI are replaced with GTDB.

Goal 2: Managing Large Database Size & Parallelism

Given the potentially large size of this combined dataset, I need an efficient way to manage the database build and classification, ideally leveraging a cluster environment.

Questions Regarding Merged Taxonomy:

Based on the documentation and a (superficial) look at the source code, it seems ganon build-custom expects a single, consistent taxonomy input via --taxonomy or pre-formatted --taxonomy-files (nodes.dmp, names.dmp).

  • Is the recommended approach for my goal to use an external tool (like FlexTaxD or custom scripts) to generate merged nodes.dmp and names.dmp files, and then supply these to ganon build-custom via --taxonomy-files alongside a correctly mapped --input-file?
  • Are there known pitfalls or specific formatting requirements for these externally merged taxonomy files? Does Ganon have any native support planned for facilitating such taxonomy merges directly within the tool?

Questions Regarding Database Splitting & Parallelism:

ganon classify supports using multiple database prefixes (--db-prefix db1 db2 ...), potentially in a hierarchy:

  • What is the recommended strategy for splitting a very large dataset (like the NCBI non-prok + GTDB prok described above) into multiple Ganon databases for efficient building and classification on a cluster?

  • Is it feasible to achieve relatively even database sizes across splits to optimize memory usage and parallel processing on cluster nodes? The build process seems dependent on unique k-mer content rather than just sequence volume, making pre-build size estimation hard. Are there any rules of thumb?

  • Specifically, could I build two separate databases:

    1. db_nonprok using --taxonomy ncbi (and NCBI non-prok sequences).
    2. db_prok using --taxonomy gtdb (and GTDB prok sequences).

    Then run classification using ganon classify --db-prefix db_nonprok db_prok ...? Would this effectively achieve the goal of using the appropriate taxonomy for each domain during classification and reporting? Is this approach advisable compared to a single database with an externally merged taxonomy?

  • Since ganon build seems to produce one index per run, does splitting the input data into logical chunks (e.g., by taxonomy) and running multiple ganon build-custom jobs in parallel represent the intended way to parallelize the build process for large datasets?

Thanks for your help and for developing Ganon!

Best

Oskar

ohickl avatar Apr 29 '25 07:04 ohickl

Using a nodes.dmp and names.dmp in the same format as the NCBI taxonomy should work fine with ganon; there's a similar example in the documentation. There's currently no intention to support merging taxonomies within ganon itself. Since NCBI and GTDB are based on very different strategies, I imagine it may not be straightforward to summarize data based on both taxonomies in a proper way (e.g., comparing species abundances between NCBI and GTDB nodes).
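As a rough illustration of what such an externally merged taxonomy involves, here is a minimal sketch (toy taxids and made-up file names, not a substitute for FlexTaxD) that grafts a GTDB-style subtree under an NCBI-style root by re-parenting the GTDB root node in nodes.dmp. A real merge also has to handle names.dmp and avoid taxid collisions between the two sources:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: graft a GTDB-derived subtree under an NCBI-style root.
# All taxids and file names below are invented for illustration.
set -euo pipefail

mkdir -p merged_tax
# Toy NCBI-style nodes.dmp (taxid | parent taxid | rank |): root plus Eukaryota.
printf '1\t|\t1\t|\tno rank\t|\n2759\t|\t1\t|\tsuperkingdom\t|\n' > merged_tax/nodes.dmp
# Toy GTDB-style nodes.dmp: its own root (taxid 100) plus one child (101).
printf '100\t|\t100\t|\tno rank\t|\n101\t|\t100\t|\tsuperkingdom\t|\n' > gtdb_nodes.dmp

# Re-parent the GTDB root (100 -> itself) to the NCBI root (taxid 1),
# then append the GTDB nodes to the merged nodes.dmp.
awk -F'\t' 'BEGIN{OFS="\t"} $1=="100" && $3=="100" {$3="1"} {print}' \
    gtdb_nodes.dmp >> merged_tax/nodes.dmp
```

The resulting merged_tax/nodes.dmp is then what you would pass (together with a correspondingly merged names.dmp) to ganon build-custom via --taxonomy-files.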

Unfortunately, there's no easy way to predict the final database size from the sequence files alone, but you can get an approximate idea from the numbers in the documentation. If you have the memory available, I'd build one database for each set as you suggested (db_nonprok, db_prok); that would work fine for the taxonomies. There's little impact on performance if databases are used at the same hierarchical level for classification.
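For reference, the two-database setup could look roughly like this. Input paths, read files, and thread counts are placeholders; check `ganon build-custom -h` and `ganon classify -h` for the exact options in your installed version:

```shell
# Build the non-prokaryotic database against NCBI taxonomy
# (nonprok_fasta/ is a placeholder directory of input sequences).
ganon build-custom --input nonprok_fasta/ --taxonomy ncbi \
    --db-prefix db_nonprok --threads 16

# Build the prokaryotic database against GTDB taxonomy.
ganon build-custom --input prok_fasta/ --taxonomy gtdb \
    --db-prefix db_prok --threads 16

# Classify against both databases at the same hierarchical level;
# note that both indices are loaded into memory simultaneously.
ganon classify --db-prefix db_nonprok db_prok \
    --paired-reads reads_R1.fq.gz reads_R2.fq.gz \
    --output-prefix results --threads 16
```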

I would only suggest further parallelizing the build into chunks if you really need the flexibility of one index per organism group (for example) or don't have enough memory. But keep in mind that if you want to classify reads against several databases at the same time (same hierarchical level), they all need to be loaded in memory.
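If you do go the chunked route, the builds can run as independent cluster jobs, and `--hierarchy-labels` in ganon classify controls which databases are loaded together: databases sharing a label are loaded and searched simultaneously, while distinct labels are processed sequentially (reads unclassified at one level move on to the next), which limits peak memory. All chunk names, paths, and the SLURM submission below are placeholders:

```shell
# Placeholder sketch: one build job per taxonomic chunk, submitted via SLURM.
for chunk in euk vir; do
    sbatch --wrap "ganon build-custom --input chunks/${chunk}/ \
        --taxonomy ncbi --db-prefix db_${chunk} --threads 16"
done
for chunk in bact arch; do
    sbatch --wrap "ganon build-custom --input chunks/${chunk}/ \
        --taxonomy gtdb --db-prefix db_${chunk} --threads 16"
done

# Classify: db_euk and db_vir share a label (loaded together),
# then db_bact and db_arch are searched in a second pass.
ganon classify \
    --db-prefix db_euk db_vir db_bact db_arch \
    --hierarchy-labels 1_nonprok 1_nonprok 2_prok 2_prok \
    --single-reads reads.fq.gz --output-prefix results --threads 16
```

Note that sequential hierarchy levels change the classification semantics (later levels only see reads left unclassified by earlier ones), so this is a memory/behavior trade-off rather than a free optimization.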

Hope that helps; there were a lot of points. Let me know if you need more details on any of them.

pirovc avatar Apr 30 '25 15:04 pirovc