Support for GTDB taxonomy?

Open nick-youngblut opened this issue 3 years ago • 8 comments

Is your feature related to a problem? Please describe it.

The Genome Taxonomy Database (GTDB) is comprehensive (especially the new v202 release) and more robust than the NCBI microbial taxonomy, especially given that the GTDB taxonomy is based entirely on genome phylogenetic relatedness.

Although the MICOM docs are vague about the taxonomy that one must use, it appears that the NCBI taxonomy is required.

Describe the solution you would like.

Provide direct support for the GTDB taxonomy.

nick-youngblut • May 13 '21 07:05

MICOM doesn't really set any requirements for the taxonomy, but you are right that you usually need the taxonomy of your data to match the taxonomy of the model database.

I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones. If you know of a way to do so, that would be great. Otherwise, we would have to get all the original genomes from the database and classify them, which would be pretty involved because it is not straightforward to get the genomes for the AGORA models, for instance.
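
To illustrate why the two taxonomies need to agree, here is a minimal sketch of name-based matching between sample taxa and a model database. It does not use any MICOM API; `match_taxa` is a hypothetical helper and the genus lists are toy data. The suffixed name stands in for the alphabetic suffixes GTDB appends to some genera, which would not match an NCBI-style name by plain string comparison.

```python
def match_taxa(sample_taxa, model_taxa):
    """Split sample taxa into those with and without a same-named model."""
    models = set(model_taxa)
    unique = set(sample_taxa)
    matched = sorted(t for t in unique if t in models)
    unmatched = sorted(t for t in unique if t not in models)
    return matched, unmatched


# Toy data (assumption): models named with NCBI-style genera,
# samples classified with GTDB-style genera.
model_genera = ["Bacteroides", "Ruminococcus", "Escherichia"]
sample_genera = ["Bacteroides", "Ruminococcus_A", "Escherichia"]

matched, unmatched = match_taxa(sample_genera, model_genera)
print("matched:", matched)
print("unmatched:", unmatched)
```

Any taxon that ends up in `unmatched` would have no model assigned under exact matching, which is why the model database would need to carry GTDB names as well.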

cdiener • May 13 '21 15:05

> I also thought about providing the model databases with different taxonomies but haven't found a good way to map NCBI taxon IDs to GTDB ones

You could use or build on a simple script that I wrote to map the NCBI taxonomy to the GTDB taxonomy: ncbi-gtdb_map.py. It simply uses the metadata provided by the GTDB, which includes NCBI and GTDB taxonomies for each genome.

If you need to map at the taxid level, some of the other scripts in that repo might be useful.
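
The general idea can be sketched as follows. This is not the code of ncbi-gtdb_map.py, just a minimal illustration of the approach under a few assumptions: the metadata is a tab-separated table with one row per genome, the column names are `ncbi_taxonomy` and `gtdb_taxonomy`, and missing NCBI lineages are written as `none`. The function name `build_ncbi_to_gtdb` and all lineages in the toy table are made up. Each NCBI lineage is mapped to the GTDB lineage assigned to most of its genomes.

```python
import csv
import io
from collections import Counter, defaultdict


def build_ncbi_to_gtdb(handle, ncbi_col="ncbi_taxonomy", gtdb_col="gtdb_taxonomy"):
    """Map each NCBI lineage to its most frequent GTDB lineage."""
    votes = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        ncbi, gtdb = row[ncbi_col], row[gtdb_col]
        # Skip genomes without an NCBI lineage (assumed to be "none" or empty).
        if not ncbi or ncbi == "none":
            continue
        votes[ncbi][gtdb] += 1
    # Majority vote per NCBI lineage.
    return {ncbi: counts.most_common(1)[0][0] for ncbi, counts in votes.items()}


# Toy metadata with invented lineages, standing in for the real GTDB file.
toy_metadata = (
    "accession\tncbi_taxonomy\tgtdb_taxonomy\n"
    "G1\tg__Examplea;s__Examplea alpha\tg__Examplea_A;s__Examplea_A alpha\n"
    "G2\tg__Examplea;s__Examplea alpha\tg__Examplea_A;s__Examplea_A alpha\n"
    "G3\tg__Examplea;s__Examplea alpha\tg__Examplea;s__Examplea alpha\n"
    "G4\tnone\tg__Otherus;s__Otherus beta\n"
)

mapping = build_ncbi_to_gtdb(io.StringIO(toy_metadata))
for ncbi, gtdb in mapping.items():
    print(ncbi, "->", gtdb)
```

A majority vote is only one way to resolve NCBI lineages that are split across several GTDB lineages; keeping the full vote counts would make such ambiguous cases visible.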

nick-youngblut • May 13 '21 18:05

Oh cool, will try with that one.

cdiener • May 13 '21 23:05