gtdb_to_taxdump Using different starting NCBI taxonomy gives different results

Hey @shenwei356 and @nick-youngblut

Thank you so much for writing the python scripts to convert taxonomy between NCBI and GTDB. It will save so much time as compared to redoing the analysis using the new database!

I do have a query. When I use the NCBI taxonomy of a protein at given species or genus level, I get results that find LCA at genus and species level using ncbi-gtdb_map.py. eg- ncbi_taxonomy gtdb_taxonomy lca_frac target_tax_level d__Eukaryota unclassified NA NA s__Gordonibacter pamelaeae s__Gordonibacter pamelaeae 1.0 species s__Cohaesibacter haloalkalitolerans s__Cohaesibacter haloalkalitolerans 1.0 species s__Treponema primitia g__Treponema_E 1.0 genus s__Prevotella bergensis s__Prevotella bergensis 1.0 species s__Treponema socranskii s__Treponema_D socranskii 0.667 species

But when I group my NCBI taxonomy of a protein at family level, then ncbi-gtdb_map.py gives me mainly domain level or unclassified taxonomy.

eg- Same protein, but taking family level taxonomy as compared to species or genus level one. ncbi_taxonomy gtdb_taxonomy lca_frac target_tax_level f__Eggerthellaceae d__Bacteria 0.95 domain f__Cohaesibacteraceae d__Bacteria 0.94 domain f__Spirochaetaceae d__Bacteria 0.92 domain f__Prevotellaceae d__Bacteria 0.96 domain f__Porphyromonadaceae d__Bacteria 0.92 domain f__Anaplasmataceae d__Bacteria 0.93 domain

As the ncbi-gtdb_map.py finds the LCA of a given NCBI taxonomy in GTDB, does this result mean that family level annotation in GTDB is more stringent or there are less representatives giving the final LCA at domain level? Would you recommend that I use the species/genus level NCBI taxonomy to convert to GTDB?

Happy holidays! Please reply after the holidays, there is no rush!

Dec 30 '20 04:12 Jigyasa3

Thanks for your interest in ncbi-gtdb_map.py! I hope that you find it to be useful.

I've found the same: most taxonomic classifications at coarser levels than genus/species will just map to the Bacteria domain, unless you reduce the LCA stringency score (--fraction). This suggest that the coarser-level taxonomies often do not map "cleanly" between NCBI and GTDB. You can also more stringently filter the genomes used for NCBI-GTDB taxonomy mapping (--completeness & --contamination) if you suspect that some poorly assembled genomes are taxonomically mis-classified.

Dec 30 '20 05:12 nick-youngblut

Thank you so much for a prompt reply @nick-youngblut!

I will play with the LCA stringency score and completeness and contamination cut-off. I wanted to ask is redoing the taxonomic annotation with GTDB database more accurate than converting the NCBI annotation to GTDB annotation?

Dec 30 '20 06:12 Jigyasa3

is redoing the taxonomic annotation with GTDB database more accurate than converting the NCBI annotation to GTDB annotation?

I haven't tested that. I guess it would depend on the accuracy of you NCBI taxonomic annotations.

Dec 30 '20 06:12 nick-youngblut

Thanks! I am testing it on a gut metagenome sample, will also let you know if there are huge differences or not. Thanks again for all the help!

Dec 30 '20 06:12 Jigyasa3

Hey @nick-youngblut

I compared the ncbi-gtdb_map.py script (hereafter called as method1) to taxonomy annotated via gtdb_to_diamond.py (hereafter called as method2) generated all_gtdb.faa file as a database. The results are very different!

method1 results- gtdbdomain n 1 d__Archaea 178 2 d__Bacteria 2550

method2 results- bastadomain n 1 Bacteria 141 2 Eukaryota 2587

For method2, I am running diamond blastx of my shotgun sequenced metagenome contigs against gtdb.dmnd database, and then using BASTA https://github.com/timkahlke/BASTA/wiki to generate LCA from accession2taxid.tsv file. Why do you think there is such a huge difference in the taxonomy of the same file?

Jan 08 '21 06:01 Jigyasa3

Is you dataset likely comprised mostly of Eukaryotes? Note that GTDB is only bacteria and archaea, so all reads from eukaryote genomes will either hit nothing in the GTDB or wrongly hit a prokaryotic genome.

Jan 10 '21 07:01 nick-youngblut

I went back to check the taxon ids assigned to gtdb accession numbers in accession2taxid.tsv generated via gtdb_to_diamond.py script. I think there is something wrong in the accession2taxid.tsv file.

For example, GCA001818315.1 accession is assigned to taxon id 164650 in the accession2taxid.tsv file. When I manually check the taxon id in the NCBI database, I get a Eukaryote hit https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=164650. In GTDB, this accession is d__Bacteria;p__Patescibacteria;c__ABY1;o__UBA9570;f__HO2-48-17;g__WO2-43-9;s__WO2-43-9 sp001818315

Similarly, GCA900321315.1 accession is assigned to taxon id 219029, which also gives a hit to a Eukaryote in NCBI https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=219029. In GTDB, this accession is d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__F082;g__F082;s__F082 sp900321315

I only checked a couple of accessions that gave a Eukaryote hit in my metagenome. Any suggestions? Maybe the GTDB to NCBI taxon id file is incorrect? Looking forward to your reply!

Jan 11 '21 02:01 Jigyasa3

The gtdb_to_diamond.py script just associates the protein sequence with the accessions via the file paths in the tarball (the accession is in the path for each protein fasta), so I don't see how that could be wrong. You can inspect the files in the tarball to see if something looks wrong.

GTDB to NCBI taxon id file is incorrect

That GTDB-NCBI mapping is just based on the GTDB metadata, so it should be correct, but I'm looking into making some small validation datasets to check that the algorithm that I'm using is working properly.

Jan 11 '21 14:01 nick-youngblut

gtdb_to_taxdump gtdb_to_taxdump copied to clipboard

Using different starting NCBI taxonomy gives different results

gtdb_to_taxdump
gtdb_to_taxdump copied to clipboard