gtdb_to_taxdump
gtdb_to_taxdump copied to clipboard
Using different starting NCBI taxonomy gives different results
Hey @shenwei356 and @nick-youngblut
Thank you so much for writing the python scripts to convert taxonomy between NCBI and GTDB. It will save so much time as compared to redoing the analysis using the new database!
I do have a query. When I use the NCBI taxonomy of a protein at given species or genus level, I get results that find LCA at genus and species level using ncbi-gtdb_map.py
.
eg-
ncbi_taxonomy gtdb_taxonomy lca_frac target_tax_level
d__Eukaryota unclassified NA NA
s__Gordonibacter pamelaeae s__Gordonibacter pamelaeae 1.0 species
s__Cohaesibacter haloalkalitolerans s__Cohaesibacter haloalkalitolerans 1.0 species
s__Treponema primitia g__Treponema_E 1.0 genus
s__Prevotella bergensis s__Prevotella bergensis 1.0 species
s__Treponema socranskii s__Treponema_D socranskii 0.667 species
But when I group my NCBI taxonomy of a protein at family level, then ncbi-gtdb_map.py
gives me mainly domain level or unclassified taxonomy.
eg- Same protein, but taking family level taxonomy as compared to species or genus level one. ncbi_taxonomy gtdb_taxonomy lca_frac target_tax_level f__Eggerthellaceae d__Bacteria 0.95 domain f__Cohaesibacteraceae d__Bacteria 0.94 domain f__Spirochaetaceae d__Bacteria 0.92 domain f__Prevotellaceae d__Bacteria 0.96 domain f__Porphyromonadaceae d__Bacteria 0.92 domain f__Anaplasmataceae d__Bacteria 0.93 domain
As the ncbi-gtdb_map.py
finds the LCA of a given NCBI taxonomy in GTDB, does this result mean that family level annotation in GTDB is more stringent or there are less representatives giving the final LCA at domain level?
Would you recommend that I use the species/genus level NCBI taxonomy to convert to GTDB?
Happy holidays! Please reply after the holidays, there is no rush!
Thanks for your interest in ncbi-gtdb_map.py
! I hope that you find it to be useful.
I've found the same: most taxonomic classifications at coarser levels than genus/species will just map to the Bacteria domain, unless you reduce the LCA stringency score (--fraction
). This suggest that the coarser-level taxonomies often do not map "cleanly" between NCBI and GTDB.
You can also more stringently filter the genomes used for NCBI-GTDB taxonomy mapping (--completeness
& --contamination
) if you suspect that some poorly assembled genomes are taxonomically mis-classified.
Thank you so much for a prompt reply @nick-youngblut!
I will play with the LCA stringency score and completeness and contamination cut-off. I wanted to ask is redoing the taxonomic annotation with GTDB database more accurate than converting the NCBI annotation to GTDB annotation?
is redoing the taxonomic annotation with GTDB database more accurate than converting the NCBI annotation to GTDB annotation?
I haven't tested that. I guess it would depend on the accuracy of you NCBI taxonomic annotations.
Thanks! I am testing it on a gut metagenome sample, will also let you know if there are huge differences or not. Thanks again for all the help!
Hey @nick-youngblut
I compared the ncbi-gtdb_map.py
script (hereafter called as method1) to taxonomy annotated via gtdb_to_diamond.py
(hereafter called as method2) generated all_gtdb.faa file as a database.
The results are very different!
method1 results- gtdbdomain n 1 d__Archaea 178 2 d__Bacteria 2550
method2 results- bastadomain n 1 Bacteria 141 2 Eukaryota 2587
For method2, I am running diamond blastx of my shotgun sequenced metagenome contigs against gtdb.dmnd database, and then using BASTA https://github.com/timkahlke/BASTA/wiki
to generate LCA from accession2taxid.tsv file. Why do you think there is such a huge difference in the taxonomy of the same file?
Is you dataset likely comprised mostly of Eukaryotes? Note that GTDB is only bacteria and archaea, so all reads from eukaryote genomes will either hit nothing in the GTDB or wrongly hit a prokaryotic genome.
I went back to check the taxon ids assigned to gtdb accession numbers in accession2taxid.tsv generated via gtdb_to_diamond.py script. I think there is something wrong in the accession2taxid.tsv file.
For example,
GCA001818315.1 accession is assigned to taxon id 164650 in the accession2taxid.tsv file. When I manually check the taxon id in the NCBI database, I get a Eukaryote hit https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=164650
.
In GTDB, this accession is d__Bacteria;p__Patescibacteria;c__ABY1;o__UBA9570;f__HO2-48-17;g__WO2-43-9;s__WO2-43-9 sp001818315
Similarly, GCA900321315.1 accession is assigned to taxon id 219029, which also gives a hit to a Eukaryote in NCBI https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=219029
. In GTDB, this accession is d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__F082;g__F082;s__F082 sp900321315
I only checked a couple of accessions that gave a Eukaryote hit in my metagenome. Any suggestions? Maybe the GTDB to NCBI taxon id file is incorrect? Looking forward to your reply!
The gtdb_to_diamond.py script just associates the protein sequence with the accessions via the file paths in the tarball (the accession is in the path for each protein fasta), so I don't see how that could be wrong. You can inspect the files in the tarball to see if something looks wrong.
GTDB to NCBI taxon id file is incorrect
That GTDB-NCBI mapping is just based on the GTDB metadata, so it should be correct, but I'm looking into making some small validation datasets to check that the algorithm that I'm using is working properly.