gtdb-taxdump

question on GTDBr214.1 gtdb taxdump file regarding taxID

Open aababc1 opened this issue 7 months ago • 6 comments

Hello, and thank you for your nice work. I downloaded the GTDB taxonomy files you created (the GTDB R214.1 taxdump files) to build a Kraken database.

I just found that one specific taxon does not follow the seven-level taxonomy (domain to species). I checked it directly in the downloaded taxonomy files too.

$ grep 1830337315 *
GTDBr214.1_taxid_taxonomy:GCA_003162175.1 1830337315 Archaea;Halobacteriota;Bog-38;Bog-38 sp003162175;003162175

$ taxonkit lineage <(echo 1830337315) --data-dir /data1/DBs/kraken2/gtdbr214.1/gtdb-taxdump/R214.1/
1830337315 Archaea;Halobacteriota;Bog-38;Bog-38 sp003162175;003162175
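To find every affected entry rather than checking one TaxId at a time, one option is to count the levels in each lineage. The following is a minimal sketch, not part of gtdb-taxdump. It assumes the taxid_taxonomy file is tab-separated (accession, TaxId, lineage) and that a complete lineage has 8 semicolon-separated names (seven ranks plus the assembly-level leaf), as in the R220 output later in this thread. The function name and the placeholder accession are hypothetical.

```python
# Assumption: a complete lineage has 7 ranks (domain..species) plus the assembly-level leaf.
EXPECTED_LEVELS = 8


def incomplete_lineages(lines, expected=EXPECTED_LEVELS):
    """Yield (accession, taxid, n_levels) for lineages with fewer levels than expected."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: three tab-separated columns; the lineage itself may contain spaces.
        accession, taxid, lineage = line.split("\t", 2)
        n_levels = len(lineage.split(";"))
        if n_levels < expected:
            yield accession, taxid, n_levels


if __name__ == "__main__":
    sample = [
        # Lineage reported in this issue (R214.1).
        "GCA_003162175.1\t1830337315\t"
        "Archaea;Halobacteriota;Bog-38;Bog-38 sp003162175;003162175",
        # Complete lineage from the R220 output; the accession is a placeholder.
        "PLACEHOLDER_ACCESSION\t1662163052\t"
        "Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;"
        "WRAA01 sp009780445;009780445",
    ]
    for row in incomplete_lineages(sample):
        print(*row, sep="\t")
```

With a real file, the same function can be fed an open file handle, for example `incomplete_lineages(open("GTDBr214.1_taxid_taxonomy"))`.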

It seems that names shared by taxa at different ranks have somehow been removed.

On the official GTDB site, the same name does appear at different ranks. [image]

I don't know whether the names are removed when taxonkit runs or when the taxdump files are created. Could you look into it?

Thank you very much.

aababc1 avatar Dec 06 '23 14:12 aababc1

Yes, it's "removed" during taxdump file creation.

There is some documentation in the help message:

$ taxonkit create-taxdump -h
Attentions:
  1. Names should be distinct in taxa of different ranks.
     But for these missing some taxon nodes, using names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     the Class and Genus have the same name B47-G6, and the Order and Family
     between them have different names. In this case, we reassign a new TaxId
     by increasing the TaxId until it being distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
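The two cases described in the help message can be told apart mechanically. The following is a minimal sketch, not taxonkit's implementation: it only inspects a GTDB-style lineage string, reports names used at more than one rank, and checks whether those ranks are adjacent (a child reusing its parent's name) or separated by differently named ranks. The function names are hypothetical.

```python
from collections import defaultdict

RANKS = "dpcofgs"  # GTDB rank prefixes, domain to species


def repeated_names(lineage):
    """Map each name used at more than one rank to the rank prefixes that use it."""
    ranks_by_name = defaultdict(list)
    for field in lineage.split(";"):
        prefix, name = field.split("__", 1)
        ranks_by_name[name].append(prefix)
    return {name: ranks for name, ranks in ranks_by_name.items() if len(ranks) > 1}


def is_contiguous(ranks):
    """True if the ranks sharing a name are adjacent in the rank order."""
    positions = [RANKS.index(rank) for rank in ranks]
    return positions == list(range(positions[0], positions[0] + len(positions)))


if __name__ == "__main__":
    examples = [
        # Child taxa reuse the parent's name (first case in the help message).
        "d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155",
        # Class and genus share a name, with other names in between (second case).
        "d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585",
    ]
    for lineage in examples:
        for name, ranks in repeated_names(lineage).items():
            kind = "adjacent ranks" if is_contiguous(ranks) else "non-adjacent ranks"
            print(name, ",".join(ranks), kind, sep="\t")
```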

shenwei356 avatar Dec 06 '23 21:12 shenwei356

Thank you for your reply. I am wondering how this affects downstream analysis.

When I used your taxdump files to build a Kraken2 database from the 85,202 GTDB R214.1 species-representative genomes, the Kraken2 and Bracken report files were missing specific taxonomic units.

As I understand it, GTDB taxa that carry placeholder names, such as the case I mentioned, are genuine taxa that should be considered in the analysis.

In the Kraken2 report, intermediate taxa with duplicated names (such as class, order, and family) are simply omitted, which would affect abundance analysis at those ranks.

Or I may have misunderstood how Kraken2 and Bracken process the taxonomy.

When I converted the Bracken report to an mpa-style taxonomic composition file using KrakenTools, provided by the Kraken2 authors, it produced output like this:

bracken2mpa:d__Archaea|p__Halobacteriota|c__Bog-38 0.0002
bracken2mpa:d__Archaea|p__Halobacteriota|c__Bog-38|s__Bog-38_sp003158275 0.0002
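The gap in such output can be made explicit by listing which rank prefixes are skipped in each mpa-style path. The following is a minimal sketch, not part of KrakenTools: it assumes the `d__`, `p__`, ... `s__` prefixes shown above, and the function name is hypothetical.

```python
RANKS = ["d", "p", "c", "o", "f", "g", "s"]  # prefixes as they appear in the output above


def missing_ranks(mpa_path):
    """Return rank prefixes absent above the deepest rank present in the path."""
    present = {clade.split("__", 1)[0] for clade in mpa_path.split("|")}
    depths = [i for i, rank in enumerate(RANKS) if rank in present]
    if not depths:
        return []
    return [rank for rank in RANKS[: depths[-1] + 1] if rank not in present]


if __name__ == "__main__":
    paths = [
        "d__Archaea|p__Halobacteriota|c__Bog-38",
        "d__Archaea|p__Halobacteriota|c__Bog-38|s__Bog-38_sp003158275",
    ]
    for path in paths:
        print(path, missing_ranks(path), sep="\t")
```

For the species-level path above, the skipped prefixes are `o`, `f`, and `g`, which matches the order, family, and genus levels missing from the lineage.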

At the species, phylum, and class levels the reports will be complete, but I suspect the information at the order and family levels will simply be lost.

Although Kraken2 and Bracken are not part of gtdb-taxdump, their analyses depend heavily on the taxonomy files, so I would like to hear your thoughts on this.

Thank you very much

aababc1 avatar Dec 07 '23 04:12 aababc1

I understand your worries. In practice, we only summarize at the phylum and species ranks.

Besides, predictions with an abundance as low as 0.0002 are probably false positives.

You could also ask whether KrakenTools can support these cases.
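Summarizing only at the phylum and species ranks, and dropping very low abundances, could be sketched as follows. This is a hedged illustration, not a KrakenTools feature: it assumes mpa-style lines of the form `path<whitespace>value`, the function name and the `min_abundance` parameter are hypothetical, and a suitable cutoff depends on the data.

```python
def summarize(mpa_lines, ranks=("p", "s"), min_abundance=0.0):
    """Keep clades whose deepest rank is in `ranks` and whose abundance exceeds the cutoff."""
    kept = {}
    for line in mpa_lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and an optional header
            continue
        path, value = line.rsplit(None, 1)
        prefix, name = path.split("|")[-1].split("__", 1)
        abundance = float(value)
        if prefix in ranks and abundance > min_abundance:
            kept[(prefix, name)] = abundance
    return kept


if __name__ == "__main__":
    # Abundance values are illustrative; the 0.0002 rows mirror the output quoted above.
    lines = [
        "d__Archaea|p__Halobacteriota\t0.01",
        "d__Archaea|p__Halobacteriota|c__Bog-38\t0.0002",
        "d__Archaea|p__Halobacteriota|c__Bog-38|s__Bog-38_sp003158275\t0.0002",
    ]
    for (prefix, name), abundance in summarize(lines, min_abundance=0.0002).items():
        print(prefix, name, abundance, sep="\t")
```

Because only the last clade of each path is inspected, missing intermediate ranks do not affect the phylum-level or species-level summary.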

shenwei356 avatar Dec 07 '23 08:12 shenwei356

Okay, I see your point. There are some viable approaches I could adopt. Thank you very much for your comment.

aababc1 avatar Dec 07 '23 09:12 aababc1

Hello. I am now using the GTDB R220 gtdb-taxdump files. I previously asked about missing taxonomic ranks in gtdb-taxdump. The taxonomy information now seems to have been changed to align with GTDB's classification system: https://github.com/shenwei356/taxonkit/issues/92.

I have two questions about this.

1. In the R220 GTDB taxonomy files, are you using the full taxonomy, as changed in taxonkit 0.16.2 (which allows duplicated names at different ranks)?
2. If so, can previous taxonkit versions no longer be used with the updated GTDB taxonomy?

Thank you for your great contribution wei.

aababc1 avatar Apr 29 '24 10:04 aababc1

> 1. In the R220 GTDB taxonomy files, are you using the full taxonomy, as changed in taxonkit 0.16.2 (which allows duplicated names at different ranks)?

Yes. https://github.com/shenwei356/taxonkit/issues/92#issuecomment-1979758849

> 2. If so, can previous taxonkit versions no longer be used with the updated GTDB taxonomy?

0.16.2 is not released yet :). It's 0.16.0.

> can previous taxonkit versions no longer be used with the updated GTDB taxonomy?

Old taxonkit versions can still be used with the updated GTDB taxonomy, because the taxdump file format has not changed.

taxonkit v0.2.5 (Oct 12, 2018)

$ ./taxonkit version 
taxonkit v0.2.5

Checking new version...
New version available: taxonkit v0.16.0 at https://github.com/shenwei356/taxonkit/releases/tag/v0.16.0

$ echo 1662163052 | ./taxonkit lineage --data-dir gtdb-taxdump/R220/
1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445;009780445

The latest version:

$ taxonkit version 
taxonkit v0.16.0
$ echo 1662163052 | taxonkit lineage --data-dir gtdb-taxdump/R220/
1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445;009780445
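One way to see why the file format matters more than the tool version: resolving a lineage only needs the parent pointers in nodes.dmp and the names in names.dmp, and nodes are keyed by TaxId, so two ranks can carry the same name. The following is a minimal sketch, not taxonkit's implementation. It assumes the NCBI-style `|`-delimited taxdump layout; every TaxId except 1662163052 is made up for illustration, and the rank labels in the toy data are illustrative only.

```python
def parse_dmp(lines):
    """Split NCBI-style taxdump lines ('a\\t|\\tb\\t|\\t...') into stripped fields."""
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if fields and fields[0]:
            yield fields


def load_taxdump(nodes_lines, names_lines):
    """Return (parent, name) dicts keyed by TaxId (kept as strings)."""
    parent = {}
    for taxid, parent_id, *_ in parse_dmp(nodes_lines):
        parent[taxid] = parent_id
    name = {}
    for fields in parse_dmp(names_lines):
        if fields[3] == "scientific name":
            name[fields[0]] = fields[1]
    return parent, name


def lineage(taxid, parent, name):
    """Walk parent pointers up to the root (which points to itself); the root is omitted."""
    names = []
    while parent[taxid] != taxid:
        names.append(name[taxid])
        taxid = parent[taxid]
    return ";".join(reversed(names))


# Toy data: (TaxId, parent TaxId, rank label, name). Only 1662163052 comes from this thread.
TOY = [
    ("1", "1", "no rank", "root"),
    ("2", "1", "superkingdom", "Bacteria"),
    ("3", "2", "phylum", "Bacillota_A"),
    ("4", "3", "class", "Clostridia"),
    ("5", "4", "order", "Lachnospirales"),
    ("6", "5", "family", "WRAA01"),
    ("7", "6", "genus", "WRAA01"),
    ("8", "7", "species", "WRAA01 sp009780445"),
    ("1662163052", "8", "no rank", "009780445"),
]
TOY_NODES = [f"{t}\t|\t{p}\t|\t{r}\t|\n" for t, p, r, _ in TOY]
TOY_NAMES = [f"{t}\t|\t{n}\t|\t\t|\tscientific name\t|\n" for t, _, _, n in TOY]

if __name__ == "__main__":
    parent, name = load_taxdump(TOY_NODES, TOY_NAMES)
    print("1662163052", lineage("1662163052", parent, name), sep="\t")
```

In the toy data the family and genus both carry the name WRAA01 but have different TaxIds, so the walk reports both levels.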

shenwei356 avatar Apr 29 '24 12:04 shenwei356