Create-taxdump not generating subspecies

Open Lucas-Maciel opened this issue 2 years ago • 3 comments

Hi,

I'm trying to use taxonkit create-taxdump but I have two questions:

  1. I'm using the following command, but all my "accession" names are being assigned the rank "no rank" instead of subspecies. Am I missing something?
$ taxonkit create-taxdump class.gtdb.tsv --field-accession-as-subspecies --gtdb --out-dir taxdump/
08:18:38.735 [WARN] --gtdb-re-subs failed to extract ID for subspecies, the original value is used instead. e.g., HumGut_30691
08:18:38.971 [INFO] 32264 records saved to taxdump/taxid.map
08:18:39.521 [INFO] 37770 records saved to taxdump/nodes.dmp
08:18:39.861 [INFO] 37770 records saved to taxdump/names.dmp
08:18:39.884 [INFO] 0 records saved to taxdump/merged.dmp
08:18:39.884 [INFO] 0 records saved to taxdump/delnodes.dmp

My input has the following format

MGG00015        d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__COE1;s__COE1 sp002358575
MGG00050        d__Bacteria;p__Firmicutes_A;c__Clostridia_A;o__Christensenellales;f__CAG-552;g__MGG03569;s__MGG03569 MGG00050
  2. Do you have any tips on how to safely integrate this taxdump file with the one provided by NCBI? I want for example to use this custom GTDB taxdump together with the NCBI viral and fungi database from Kraken2. But I'm worried about the conflicts between taxid numbers.

Thank you for your time.
Kind regards

Lucas-Maciel, Oct 05 '22 06:10

https://github.com/shenwei356/gtdb-taxdump#taxonomic-hierarchy

A GTDB species cluster contains one or more assemblies, each of which can be treated as a strain. So we can assign each assembly a TaxId with the rank "no rank" below the species rank. This also lets us track changes to these assemblies via the TaxId later.

Don't worry about the "no rank": it sits below the species rank, so it effectively belongs to "subspecies".

609216830    superkingdom   Bacteria
947989846    phylum         Firmicutes_A
1797966051   class          Clostridia
1853814285   order          Lachnospirales
3217231047   family         Lachnospiraceae
1880979389   genus          COE1
2414110737   species        COE1 sp002358575
2538223356   no rank        MGG00015
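
One way to confirm this on a generated taxdump is to walk nodes.dmp and check that a "no rank" node has a species among its ancestors. Below is a minimal Python sketch, assuming the usual NCBI-style nodes.dmp layout (fields separated by `\t|\t`, starting with TaxId, parent TaxId, rank). The function names are hypothetical, and the root node (TaxId 1) and the parent links in the demo data are filled in for illustration from the lineage above, not copied from a real file.

```python
def parse_nodes(lines):
    """Parse nodes.dmp-style lines into {taxid: (parent_taxid, rank)}."""
    nodes = {}
    for line in lines:
        line = line.strip("\n")
        if not line:
            continue
        # Fields are separated by "\t|\t"; each line ends with "\t|".
        fields = line.rstrip("\t|").split("\t|\t")
        taxid, parent, rank = int(fields[0]), int(fields[1]), fields[2]
        nodes[taxid] = (parent, rank)
    return nodes


def below_species(nodes, taxid):
    """Return True if any ancestor of taxid has the rank 'species'."""
    seen = set()
    parent = nodes[taxid][0]
    # Stop at an unknown parent or when the root points back to itself.
    while parent in nodes and parent not in seen:
        if nodes[parent][1] == "species":
            return True
        seen.add(parent)
        parent = nodes[parent][0]
    return False


# Lineage from the table above. The root (TaxId 1) and the parent links
# are assumptions made for this demo.
lineage = [
    (1, "no rank"),
    (609216830, "superkingdom"),
    (947989846, "phylum"),
    (1797966051, "class"),
    (1853814285, "order"),
    (3217231047, "family"),
    (1880979389, "genus"),
    (2414110737, "species"),
    (2538223356, "no rank"),
]
demo_lines = []
parent = 1
for taxid, rank in lineage:
    demo_lines.append(f"{taxid}\t|\t{parent}\t|\t{rank}\t|")
    parent = taxid

nodes = parse_nodes(demo_lines)
print(below_species(nodes, 2538223356))
```

To run the same check on real output, the lines of taxdump/nodes.dmp could be passed to `parse_nodes` instead of `demo_lines`.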

shenwei356, Oct 05 '22 14:10

> Do you have any tips on how to safely integrate this taxdump file with the one provided by NCBI? I want for example to use this custom GTDB taxdump together with the NCBI viral and fungi database from Kraken2. But I'm worried about the conflicts between taxid numbers.

It's a great idea. I think my taxonomic profiling tool KMCP should also use this combined taxonomy. Previously, we used the NCBI taxonomy for reference genomes from GTDB and RefSeq.

To achieve this, you need to create the taxdump files from both the GTDB lineages and the NCBI lineages of viruses and fungi in one run.

  1. Get the 7-rank lineages of the viral and fungal taxa with taxonkit list | taxonkit reformat.
  2. Similarly, get the 7-rank lineages of GTDB, either by directly reformatting the GTDB taxonomy format or with taxonkit create-taxdump --gtdb followed by taxonkit list | taxonkit reformat.
  3. Simply concatenate all the lineages and call taxonkit create-taxdump.
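
For the "directly reformatting" route in step 2, the GTDB lineage strings can be split into seven plain columns so that they line up with the 7-rank NCBI lineages before concatenation. Below is a minimal Python sketch, assuming a two-column input (accession, then a `d__...;s__...` lineage) like the one posted earlier in this thread. The function name and the output layout (seven ranks first, accession last) are illustrative choices, not something taxonkit prescribes, so the column options given to taxonkit create-taxdump would have to match whatever layout is actually used.

```python
GTDB_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def gtdb_to_columns(line):
    """Turn 'ACCESSION  d__..;p__..;...;s__..' into a tab-separated row."""
    # Split only on the first whitespace run: species names contain spaces.
    accession, lineage = line.rstrip("\n").split(None, 1)
    taxa = lineage.split(";")
    if len(taxa) != len(GTDB_PREFIXES):
        raise ValueError(f"expected 7 ranks, got {len(taxa)}: {line!r}")
    names = []
    for prefix, taxon in zip(GTDB_PREFIXES, taxa):
        taxon = taxon.strip()
        if not taxon.startswith(prefix):
            raise ValueError(f"expected prefix {prefix!r} in {taxon!r}")
        names.append(taxon[len(prefix):])
    # Seven rank columns followed by the accession (an assumed layout).
    return "\t".join(names + [accession])


example = (
    "MGG00015        d__Bacteria;p__Firmicutes_A;c__Clostridia;"
    "o__Lachnospirales;f__Lachnospiraceae;g__COE1;s__COE1 sp002358575"
)
row = gtdb_to_columns(example)
print(row)
```

Rows produced this way could then be concatenated with the NCBI-derived lineages, provided both use the same column order.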

I'll add the steps to the tutorial, maybe next week (We're on holiday).

shenwei356, Oct 05 '22 15:10

@shenwei356 thank you for your reply.

I'll try your instructions and check out KMCP as well.

Best, Lucas

Lucas-Maciel, Oct 11 '22 09:10

I've added some tutorials on Merging GTDB and NCBI taxonomy, which could help.

shenwei356, Dec 29 '22 02:12