taxonkit
Create-taxdump not generating subspecies
Hi,
I'm trying to use taxonkit create-taxdump but I have two questions:
- I'm using the following command, but all my "accession" names are being assigned the rank "no rank" instead of subspecies. Am I missing something?
$ taxonkit create-taxdump class.gtdb.tsv --field-accession-as-subspecies --gtdb --out-dir taxdump/
08:18:38.735 [WARN] --gtdb-re-subs failed to extract ID for subspecies, the original value is used instead. e.g., HumGut_30691
08:18:38.971 [INFO] 32264 records saved to taxdump/taxid.map
08:18:39.521 [INFO] 37770 records saved to taxdump/nodes.dmp
08:18:39.861 [INFO] 37770 records saved to taxdump/names.dmp
08:18:39.884 [INFO] 0 records saved to taxdump/merged.dmp
08:18:39.884 [INFO] 0 records saved to taxdump/delnodes.dmp
My input has the following format
MGG00015 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__COE1;s__COE1 sp002358575
MGG00050 d__Bacteria;p__Firmicutes_A;c__Clostridia_A;o__Christensenellales;f__CAG-552;g__MGG03569;s__MGG03569 MGG00050
- Do you have any tips on how to safely integrate this taxdump file with the one provided by NCBI? For example, I want to use this custom GTDB taxdump together with the NCBI viral and fungi database from Kraken2, but I'm worried about conflicts between taxid numbers.
Thank you for your time. Kind regards
https://github.com/shenwei356/gtdb-taxdump#taxonomic-hierarchy
A GTDB species cluster contains >=1 assemblies, each of which can be treated as a strain. So we can assign each assembly a TaxId with the rank of "no rank" below the species rank. Therefore, we can also track the changes of these assemblies via the TaxId later.
Don't worry: the "no rank" node sits directly below the species rank, so it effectively plays the role of a subspecies.
609216830 superkingdom Bacteria
947989846 phylum Firmicutes_A
1797966051 class Clostridia
1853814285 order Lachnospirales
3217231047 family Lachnospiraceae
1880979389 genus COE1
2414110737 species COE1 sp002358575
2538223356 no rank MGG00015
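To make the hierarchy above concrete, here is a minimal sketch that expands one line of the input into one row per rank, with the accession as the final "no rank" node below the species. It assumes the input is tab-separated (accession, then lineage) and that the d__/p__/.../s__ prefixes map to the ranks shown above. `expand_lineage` is a hypothetical helper, not part of taxonkit, and it does not compute TaxIds.

```shell
# Hypothetical helper: expand "accession <TAB> lineage" into rank/name rows.
# Assumes tab-separated input with GTDB-style prefixes (d__, p__, ..., s__).
expand_lineage() {
    awk -F'\t' '
    BEGIN {
        rank["d"] = "superkingdom"; rank["p"] = "phylum"; rank["c"] = "class"
        rank["o"] = "order"; rank["f"] = "family"; rank["g"] = "genus"
        rank["s"] = "species"
    }
    {
        n = split($2, taxa, ";")
        for (i = 1; i <= n; i++) {
            prefix = substr(taxa[i], 1, 1)   # "d", "p", ..., "s"
            name = substr(taxa[i], 4)        # text after the "x__" prefix
            printf "%s\t%s\n", rank[prefix], name
        }
        printf "no rank\t%s\n", $1           # the accession, below the species
    }'
}

printf 'MGG00015\td__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__COE1;s__COE1 sp002358575\n' \
    | expand_lineage
```

The last row it prints is the accession with rank "no rank", which is the same shape as the listing above without the TaxId column.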
Do you have any tips on how to safely integrate this taxdump file with the one provided by NCBI? For example, I want to use this custom GTDB taxdump together with the NCBI viral and fungi database from Kraken2, but I'm worried about conflicts between taxid numbers.
It's a great idea. I think my taxonomic profiling tool KMCP should also use this combined taxonomy. Previously, we used the NCBI taxonomy for reference genomes from both GTDB and RefSeq.
To achieve this, you need to create the taxdump files from both the GTDB lineages and the NCBI lineages of the viruses and fungi in one run.
- Get the 7-rank lineages of the viral and fungal taxa with `taxonkit list | taxonkit reformat`.
- Similarly, get the 7-rank lineages of GTDB, either by directly reformatting the GTDB taxonomy format or with `taxonkit create-taxdump --gtdb` and `taxonkit list | taxonkit reformat`.
- Simply concatenate all the lineages and call `taxonkit create-taxdump`.
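The three steps could be sketched roughly as below. This is an illustration under assumptions, not the tutorial's exact commands: the helper functions, file names, and directory arguments are hypothetical, and the flag choices (`--data-dir`, `--ids`, `--indent`, `--equal-to`, `--taxid-field`, `--format`, `--rank-names`, `--out-dir`) reflect my understanding of the taxonkit CLI and should be checked against `taxonkit <command> --help`. TaxIds 10239 and 4751 are Viruses and Fungi in the NCBI taxonomy; using 1 as the root of the custom GTDB taxdump is an assumption.

```shell
#!/bin/sh
# Sketch only. Helper names, paths, and flag choices are assumptions;
# confirm each flag with `taxonkit <command> --help` before relying on it.

RANKS="superkingdom,phylum,class,order,family,genus,species"

# Print tab-separated 7-rank lineages of every species below one TaxId.
#   $1 = taxdump directory, $2 = root TaxId
export_lineages() {
    taxonkit list --data-dir "$1" --ids "$2" --indent "" \
        | taxonkit filter --data-dir "$1" --equal-to species \
        | taxonkit reformat --data-dir "$1" --taxid-field 1 \
            --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
        | cut -f 2-    # drop the TaxId column, keep the 7 rank columns
}

# Concatenate lineage files, keeping only well-formed 7-column rows.
concat_lineages() {
    awk -F'\t' 'NF == 7' "$@" | sort -u
}

# Build one taxdump from the GTDB and NCBI lineages in a single run.
#   $1 = custom GTDB taxdump dir, $2 = NCBI taxdump dir, $3 = output dir
build_merged_taxdump() {
    export_lineages "$1" 1 > gtdb.lineages.tsv      # assumed root TaxId 1
    {
        export_lineages "$2" 10239                  # Viruses
        export_lineages "$2" 4751                   # Fungi
    } > ncbi.lineages.tsv
    concat_lineages gtdb.lineages.tsv ncbi.lineages.tsv \
        | taxonkit create-taxdump --rank-names "$RANKS" --out-dir "$3"
}
```

With the paths filled in, a call such as `build_merged_taxdump taxdump/ ~/.taxonkit merged-taxdump/` would run all three steps. Because every TaxId is generated in the same `create-taxdump` run, the GTDB and NCBI nodes share one numbering scheme. Note that this sketch stops at the species rank, so the accession-level "no rank" nodes and the accession-to-TaxId mapping are not carried over.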
I'll add the steps to the tutorial, maybe next week (we're on holiday).
@shenwei356 thank you for your reply.
I'll try your instructions and check out KMCP as well.
Best, Lucas
I've added some tutorials on Merging GTDB and NCBI taxonomy, which could help.
- If you need the taxdump files and the taxid.map file mapping genome assembly accessions to TaxIds, please follow Merging the GTDB taxonomy (for prokaryotic genomes from GTDB) and NCBI taxonomy (for genomes from NCBI).
- If you just need the taxdump files, please follow Merging GTDB and NCBI taxonomy.