taxonkit
taxids created with `create-taxdump` skip numbers
When you create a taxdump using `create-taxdump` (ICTV taxonomy, for example), the taxids "skip" some numbers. For example:
$ head ictv-taxdump/names.dmp
1 | root | | scientific name |
287205 | Hoswirudivirus MRV1 | | scientific name |
287935 | Shomudavirus limadaptatum | | scientific name |
1096518 | Sclerotimonavirus betaclarireediae | | scientific name |
1138752 | Potato virus H | | scientific name |
1536674 | Rhopapillomavirus 1 | | scientific name |
1845995 | Monomorium pharaonis virus 1 | | scientific name |
1890985 | Aquamavirus A | | scientific name |
2079526 | Hylipavirus | | scientific name |
2290567 | Fattrevirus | | scientific name |
This is not a problem in itself, as the nodes are still connected. However, it causes a bug when you try to create an MMseqs2 taxonomy database from the custom taxonomy, as MMseqs2 apparently assumes that no numbers are skipped (unless they are listed in delnodes.dmp or merged.dmp, I guess).
I wrote a script that remaps the taxids so that no number is skipped, and that solved the issue. After remapping, the same file looks like this:
$ head ictv-taxdump/names.dmp
1 | root | | scientific name |
2 | Hoswirudivirus MRV1 | | scientific name |
3 | Shomudavirus limadaptatum | | scientific name |
4 | Sclerotimonavirus betaclarireediae | | scientific name |
5 | Potato virus H | | scientific name |
6 | Rhopapillomavirus 1 | | scientific name |
7 | Monomorium pharaonis virus 1 | | scientific name |
8 | Aquamavirus A | | scientific name |
9 | Hylipavirus | | scientific name |
10 | Fattrevirus | | scientific name |
This is not a TaxonKit bug in any way. But because MMseqs2 is pretty popular, I thought it best to report it here in case anyone else runs into the same issue.
Yes, NCBI taxonomy uses consecutive numbers too. I guess they have a mapping table to maintain these relationships.
For reference, this is the script I used to make the taxids sequential: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py
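For anyone who wants the gist without opening the link, here is a minimal Python sketch of the remapping idea. It is not the linked script, and the function names (`parse_fields`, `build_mapping`, `remap_lines`) are made up for illustration. It assumes the standard taxdump layout (fields separated by `\t|\t`, lines ending in `\t|`), that root has taxid 1 and is the smallest taxid, and that only `names.dmp` and `nodes.dmp` need rewriting. The parent links in the toy data are invented, not the real ICTV hierarchy.

```python
SEP = "\t|\t"   # field separator used in taxdump files
END = "\t|"     # terminator at the end of every taxdump line


def parse_fields(line):
    """Split one taxdump line into its fields."""
    body = line.rstrip("\n")
    if body.endswith(END):
        body = body[: -len(END)]
    return body.split(SEP)


def build_mapping(node_lines):
    """Map each original taxid to a consecutive one (1, 2, 3, ...).

    Sorting keeps the original order; root stays 1 as long as it is
    the smallest taxid (an assumption of this sketch).
    """
    taxids = sorted({int(parse_fields(l)[0]) for l in node_lines if l.strip()})
    return {old: new for new, old in enumerate(taxids, start=1)}


def remap_lines(lines, mapping, columns):
    """Rewrite the taxid columns of each line using the mapping."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        fields = parse_fields(line)
        for i in columns:
            fields[i] = str(mapping[int(fields[i])])
        out.append(SEP.join(fields) + END + "\n")
    return out


# Toy data: names are from the example above, parent links are invented.
nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "287205\t|\t1\t|\tspecies\t|\n",
    "287935\t|\t287205\t|\tspecies\t|\n",
]
names = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "287205\t|\tHoswirudivirus MRV1\t|\t\t|\tscientific name\t|\n",
    "287935\t|\tShomudavirus limadaptatum\t|\t\t|\tscientific name\t|\n",
]

mapping = build_mapping(nodes)
new_nodes = remap_lines(nodes, mapping, columns=(0, 1))  # taxid and parent taxid
new_names = remap_lines(names, mapping, columns=(0,))    # taxid only
```

On real files you would read `nodes.dmp` and `names.dmp` into lists of lines, write the remapped lines back out, and apply the same `mapping` to whatever sequence-to-taxid table you feed to MMseqs2, since the old taxids no longer exist after remapping. This sketch does not touch `merged.dmp` or `delnodes.dmp`.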