taxids created with `create-taxdump` skip numbers

Open apcamargo opened this issue 2 years ago • 2 comments

When you create a taxdump using create-taxdump (ICTV taxonomy, for example), the taxids "skip" some numbers. For example:

$ head ictv-taxdump/names.dmp
1	|	root	|		|	scientific name	|
287205	|	Hoswirudivirus MRV1	|		|	scientific name	|
287935	|	Shomudavirus limadaptatum	|		|	scientific name	|
1096518	|	Sclerotimonavirus betaclarireediae	|		|	scientific name	|
1138752	|	Potato virus H	|		|	scientific name	|
1536674	|	Rhopapillomavirus 1	|		|	scientific name	|
1845995	|	Monomorium pharaonis virus 1	|		|	scientific name	|
1890985	|	Aquamavirus A	|		|	scientific name	|
2079526	|	Hylipavirus	|		|	scientific name	|
2290567	|	Fattrevirus	|		|	scientific name	|

This is not a problem in itself, as the nodes are still connected. However, it causes a bug when you try to create an MMSeqs2 taxonomy database from the custom taxonomy, as MMSeqs2 apparently assumes that numbers are not skipped (unless they are in delnodes.dmp and merged.dmp, I guess).

I wrote a script that remaps the taxids so that no number is skipped, and it solved the issue.

$ head ictv-taxdump/names.dmp
1	|	root	|		|	scientific name	|
2	|	Hoswirudivirus MRV1	|		|	scientific name	|
3	|	Shomudavirus limadaptatum	|		|	scientific name	|
4	|	Sclerotimonavirus betaclarireediae	|		|	scientific name	|
5	|	Potato virus H	|		|	scientific name	|
6	|	Rhopapillomavirus 1	|		|	scientific name	|
7	|	Monomorium pharaonis virus 1	|		|	scientific name	|
8	|	Aquamavirus A	|		|	scientific name	|
9	|	Hylipavirus	|		|	scientific name	|
10	|	Fattrevirus	|		|	scientific name	|
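
A minimal sketch of that kind of remapping, for anyone who wants to see the idea in code. This is not the script linked later in the thread; the function names `build_mapping` and `remap_line` are hypothetical, and it assumes the usual taxdump layout where fields are separated by a tab, a pipe, and a tab, and each line ends with a tab and a pipe. It only covers names.dmp and nodes.dmp; merged.dmp and delnodes.dmp would need the same treatment if they are not empty.

```python
SEP = "\t|\t"   # field separator used in taxdump files
END = "\t|"     # terminator at the end of every taxdump line


def build_mapping(nodes_lines):
    """Map each original taxid to a sequential one (1, 2, 3, ...).

    Assumes the root taxid is 1 and is the smallest taxid, so it stays 1.
    """
    taxids = sorted(
        {int(line.split(SEP)[0]) for line in nodes_lines if line.strip()}
    )
    return {old: new for new, old in enumerate(taxids, start=1)}


def remap_line(line, mapping, columns):
    """Rewrite the taxid fields at the given column indexes of one line."""
    line = line.rstrip("\n")
    body = line[: -len(END)] if line.endswith(END) else line
    fields = body.split(SEP)
    for i in columns:
        fields[i] = str(mapping[int(fields[i])])
    return SEP.join(fields) + END


if __name__ == "__main__":
    # Tiny in-memory example; real files would be read line by line.
    nodes = [
        "1\t|\t1\t|\tno rank\t|",
        "287205\t|\t1\t|\tspecies\t|",
        "287935\t|\t1\t|\tspecies\t|",
    ]
    mapping = build_mapping(nodes)
    for line in nodes:
        # nodes.dmp: column 0 is the taxid, column 1 is the parent taxid
        print(remap_line(line, mapping, columns=(0, 1)))
```

For names.dmp only the first column holds a taxid, so the call would be `remap_line(line, mapping, columns=(0,))`. Sorting the original taxids before numbering keeps the relative order of the entries, which is why the remapped output above lists the same names in the same order.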

This is not a TaxonKit bug in any way. But because MMSeqs2 is pretty popular, I thought it was best to report this here in case anyone else faces the same issue.

apcamargo, May 20 '22 21:05

Yes, NCBI taxonomy uses consecutive numbers too. I guess they have a mapping table to maintain these relationships.

shenwei356, May 21 '22 03:05

For reference, this is the script I used to make the taxids sequential: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py

apcamargo, May 25 '22 00:05