taxonkit
Discrepancy GTDB website and taxdump changelog
Prerequisites
- [x] make sure you're using the latest version by running `taxonkit version`
- [x] read the usage
Describe your issue
- [x] describe the problem
Thanks for developing taxonkit and for sharing the taxdumps! It saves so much trouble.
There was this change in GTDB: R202 "CAG-521" -> R207 "Aphodousia".
I used your latest GTDB taxdump changelog, which shows CAG-521 as DELETE and Aphodousia as NEW. However, I can't find anything that connects the two, i.e. that shows one was renamed to the other.
Following the docs, I run into this:
echo "CAG-521" | taxonkit name2taxid --data-dir $R202 | taxonkit lineage --taxid-field 2 --data-dir $R207
16:22:16.067 [WARN] taxid 1435403146 was deleted
CAG-521 1435403146
I'm not sure whether this is due to the taxdumps or to taxonkit, so I'm posting here.
CAG-521:
cat <(zless gtdb-taxid-changelog.csv.gz | head -n1 | sed 's/,/\t/g') <(zless gtdb-taxid-changelog.csv.gz | grep 'R20[27]' | grep 'CAG-521' | sed 's/,/\t/g') | column -t
taxid version change change-value name rank lineage lineage-taxids
279141433 R207 DELETE CAG-521 sp003543795 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp003543795 609216830;1641076285;329474883;2125578642;1754850155;1435403146;279141433
349671556 R207 DELETE CAG-521 sp900554675 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp900554675 609216830;1641076285;329474883;2125578642;1754850155;1435403146;349671556
494738701 R207 DELETE CAG-521 sp900545335 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp900545335 609216830;1641076285;329474883;2125578642;1754850155;1435403146;494738701
516566981 R207 DELETE CAG-521 sp000437635 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp000437635 609216830;1641076285;329474883;2125578642;1754850155;1435403146;516566981
587147611 R207 DELETE CAG-521 sp900553105 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp900553105 609216830;1641076285;329474883;2125578642;1754850155;1435403146;587147611
602392633 R202 NEW 902388655 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp902388655;902388655 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1756269640;602392633
725664906 R207 DELETE CAG-521 sp900546995 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp900546995 609216830;1641076285;329474883;2125578642;1754850155;1435403146;725664906
747615494 R202 NEW 900754945 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp900544345;900754945 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1008765200;747615494
825111592 R202 NEW 900765595 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp902388655;900765595 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1756269640;825111592
1008765200 R207 DELETE CAG-521 sp900544345 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp900544345 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1008765200
1251617747 R207 DELETE CAG-521 sp002329575 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp002329575 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1251617747
1435403146 R207 DELETE CAG-521 genus Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521 609216830;1641076285;329474883;2125578642;1754850155;1435403146
1756269640 R202 NEW CAG-521 sp902388655 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp902388655 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1756269640
1756269640 R207 DELETE CAG-521 sp902388655 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;CAG-521;CAG-521 sp902388655 609216830;1641076285;329474883;2125578642;1754850155;1435403146;1756269640
Aphodousia:
cat <(zless gtdb-taxid-changelog.csv.gz | head -n1 | sed 's/,/\t/g') <(zless gtdb-taxid-changelog.csv.gz | grep 'R20[27]' | grep 'Aphodousia' | sed 's/,/\t/g') | column -t
taxid version change change-value name rank lineage lineage-taxids
13156977 R207 CHANGE_LIN_TAX 900544315 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp002329575;900544315 609216830;1641076285;329474883;2125578642;1754850155;1577673191;465580961;13156977
101047054 R207 CHANGE_LIN_TAX 003543795 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp003543795;003543795 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1688730210;101047054
159241665 R207 NEW 018714185 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia gallistercoris;018714185 609216830;1641076285;329474883;2125578642;1754850155;1577673191;262228660;159241665
255288910 R207 NEW Aphodousia sp017383055 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp017383055 609216830;1641076285;329474883;2125578642;1754850155;1577673191;255288910
262228660 R207 NEW Aphodousia gallistercoris species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia gallistercoris 609216830;1641076285;329474883;2125578642;1754850155;1577673191;262228660
265694794 R207 NEW Aphodousia sp900546995 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900546995 609216830;1641076285;329474883;2125578642;1754850155;1577673191;265694794
366289909 R207 NEW 905204555 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecalis;905204555 609216830;1641076285;329474883;2125578642;1754850155;1577673191;626891884;366289909
394452769 R207 NEW Aphodousia secunda_A species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia secunda_A 609216830;1641076285;329474883;2125578642;1754850155;1577673191;394452769
404286898 R207 CHANGE_LIN_TAX 900544345 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900544345;900544345 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1354447789;404286898
420315642 R207 CHANGE_LIN_TAX 000437635 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecalis;000437635 609216830;1641076285;329474883;2125578642;1754850155;1577673191;626891884;420315642
465580961 R207 NEW Aphodousia sp002329575 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp002329575 609216830;1641076285;329474883;2125578642;1754850155;1577673191;465580961
506319002 R207 NEW Aphodousia sp900545335 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900545335 609216830;1641076285;329474883;2125578642;1754850155;1577673191;506319002
599325129 R207 NEW Aphodousia sp905201055 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp905201055 609216830;1641076285;329474883;2125578642;1754850155;1577673191;599325129
602392633 R207 CHANGE_LIN_TAX 902388655 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp902388655;902388655 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1800285846;602392633
609241997 R207 NEW 017646335 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp017646335;017646335 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1665428462;609241997
612589562 R207 CHANGE_LIN_TAX 900546995 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900546995;900546995 609216830;1641076285;329474883;2125578642;1754850155;1577673191;265694794;612589562
626891884 R207 NEW Aphodousia faecalis species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecalis 609216830;1641076285;329474883;2125578642;1754850155;1577673191;626891884
663056101 R207 NEW 905206345 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp905206345;905206345 609216830;1641076285;329474883;2125578642;1754850155;1577673191;2119934576;663056101
732865391 R207 NEW Aphodousia faecigallinarum species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecigallinarum 609216830;1641076285;329474883;2125578642;1754850155;1577673191;732865391
747615494 R207 CHANGE_LIN_TAX 900754945 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900544345;900754945 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1354447789;747615494
825111592 R207 CHANGE_LIN_TAX 900765595 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp902388655;900765595 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1800285846;825111592
1023245325 R207 CHANGE_LIN_TAX 900554675 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900554675;900554675 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1477125710;1023245325
1024674004 R207 NEW 905187975 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900545335;905187975 609216830;1641076285;329474883;2125578642;1754850155;1577673191;506319002;1024674004
1044151090 R207 NEW 905197765 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900546995;905197765 609216830;1641076285;329474883;2125578642;1754850155;1577673191;265694794;1044151090
1116976797 R207 NEW 017500925 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp017383055;017500925 609216830;1641076285;329474883;2125578642;1754850155;1577673191;255288910;1116976797
1247819496 R207 NEW 905212135 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp002329575;905212135 609216830;1641076285;329474883;2125578642;1754850155;1577673191;465580961;1247819496
1317984236 R207 NEW 018712705 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecalis;018712705 609216830;1641076285;329474883;2125578642;1754850155;1577673191;626891884;1317984236
1321012077 R207 CHANGE_LIN_TAX 002329575 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp002329575;002329575 609216830;1641076285;329474883;2125578642;1754850155;1577673191;465580961;1321012077
1335180995 R207 NEW Aphodousia faecipullorum species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecipullorum 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1335180995
1354447789 R207 NEW Aphodousia sp900544345 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900544345 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1354447789
1477125710 R207 NEW Aphodousia sp900554675 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900554675 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1477125710
1577673191 R207 NEW Aphodousia genus Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia 609216830;1641076285;329474883;2125578642;1754850155;1577673191
1651854969 R207 NEW 905212345 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp905212345;905212345 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1703059417;1651854969
1665428462 R207 NEW Aphodousia sp017646335 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp017646335 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1665428462
1688730210 R207 NEW Aphodousia sp003543795 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp003543795 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1688730210
1703059417 R207 NEW Aphodousia sp905212345 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp905212345 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1703059417
1800285846 R207 NEW Aphodousia sp902388655 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp902388655 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1800285846
1808979396 R207 NEW 905201055 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp905201055;905201055 609216830;1641076285;329474883;2125578642;1754850155;1577673191;599325129;1808979396
1827248664 R207 NEW 018714205 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecipullorum;018714205 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1335180995;1827248664
1832179602 R207 NEW 016901835 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia secunda_A;016901835 609216830;1641076285;329474883;2125578642;1754850155;1577673191;394452769;1832179602
1834432164 R207 NEW 018714755 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecigallinarum;018714755 609216830;1641076285;329474883;2125578642;1754850155;1577673191;732865391;1834432164
1859933766 R207 NEW 905198185 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900544345;905198185 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1354447789;1859933766
1927048762 R207 CHANGE_LIN_TAX 900544925 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia faecalis;900544925 609216830;1641076285;329474883;2125578642;1754850155;1577673191;626891884;1927048762
1927253407 R207 NEW 017383055 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp017383055;017383055 609216830;1641076285;329474883;2125578642;1754850155;1577673191;255288910;1927253407
1949292207 R207 CHANGE_LIN_TAX 900553105 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900553105;900553105 609216830;1641076285;329474883;2125578642;1754850155;1577673191;2063973024;1949292207
1960767925 R207 NEW 905196825 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp902388655;905196825 609216830;1641076285;329474883;2125578642;1754850155;1577673191;1800285846;1960767925
2033651867 R207 CHANGE_LIN_TAX 900545335 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900545335;900545335 609216830;1641076285;329474883;2125578642;1754850155;1577673191;506319002;2033651867
2063973024 R207 NEW Aphodousia sp900553105 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp900553105 609216830;1641076285;329474883;2125578642;1754850155;1577673191;2063973024
2119934576 R207 NEW Aphodousia sp905206345 species Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia sp905206345 609216830;1641076285;329474883;2125578642;1754850155;1577673191;2119934576
2146281175 R207 NEW 018715245 no rank Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Burkholderiaceae;Aphodousia;Aphodousia secunda_A;018715245 609216830;1641076285;329474883;2125578642;1754850155;1577673191;394452769;2146281175
Thanks for the feedback.
I am very glad to see that GTDB has a https://gtdb.ecogenomic.org/taxon-history page!
`taxonkit taxid-changelog` was first designed for NCBI taxonomy, where changes are more continuous and not as drastic as in GTDB. So some results are not satisfying; I'm sorry about this.
I've checked the source code and also some records, such as a g__CAG-521 species. I do think I should revise the command at some point, after finishing my current work.
Thanks a lot for looking into this already.
So do you see this as a problem in the `taxid-changelog` command, or in the taxdumps and the `lineage` command? Would this change be correctly picked up by `lineage` if it were documented differently in the taxdumps, or would this command never resolve it in any case?
`lineage` works fine. It's just `taxid-changelog` that did not handle some edge cases appropriately.
As for every single version of the GTDB taxonomy, it's correct and there's no known issue. Only the `deleted.dmp` and `merged.dmp` files are not perfect, and most tools do not use them.
I just released a new version of gtdb-taxdump, which has better support for duplicated names with different ranks. The taxids have also changed completely (not related to this issue).
(I came back to this issue before the new release of taxonkit.)
I was wondering if I could improve it. The answer is no for now. In NCBI taxonomy, the TaxIds are stable, so I can directly check whether a taxon name has changed by comparing the names in two adjacent versions. For GTDB taxonomy, however, I generate TaxIds from the hash value of:
- before v0.16.0: the taxon name
- after v0.16.0: rank+taxon_name
A renamed taxon therefore gets a new, unrelated TaxId, so it's hard to detect renaming events for GTDB taxonomy.
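To make this concrete, here is a minimal shell sketch. It is not taxonkit's implementation: `toy_taxid` is a hypothetical helper, and POSIX `cksum` (CRC-32) merely stands in for whatever hash `create-taxdump` actually uses. The way rank and name are joined is also an assumption. It only illustrates that an ID derived from the name carries no link between an old name and a new one, and that adding the rank to the hashed string changes the ID too.

```shell
# Toy illustration only: cksum (CRC-32) is an assumed stand-in hash,
# not the hash function taxonkit uses.

# toy_taxid STRING: print a numeric ID derived from the string.
toy_taxid() {
    printf '%s' "$1" | cksum | awk '{ print $1 }'
}

# Same name, same ID: stable only as long as the name is unchanged.
echo "CAG-521         -> $(toy_taxid 'CAG-521')"
# A renamed genus hashes to an unrelated ID; nothing ties it to the old one.
echo "Aphodousia      -> $(toy_taxid 'Aphodousia')"
# Hashing rank+name (joined here with a space, by assumption) differs again.
echo "genus CAG-521   -> $(toy_taxid 'genus CAG-521')"
```

Comparing names between two releases by TaxId, as is possible for NCBI, has nothing to hold on to here, because the ID itself changes whenever the name does.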
But if we check the change history of an assembly, it's OK: it shows `CHANGE_LIN_TAX`, meaning there are big changes.
$ grep GCA_003543795.1 gtdb-taxdump/R214/taxid.map
GCA_003543795.1 60618853
$ zcat gtdb-taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 60618853 \
| csvtk cut -f -change-value,-lineage-taxids \
| csvtk pretty -W 40 -x ";" -S light
┌----------┬---------┬----------------┬-----------┬---------┬------------------------------------------┐
| taxid | version | change | name | rank | lineage |
├==========┼=========┼================┼===========┼=========┼==========================================┤
| 60618853 | R089 | NEW | 003543795 | no rank | Bacteria;Proteobacteria; |
| | | | | | Gammaproteobacteria;Burkholderiales; |
| | | | | | Burkholderiaceae;CAG-521; |
| | | | | | CAG-521 sp003543795;003543795 |
├----------┼---------┼----------------┼-----------┼---------┼------------------------------------------┤
| 60618853 | R207 | CHANGE_LIN_TAX | 003543795 | no rank | Bacteria;Proteobacteria; |
| | | | | | Gammaproteobacteria;Burkholderiales; |
| | | | | | Burkholderiaceae;Aphodousia; |
| | | | | | Aphodousia sp003543795;003543795 |
├----------┼---------┼----------------┼-----------┼---------┼------------------------------------------┤
| 60618853 | R214 | CHANGE_LIN_TAX | 003543795 | no rank | Bacteria;Pseudomonadota; |
| | | | | | Gammaproteobacteria;Burkholderiales; |
| | | | | | Burkholderiaceae_A;Aphodousia; |
| | | | | | Aphodousia sp003543795;003543795 |
└----------┴---------┴----------------┴-----------┴---------┴------------------------------------------┘
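The same assembly-level view can be turned into a rough genus mapping. The sketch below is not a taxonkit feature: `genus_moves` is a hypothetical helper that reads the changelog CSV, keeps only assembly rows (rank `no rank`), and counts the genera that assemblies first seen under a given genus later move to. It assumes the column order of the changelog header (taxid, version, change, change-value, name, rank, lineage, lineage-taxids), that the rows of one taxid appear in version order, and that no field contains a quoted comma (plain `awk -F,` would mis-split those).

```shell
# genus_moves OLD_GENUS: read a taxid-changelog CSV on stdin and print
# "OLD_GENUS<TAB>new_genus<TAB>number_of_assemblies" for assemblies that
# were first listed under OLD_GENUS and later got a CHANGE_LIN_TAX event
# placing them under a different genus.
genus_moves() {
    awk -F',' -v old="$1" '
        NR == 1          { next }             # skip the header line
        $6 != "no rank"  { next }             # keep assembly (leaf) rows only
        {
            n = split($7, lin, ";")           # lineage: ...;genus;species;assembly
            genus = lin[n - 2]
            if (genus == old) { seen[$1] = 1; next }
            if (($1 in seen) && $3 == "CHANGE_LIN_TAX") {
                moved[genus]++
                delete seen[$1]               # count each assembly only once
            }
        }
        END { for (g in moved) printf "%s\t%s\t%d\n", old, g, moved[g] }
    '
}

# Usage (file name as used earlier in this thread):
#   zcat gtdb-taxid-changelog.csv.gz | genus_moves CAG-521
```

This only works because assembly names, and hence their TaxIds, stay the same across releases while the genus above them changes.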
I also added notes to `taxid-changelog`:
$ taxonkit taxid-changelog -h
Create TaxId changelog from dump archives
Attention:
1. This command was originally designed for NCBI taxonomy, where the the TaxIds are stable.
2. For other taxonomic data created by "taxonkit create-taxdump", e.g., GTDB-taxdump,
some change events might be wrong, because
a) There would be dramatic changes between the two versions.
b) Different taxons in multiple versions might have the same TaxIds, because we only
check and eliminate taxid collision within a single version.
So a single version of taxonomic data created by "taxonkit create-taxdump" has no problem,
it's just the changelog might not be perfect.
Note in `create-taxdump`:
3. We only check and eliminate taxid collision within a single version of taxonomy data.
Therefore, if you create taxid-changelog with "taxid-changelog", different taxons
in multiple versions might have the same TaxIds and some change events might be wrong.
So a single version of taxonomic data created by "taxonkit create-taxdump" has no problem,
it's just the changelog might not be perfect.
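One way to spot rows affected by this caveat is to look for TaxIds that appear with more than one name or rank in the changelog. The following is a rough sketch, not part of taxonkit: `taxids_with_multiple_names` is a hypothetical helper, and it assumes the same column order as the changelog header and no quoted commas inside fields. A hit only flags a TaxId for manual inspection; it cannot tell a cross-version collision apart from any other cause.

```shell
# taxids_with_multiple_names: read a taxid-changelog CSV on stdin and print
# "taxid<TAB>rank:name | rank:name ..." for every taxid that occurs with
# more than one distinct (name, rank) pair across versions.
taxids_with_multiple_names() {
    awk -F',' '
        NR == 1 { next }                      # skip the header line
        {
            key = $1 SUBSEP $5 SUBSEP $6      # taxid + name + rank
            if (key in seen) next             # already recorded this pair
            seen[key] = 1
            count[$1]++
            names[$1] = (names[$1] == "" ? "" : names[$1] " | ") $6 ":" $5
        }
        END {
            for (t in count)
                if (count[t] > 1) printf "%s\t%s\n", t, names[t]
        }
    '
}

# Usage (file name as used earlier in this thread):
#   zcat gtdb-taxid-changelog.csv.gz | taxids_with_multiple_names
```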