taxpasta
[BUG] MetaPhlAn 4 output with duplicate clade tax id is not supported
Is there an existing issue for this?
- [X] I have searched the existing issues
Problem description
As the title says. This report is forwarded from https://github.com/nf-core/taxprofiler/issues/396.
The MetaPhlAn 4 output files I'm using are in these gists, in case they help: 2612_se_metaphlan4-db.metaphlan_profile.txt and 2613_se_metaphlan4-db.metaphlan_profile.txt
I think the error is caused by the duplicated tax ID: several clades share the same final ID, 165179, as shown below.
cat 2612_se_metaphlan4-db.metaphlan_profile.txt | grep '165179'
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A 2|976|200643|171549|171552|838|165179 15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C 2|976|200643|171549|171552|838|165179 3.48391
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B 2|976|200643|171549|171552|838|165179 1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F 2|976|200643|171549|171552|838|165179 0.34791
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A|t__SGB1626 2|976|200643|171549|171552|838|165179| 15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C|t__SGB1644 2|976|200643|171549|171552|838|165179| 3.48391 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_TF12_30,k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_AM23_5
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B|t__SGB1613 2|976|200643|171549|171552|838|165179| 1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F|t__SGB1614 2|976|200643|171549|171552|838|165179| 0.34791
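To check a whole profile for this pattern rather than grepping for one ID, something like the following could be used. It is only a sketch: it assumes the profile is tab-separated with the clade name in the first column and the `|`-separated tax ID lineage in the second, as in the output above, and the function name `find_shared_leaf_ids` is made up for illustration (it is not part of taxpasta or MetaPhlAn).

```python
from collections import defaultdict


def find_shared_leaf_ids(lines):
    """Return {leaf tax ID: [clade names]} for IDs used by more than one clade.

    Assumes tab-separated MetaPhlAn-style rows: clade name, then the
    '|'-separated tax ID lineage (hypothetical helper, not a taxpasta API).
    """
    clades_by_id = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a data row
        clade_name, id_lineage = fields[0], fields[1]
        # The leaf ID is the last element of the lineage.
        leaf_id = id_lineage.split("|")[-1]
        # SGB-level (t__) rows end in '|', so their leaf ID is empty.
        if leaf_id:
            clades_by_id[leaf_id].append(clade_name)
    return {tid: names for tid, names in clades_by_id.items() if len(names) > 1}
```

Calling it on the lines of `2612_se_metaphlan4-db.metaphlan_profile.txt` (for example via `open(path)`) should list 165179 together with the `Prevotella_copri_clade_*` names, plus any other IDs affected in the same way.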
I also ran taxpasta standardise on both MetaPhlAn 4 output files. taxpasta runs without error, but the result looks problematic.
taxpasta standardise -p metaphlan -o standard_2612.tsv 2612_se_metaphlan4-db.metaphlan_profile.txt
[02:43:32] WARNING Combining 122 entries with unclassified taxa in the profile. metaphlan_profile_standardisation_service.py:94
INFO Write result to 'standard_2612.tsv'.
In the resulting 'standard_2612.tsv' there are 4 entries with the same tax ID but different counts:
cat standard_2612.tsv | grep '^165179\b'
165179 15157120
165179 3483910
165179 1311970
165179 347910
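As a stopgap on my side, the repeated rows could be collapsed by summing the counts per tax ID before any further processing. The sketch below assumes the standardised file is a tab-separated table with one header row, the tax ID in the first column and the count in the second; both helper names are hypothetical. Whether summing is the right way to treat clades that share one NCBI ID is a separate question, so this is not meant as the proper fix.

```python
import csv
from collections import defaultdict


def read_standardised_tsv(path):
    """Read (tax_id, count) string pairs from a two-column TSV.

    Assumes a single header row; hypothetical helper, not a taxpasta API.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        return [(row[0], row[1]) for row in reader if len(row) >= 2]


def sum_counts_by_tax_id(rows):
    """Collapse repeated tax IDs by adding their counts together."""
    totals = defaultdict(int)
    for tax_id, count in rows:
        totals[tax_id] += int(count)
    return dict(totals)
```

For the four rows above, `sum_counts_by_tax_id` would give a single entry for 165179 with a count of 20300910.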
Code sample
Code run:
taxpasta merge \
-p metaphlan -o metaphlan_metaphlan4-db.tsv --add-name --add-rank --add-lineage --add-id-lineage --add-rank-lineage \
--taxonomy taxdump \
2612_se_metaphlan4-db.metaphlan_profile.txt 2613_se_metaphlan4-db.metaphlan_profile.txt
Traceback:
Traceback is too long, see this gist
At the end it says:
ValueError: Index has duplicate keys: CategoricalIndex([165179], categories=[0,
2, 468, 469, ..., 2003188, 2082587, 2292893, 2887326], ordered=False,
dtype='category', name='taxonomy_id')
Environment
I'm running taxpasta in a local Docker container, using the image quay.io/biocontainers/taxpasta:0.6.1--pyhdfd78af_0
Anything else?
No response