[BUG] MetaPhlAn 4 output with duplicate clade tax id is not supported

Open MajoroMask opened this issue 8 months ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Problem description

As the title says. This report is forwarded from https://github.com/nf-core/taxprofiler/issues/396.

The MetaPhlAn 4 output files I'm using are in these gists, in case they help: 2612_se_metaphlan4-db.metaphlan_profile.txt and 2613_se_metaphlan4-db.metaphlan_profile.txt

I think the duplicated tax ID (shown below) caused the error: several distinct clades share the same species-level tax ID, 165179.

cat 2612_se_metaphlan4-db.metaphlan_profile.txt | grep '165179'
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A        2|976|200643|171549|171552|838|165179       15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C        2|976|200643|171549|171552|838|165179       3.48391
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B        2|976|200643|171549|171552|838|165179       1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F        2|976|200643|171549|171552|838|165179       0.34791
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_A|t__SGB1626     2|976|200643|171549|171552|838|165179|      15.15712
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_C|t__SGB1644     2|976|200643|171549|171552|838|165179|      3.48391 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_TF12_30,k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_sp_AM23_5
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_B|t__SGB1613     2|976|200643|171549|171552|838|165179|      1.31197
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri_clade_F|t__SGB1614     2|976|200643|171549|171552|838|165179|      0.34791
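To check a whole profile for this pattern rather than grepping for one ID, a minimal Python sketch could look like the following. It assumes the profile columns are tab-separated (clade name, tax ID lineage, relative abundance) and that rows whose tax ID lineage ends in an empty field (the `t__` SGB rows) can be skipped. `find_duplicate_taxids` is a hypothetical helper written for this report, not part of taxpasta or MetaPhlAn.

```python
from collections import defaultdict


def find_duplicate_taxids(lines):
    """Return {(rank, leaf_tax_id): [leaf clade names]} for IDs used by more than one clade."""
    seen = defaultdict(list)
    for line in lines:
        # Skip blank lines and '#' header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        clade_name, lineage_ids = fields[0], fields[1]
        leaf_name = clade_name.split("|")[-1]
        leaf_id = lineage_ids.split("|")[-1]
        if not leaf_id:
            # t__ (SGB) rows end with an empty tax ID; ignore them here.
            continue
        rank = leaf_name.split("__", 1)[0]
        seen[(rank, leaf_id)].append(leaf_name)
    return {key: names for key, names in seen.items() if len(names) > 1}


# Rows taken from the profile above, rebuilt with tab separators.
LINEAGE = "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella"
IDS = "2|976|200643|171549|171552|838"
sample = [
    f"{LINEAGE}|s__Prevotella_copri_clade_A\t{IDS}|165179\t15.15712",
    f"{LINEAGE}|s__Prevotella_copri_clade_C\t{IDS}|165179\t3.48391",
    f"{LINEAGE}|s__Prevotella_copri_clade_A|t__SGB1626\t{IDS}|165179|\t15.15712",
]
duplicates = find_duplicate_taxids(sample)
print(duplicates)
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.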

I also ran taxpasta standardise on both MetaPhlAn 4 output files. It runs without error, but the result looks problematic.

taxpasta standardise -p metaphlan -o standard_2612.tsv 2612_se_metaphlan4-db.metaphlan_profile.txt
[02:43:32] WARNING  Combining 122 entries with unclassified taxa in the profile.             metaphlan_profile_standardisation_service.py:94
           INFO     Write result to 'standard_2612.tsv'.

In the resulting 'standard_2612.tsv' I got 4 entries with the same tax ID but different counts:

cat standard_2612.tsv | grep '^165179\b'
165179  15157120
165179  3483910
165179  1311970
165179  347910
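For illustration only, here is a sketch of what collapsing those rows by summing their counts would give. Whether summing is the right behavior for taxpasta is an open question for the maintainers; `collapse_counts` is a hypothetical helper, not an existing taxpasta function.

```python
def collapse_counts(rows):
    """Sum counts over rows that share a taxonomy ID, keeping first-seen order."""
    totals = {}
    for tax_id, count in rows:
        totals[tax_id] = totals.get(tax_id, 0) + count
    return totals


# The four rows reported above for tax ID 165179.
rows = [
    ("165179", 15157120),
    ("165179", 3483910),
    ("165179", 1311970),
    ("165179", 347910),
]
collapsed = collapse_counts(rows)
print(collapsed)  # {'165179': 20300910}
```

Summing loses the distinction between the Prevotella copri clades, so this only shows one way a unique index could be obtained.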

Code sample

Code run:

taxpasta merge \
    -p metaphlan -o metaphlan_metaphlan4-db.tsv --add-name --add-rank --add-lineage --add-id-lineage --add-rank-lineage \
    --taxonomy taxdump \
     \
    2612_se_metaphlan4-db.metaphlan_profile.txt 2613_se_metaphlan4-db.metaphlan_profile.txt

Traceback:

Traceback is too long, see this gist

At the end it says:

ValueError: Index has duplicate keys: CategoricalIndex([165179], categories=[0, 
2, 468, 469, ..., 2003188, 2082587, 2292893, 2887326], ordered=False, 
dtype='category', name='taxonomy_id')

Environment

I'm running taxpasta in a local Docker container that uses quay.io/biocontainers/taxpasta:0.6.1--pyhdfd78af_0

Anything else?

No response

MajoroMask, Oct 08 '23 02:10