Issue with incorrect virus classification using metabuli binning2report function and suggestion for ICTV taxonomy update
Hi Metabuli team,
First of all, I would like to thank you for such an excellent tool; I really enjoy using it. I’m currently using Metabuli to classify my viral metagenome sequences, and I have been using the viral database provided by Metabuli for my analyses. After running the classification, I obtain a report file that looks like this:
98.6127 39657793 39657793 no rank 0 unclassified
1.3873 557902 166 no rank 1 root
1.3787 554456 812 superkingdom 10239 Viruses
1.2988 522339 0 clade 2731341 Duplodnaviria
1.2988 522339 31 kingdom 2731360 Heunggongvirae
1.2938 520316 0 phylum 2731618 Uroviricota
1.2938 520316 14652 class 2731619 Caudoviricetes
0.5720 230022 229876 genus 2843396 Jouyvirus
0.0003 105 0 species 2844245 Jouyvirus ev017
0.0003 105 105 no rank 2847060 Escherichia phage ev017
After generating this report, I attempt to convert it into a Kraken-style format using the metabuli binning2report function. However, during this conversion, I encounter an issue where the output file no longer includes virus classifications but instead focuses solely on bacteria. The output looks like this:
30.75 1051 1051 no rank 0 unclassified
54.13 1850 485 no rank 1 root
39.88 1363 0 no rank 131567 cellular organisms
39.47 1349 230 superkingdom 2 Bacteria
17.50 598 0 phylum 1224 Pseudomonadota
7.72 264 0 class 28211 Alphaproteobacteria
5.47 187 0 order 356 Hyphomicrobiales
4.27 146 0 family 335928 Xanthobacteraceae
4.04 138 67 genus 6 Azorhizobium
2.08 71 71 species 7 Azorhizobium caulinodans
Why does this issue occur? My goal is to convert my report to Kraken format so that I can eventually transform the files into a BIOM (Biological Observation Matrix) format. This would allow me to combine all my reports into a single file, and then I could use the R phyloseq package to generate various statistics from my samples.
Additionally, I noticed there have been previous requests regarding updating the taxonomy to align with ICTV. I came across two resources that might be helpful:
This one explains how to construct NCBI-style taxdump files for the International Committee on Taxonomy of Viruses (ICTV): https://github.com/shenwei356/ictv-taxdump
This other resource provides a tutorial on how to build a protein FASTA database for ICTV (though it is adapted for MMseqs2, it might help in building an ICTV viral database for Metabuli): https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/README.md
Thank you for your time and attention. I hope you have a great start to the week!
Best regards,
Thank you so much for the ICTV resources! They sound very useful.
binning2report was implemented for internal use and hasn't been well maintained.
I didn't know that it was visible to users.
It doesn't convert Metabuli's report to Kraken's report; it converts read-by-read classification results into a Metabuli report.
I wrote "Kraken-style report file" because I thought Metabuli's report follows Kraken's style.
Let me check the BIOM format and see if I can make a module for the conversion you want. Thanks again!
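To illustrate what converting read-by-read classifications into a report involves, here is a minimal Python sketch. It is not Metabuli's implementation: the function name `clade_counts`, the toy parent map, and the read assignments are all made up for illustration, and only the taxonomy IDs are borrowed from the report above. Each read is counted directly at its assigned taxon, and that count is then added to every ancestor up to the root, which gives the two count columns of a Kraken-style report.

```python
from collections import Counter


def clade_counts(read_taxids, parent):
    """Return (direct, clade) read counts per taxonomy ID.

    read_taxids: one assigned taxonomy ID per read (0 = unclassified).
    parent: maps each taxonomy ID to its parent; the root maps to itself.
    """
    direct = Counter(read_taxids)
    clade = Counter()
    for taxid, n_reads in direct.items():
        node = taxid
        while True:
            clade[node] += n_reads
            up = parent.get(node)
            if up is None or up == node:  # reached the root or an unknown ID
                break
            node = up
    return direct, clade


# Hypothetical toy taxonomy: root <- Viruses <- Jouyvirus <- Jouyvirus ev017
# (the real lineage has more intermediate ranks).
toy_parent = {1: 1, 10239: 1, 2843396: 10239, 2844245: 2843396}
# Hypothetical per-read assignments, one taxonomy ID per read.
toy_reads = [2843396, 2843396, 2844245, 10239, 0]

direct, clade = clade_counts(toy_reads, toy_parent)
for taxid in (0, 1, 10239, 2843396, 2844245):
    # Columns: taxonomy ID, clade-level reads, directly assigned reads
    print(taxid, clade[taxid], direct[taxid])
```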
Thank you so much for the tips! I was able to build a viral DB based on ICTV VMR39.2. It's available here: https://hulk.mmseqs.com/jaebeom/vmr39.2/ Could you try it and give feedback? Thanks again :)
Please use this Metabuli version: https://github.com/jaebeom-kim/Metabuli The DB is not compatible with the latest release.
Thanks for the effort. Out of curiosity, I just checked the latest ICTV taxonomy. They now put both SARS-CoV-1 and SARS-CoV-2 under this odd species name, Betacoronavirus pandemicum! I hope you don't adopt the ICTV taxonomy too soon, as it is quite confusing right now...
Thank you very much, Jaebeom. I hope you're having an excellent start to the week, and I appreciate you taking the time to read my comments and share your database. I apologize for the delayed response. On another note, I have downloaded the ICTV VMR39.2 database that you shared.
Currently, I am using this version of Metabuli: Metabuli Version 1.0.8. I have a question: is the version I downloaded different from the one available in this directory? -> https://github.com/jaebeom-kim/Metabuli
Another question: Should I download the nodes.dmp and names.dmp files from here > https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/ or do I need to build my own taxonomy files using TaxonKit and create an ICTV taxdump file?
Thank you again for your help, and I wish you an excellent day!
Best regards
Hi! Your comment was very helpful!
Currently, I am using this version of Metabuli: Metabuli Version 1.0.8. I have a question: is the version I downloaded different from the one available in this directory? -> https://github.com/jaebeom-kim/Metabuli
Yes, please use https://github.com/jaebeom-kim/Metabuli when you try the ICTV database. Sorry for the inconvenience. TaxonKit uses the full range of a 32-bit integer for taxonomy IDs, but Metabuli used only 31 bits, so I made a quick fix in my fork.
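As a rough illustration of the 31-bit versus 32-bit point, the Python sketch below checks which IDs fit in 31 bits and shows what happens when a larger unsigned 32-bit value is reinterpreted as a signed 32-bit integer. The example ID and the helper names are hypothetical, and reading "only 31 bits" as a signed 32-bit integer is an assumption; the actual fix in the fork may look different.

```python
import struct

MAX_31_BIT = 2**31 - 1   # largest value that fits in 31 bits
MAX_UINT32 = 2**32 - 1   # largest unsigned 32-bit value


def fits_in_31_bits(taxid):
    """True if the taxonomy ID can be stored using only 31 bits."""
    return 0 <= taxid <= MAX_31_BIT


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


ncbi_style_id = 10239            # Viruses, taken from the report above
large_id = 3_000_000_000         # hypothetical ID above the 31-bit limit

print(fits_in_31_bits(ncbi_style_id))   # small NCBI-style IDs fit
print(fits_in_31_bits(large_id))        # IDs above 2**31 - 1 do not
print(as_signed_32(large_id))           # wraps around to a negative number
```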
Another question: Should I download the nodes.dmp and names.dmp files from here > https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/ or do I need to build my own taxonomy files using TaxonKit and create an ICTV taxdump file?
You don't need to download any dmp files to try the ICTV database. However, I have shared the dmp files at https://hulk.mmseqs.com/jaebeom/vmr39.2/ictv-taxdump/ just in case.
Thanks again!
https://github.com/jaebeom-kim/Metabuli has moved to https://github.com/jaebeom-kim/Metabuli/tree/taxid
Hi Jaebeom,
I hope you are doing well. I recently read the latest paper from your research group, "BFVD—a large repository of predicted viral protein structures", and I must say it’s an excellent resource for those of us working in metagenomics. https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae1119/7906834?login=false
I was wondering if you are planning to integrate the Big Fantastic Viral Database (BFVD) into the Metabuli software? It would be an incredible feature for virome searches.
On a related note, I have been trying to convert the output formats from Metabuli to the BIOM format so that I can use the phyloseq R package for downstream analyses, such as alpha and beta diversity metrics between my samples. However, I haven’t been successful so far.
I came across a GitHub repository for a tool that converts Kraken2 output reports into BIOM format. I’m not sure if this could be useful in the context of Metabuli outputs: https://github.com/smdabdoub/kraken-biom
Any suggestions or guidance on this would be greatly appreciated.
Thank you for your time, and I look forward to hearing your thoughts!
Best regards
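One possible route to phyloseq, sketched in Python under stated assumptions: parse each Kraken-style report into per-taxon counts and merge the samples into a single taxon-by-sample table written as TSV. This assumes the report is tab-separated with six columns (percentage, clade reads, direct reads, rank, taxonomy ID, name), as the examples at the top of this thread suggest. The function names and sample names are hypothetical, and this is not an official Metabuli or kraken-biom feature.

```python
import csv
import io
import sys


def parse_report(handle):
    """Map taxonomy ID -> (rank, name, direct_reads) for one report."""
    taxa = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        try:
            direct_reads = int(fields[2])
            taxid = int(fields[4])
        except ValueError:
            continue  # skip a header line or a malformed row
        taxa[taxid] = (fields[3].strip(), fields[5].strip(), direct_reads)
    return taxa


def merge_reports(reports):
    """Build a taxon-by-sample table from {sample_name: parsed_report}."""
    samples = sorted(reports)
    labels = {}
    for parsed in reports.values():
        for taxid, (rank, name, _) in parsed.items():
            labels.setdefault(taxid, (rank, name))
    header = ["taxid", "rank", "name"] + samples
    rows = []
    for taxid in sorted(labels):
        rank, name = labels[taxid]
        # A taxon missing from a sample gets a count of 0.
        counts = [reports[s].get(taxid, ("", "", 0))[2] for s in samples]
        rows.append([taxid, rank, name] + counts)
    return header, rows


# sample_a reuses two rows from the report at the top of the thread;
# sample_b is made up purely for illustration.
sample_a = io.StringIO(
    "1.3787\t554456\t812\tsuperkingdom\t10239\t  Viruses\n"
    "0.5720\t230022\t229876\tgenus\t2843396\t          Jouyvirus\n"
)
sample_b = io.StringIO(
    "2.0000\t1000\t40\tsuperkingdom\t10239\t  Viruses\n"
)

header, rows = merge_reports(
    {"sample_a": parse_report(sample_a), "sample_b": parse_report(sample_b)}
)
writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(header)
writer.writerows(rows)
```

The resulting table could be read into R and passed to phyloseq's `otu_table`, or converted with the biom-format tools. The sketch uses directly assigned reads; whether direct or clade-level counts are more appropriate depends on the analysis.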