Feature request: `classify` output to full taxonomy
Amazing tool. So underrated.
Currently the metabuli classify output looks as follows:
1 1A1_1 240787 1020879 0.12028 subspecies 10005960:36536
1 1A1_2 219446 793530 0.0133536 species 219446:3228 10022030:214 10022031:1 10022032:1 10022033:119 10022035:201 10022036:122 10022037:5
1 1A1_3 257505 734130 0.0729156 subspecies 257502:5064 10026039:1551 10026040:10222
How would one change that into:
1A1_1 Bacteria;Bacillota;Clostridia;Eubacteriales
1A1_2 Bacteria;Pseudomonadota
1A1_3 Bacteria;Bacillota;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus
Also, Is there a documentation web site or PDF? What resources can you recommend for learning to use this beast of a tool?
Hi @shiraz-shah the metabuli reports can be used as input for pavian, after do it you can dowload a table with the full taxonomy https://github.com/fbreitwieser/pavian
greetings
Awesome!! Thanks, @Enkabloza. Will try this.
If you know of a more light-weight and non-interactive alternative, do let me know. As I'm trying to automate this step as part of a pipeline.
Thank you for reaching out.
classify has --lineage option.
It prints full taxonomy for each read.
But you have to run classify again,,
We will make a utility command that adds full taxa to existing results.
Thanks, Jaebeom.
A utility command would be extremely useful!
I made classified2full command at my personal fork :) You can try it out!
OK, that's so cool! I'll have a go at it.
Hi @jaebeom-kim @borijoa
I'm currently working on processing the output files from Metabuli to generate tables compatible with Phyloseq and Vegan, but I have some doubts regarding how to handle certain cases in the data.
One of the main issues I'm encountering is how to properly account for TaxIDs labeled as "no rank" that still have an associated k-mer count. Additionally, I've noticed that the same TaxID can appear at different taxonomic levels (like species, genus, or even "no rank") within the same classification file. I'm unsure whether I should sum these counts or treat them separately.
Example from a classification file:
1 22975 2772071 2934 0.0087 no rank 2772071:5 <--
1 23066 2731619 14259 0.0022 class
1 23068 936152 10503 0.0042 species 936152:12
In this case, TaxID 2772071 is labeled as "no rank" but still has a count value of 5. Should I include this in the OTU table?
Another example where the same TaxID appears at different levels:
Classified Read ID TaxID Length Identity Level TaxID:k-mer_count
1 1962 2785079 15876 0.0029 no rank 2785079:20
1 8595 2785079 20100 0.0044 no rank 2785079:22
1 24458 396034 26142 0.0034 species 2785079:22
1 55795 396034 26280 0.0034 species 2785079:22
Here, the TaxID 2785079 appears as "no rank" in some lines and as "species" in others. Should I sum the k-mer counts regardless of the taxonomic level, or should I treat them separately depending on the rank?
My Approach: OTU Table:
I extract the TaxID from the classification output files generated by Metabuli.
I count the occurrences of each TaxID across the samples to build a table where rows correspond to TaxIDs and columns correspond to samples. The cell values represent the frequency of each TaxID within a given sample.
Example:
TaxID Sample1 Sample2 Sample3 Sample4
2785079 42 30 55 20
396034 15 25 30 0
1105171 5 12 8 3
1691958 20 18 25 10
936152 7 6 12 5
2956376 3 9 0 4
Taxonomy Table: I use the rankedlineage.dmp file to obtain the full taxonomy (from species to kingdom) for each TaxID present in the OTU table. This way, I build a Tax Table linking each TaxID to its complete taxonomy.
TaxID Realm Kingdom Phylum Class Order Family Genus Species Name
2846592 Adnaviria Zilligvirae Taleaviricota Tokiviricetes Ligamenvirales Rudiviridae Usarudivirus NA Usarudivirus_SIRV8
1983551 Adnaviria Zilligvirae Taleaviricota Tokiviricetes Ligamenvirales Rudiviridae Usarudivirus Usarudivirus_SIRV8 Sulfolobus_islandicus_rod-shaped_virus_8
Vegan-Compatible Tables: From the OTU table, I generate multiple abundance tables for each taxonomic level (species, genus, family, etc.) to use directly with Vegan for ecological analysis.
BIOM File for Phyloseq: I generate a .json.biom file by combining the OTU and taxonomy tables, making the data easier to analyze in Phyloseq.
Doubts: Should I include "no rank" TaxIDs with associated counts in the OTU table?
When the same TaxID appears at multiple taxonomic levels, should I sum the counts or consider them independently?
Is it appropriate to combine classification reports from multiple samples by merging counts from each sample based on TaxID? Any guidance on these points would be greatly appreciated. Thank you for your help and for developing such an incredible tool!
Best regards
"no rank" is a quite annoying issue.
"no rank" can occur in different levels.
For example, some subspecies/strains/isolates are linked to a species and labeled as "no rank".
I observed "no rank" in other levels, too.
I think when it is not sure which rank to use, they just labeled a taxon as "no rank".
But taxa with "no rank" are still valid taxa.
So for this question,
Should I include "no rank" TaxIDs with associated counts in the OTU table?
I think you can include them.
Classified Read ID TaxID Length Identity Level TaxID:k-mer_count
1 1962 2785079 15876 0.0029 no rank 2785079:20
1 8595 2785079 20100 0.0044 no rank 2785079:22
1 24458 396034 26142 0.0034 species 2785079:22
1 55795 396034 26280 0.0034 species 2785079:22
In this example, 2785079 must be a child taxon of 396034. For 1st and 2nd queries, k-mer matches to 2785079 are found, and they are classified to 2785079, which is a more specific taxon than species. For 3rd and 4th queries, k-mer matches to 2785079 are found, but they are not classified to 2785079, but it's species. It happens because Metabuli tries to avoid making too specific classification when evidence is not enough. The 3rd and 4th queries are longer than 1st and 2nd, so 22 k-mer matches is considered not enough only for 3rd and 4th.
So for this question,
When the same TaxID appears at multiple taxonomic levels, should I sum the counts or consider them independently?
Please consider only the third (TaxID) and sixth (Level) column.
Sorry but I couldn't fully understand what this question means.
Is it appropriate to combine classification reports from multiple samples by merging counts from each sample based on TaxID?
- I have a concern about counting OTUs. Like the example above, 2785079 is a child of 396034. I'm not sure if 2785079 and 396034 should be counted independently. And also there can be classifications at genus and family rank.
@Enkabloza how did you finally settle this? were you finally able to make a phyloseq type of output from this?
Hi @gilmahu I found a tool could help, you can use first taxconverter https://github.com/RasmussenLab/taxconverter and then taxometer to get the taxonomy file https://vamb.readthedocs.io/en/latest/ I hope this will help you, saludos
Hi @Enkabloza, Interesting. Thanks for your quick feedback. let me go through this and see how it works.
Hello @Enkabloza ... just went through the process with the two tools and at the end i realized that the taxonomy file provided by taxconverter (which came from reads classification in metabuli) has to be consistent with the abundance data and the contig names from vamb. did you encounter such a situation and how did you deal with it?
I will definitely bring a solution for the Phyloseq compatibility.
Meanwhile, please try classifiedRefiner for the original issue of this page: classify output to full taxonomy