Metabuli Feature request: `classify` output to full taxonomy

Amazing tool. So underrated.

Currently the metabuli classify output looks as follows:

1	1A1_1	240787	1020879	0.12028	subspecies	10005960:36536 
1	1A1_2	219446	793530	0.0133536	species	219446:3228 10022030:214 10022031:1 10022032:1 10022033:119 10022035:201 10022036:122 10022037:5 
1	1A1_3	257505	734130	0.0729156	subspecies	257502:5064 10026039:1551 10026040:10222

How would one change that into:

1A1_1	Bacteria;Bacillota;Clostridia;Eubacteriales
1A1_2	Bacteria;Pseudomonadota
1A1_3	Bacteria;Bacillota;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus

Also, Is there a documentation web site or PDF? What resources can you recommend for learning to use this beast of a tool?

Mar 18 '25 06:03 shiraz-shah

Hi @shiraz-shah the metabuli reports can be used as input for pavian, after do it you can dowload a table with the full taxonomy https://github.com/fbreitwieser/pavian

greetings

Mar 23 '25 16:03 Enkabloza

Awesome!! Thanks, @Enkabloza. Will try this.

If you know of a more light-weight and non-interactive alternative, do let me know. As I'm trying to automate this step as part of a pipeline.

Mar 30 '25 08:03 shiraz-shah

Thank you for reaching out. classify has --lineage option. It prints full taxonomy for each read. But you have to run classify again,, We will make a utility command that adds full taxa to existing results.

Apr 01 '25 04:04 jaebeom-kim

Thanks, Jaebeom.

A utility command would be extremely useful!

Apr 01 '25 06:04 shiraz-shah

I made classified2full command at my personal fork :) You can try it out!

Apr 03 '25 02:04 borijoa

OK, that's so cool! I'll have a go at it.

Apr 03 '25 09:04 shiraz-shah

Hi @jaebeom-kim @borijoa

I'm currently working on processing the output files from Metabuli to generate tables compatible with Phyloseq and Vegan, but I have some doubts regarding how to handle certain cases in the data.

One of the main issues I'm encountering is how to properly account for TaxIDs labeled as "no rank" that still have an associated k-mer count. Additionally, I've noticed that the same TaxID can appear at different taxonomic levels (like species, genus, or even "no rank") within the same classification file. I'm unsure whether I should sum these counts or treat them separately.

Example from a classification file:

1   22975   2772071  2934   0.0087   no rank   2772071:5  <--  
1   23066   2731619  14259  0.0022   class  
1   23068   936152   10503  0.0042   species   936152:12

In this case, TaxID 2772071 is labeled as "no rank" but still has a count value of 5. Should I include this in the OTU table?

Another example where the same TaxID appears at different levels:

Classified  Read ID   TaxID    Length   Identity    Level      TaxID:k-mer_count
1           1962      2785079  15876    0.0029      no rank    2785079:20
1           8595      2785079  20100    0.0044      no rank    2785079:22
1           24458     396034   26142    0.0034      species    2785079:22
1           55795     396034   26280    0.0034      species    2785079:22

Here, the TaxID 2785079 appears as "no rank" in some lines and as "species" in others. Should I sum the k-mer counts regardless of the taxonomic level, or should I treat them separately depending on the rank?

My Approach: OTU Table:

I extract the TaxID from the classification output files generated by Metabuli.

I count the occurrences of each TaxID across the samples to build a table where rows correspond to TaxIDs and columns correspond to samples. The cell values represent the frequency of each TaxID within a given sample.

Example:

TaxID   Sample1  Sample2  Sample3  Sample4
2785079     42      30      55      20
396034      15      25      30       0
1105171      5      12       8       3
1691958     20      18      25      10
936152       7       6      12       5
2956376      3       9       0       4

Taxonomy Table: I use the rankedlineage.dmp file to obtain the full taxonomy (from species to kingdom) for each TaxID present in the OTU table. This way, I build a Tax Table linking each TaxID to its complete taxonomy.

TaxID    Realm      Kingdom        Phylum         Class        Order          Family         Genus        Species        Name
2846592  Adnaviria  Zilligvirae    Taleaviricota  Tokiviricetes Ligamenvirales Rudiviridae    Usarudivirus NA            Usarudivirus_SIRV8
1983551  Adnaviria  Zilligvirae    Taleaviricota  Tokiviricetes Ligamenvirales Rudiviridae    Usarudivirus Usarudivirus_SIRV8 Sulfolobus_islandicus_rod-shaped_virus_8

Vegan-Compatible Tables: From the OTU table, I generate multiple abundance tables for each taxonomic level (species, genus, family, etc.) to use directly with Vegan for ecological analysis.

BIOM File for Phyloseq: I generate a .json.biom file by combining the OTU and taxonomy tables, making the data easier to analyze in Phyloseq.

Doubts: Should I include "no rank" TaxIDs with associated counts in the OTU table?

When the same TaxID appears at multiple taxonomic levels, should I sum the counts or consider them independently?

Is it appropriate to combine classification reports from multiple samples by merging counts from each sample based on TaxID? Any guidance on these points would be greatly appreciated. Thank you for your help and for developing such an incredible tool!

Best regards

Apr 07 '25 07:04 Enkabloza

"no rank" is a quite annoying issue. "no rank" can occur in different levels. For example, some subspecies/strains/isolates are linked to a species and labeled as "no rank". I observed "no rank" in other levels, too. I think when it is not sure which rank to use, they just labeled a taxon as "no rank". But taxa with "no rank" are still valid taxa. So for this question, Should I include "no rank" TaxIDs with associated counts in the OTU table? I think you can include them.

Classified  Read ID   TaxID    Length   Identity    Level      TaxID:k-mer_count
1           1962      2785079  15876    0.0029      no rank    2785079:20
1           8595      2785079  20100    0.0044      no rank    2785079:22
1           24458     396034   26142    0.0034      species    2785079:22
1           55795     396034   26280    0.0034      species    2785079:22

In this example, 2785079 must be a child taxon of 396034. For 1st and 2nd queries, k-mer matches to 2785079 are found, and they are classified to 2785079, which is a more specific taxon than species. For 3rd and 4th queries, k-mer matches to 2785079 are found, but they are not classified to 2785079, but it's species. It happens because Metabuli tries to avoid making too specific classification when evidence is not enough. The 3rd and 4th queries are longer than 1st and 2nd, so 22 k-mer matches is considered not enough only for 3rd and 4th.

So for this question,

When the same TaxID appears at multiple taxonomic levels, should I sum the counts or consider them independently?

Please consider only the third (TaxID) and sixth (Level) column.

Sorry but I couldn't fully understand what this question means.

Is it appropriate to combine classification reports from multiple samples by merging counts from each sample based on TaxID?

I have a concern about counting OTUs. Like the example above, 2785079 is a child of 396034. I'm not sure if 2785079 and 396034 should be counted independently. And also there can be classifications at genus and family rank.

Apr 07 '25 07:04 jaebeom-kim

@Enkabloza how did you finally settle this? were you finally able to make a phyloseq type of output from this?

May 08 '25 20:05 gilmahu

Hi @gilmahu I found a tool could help, you can use first taxconverter https://github.com/RasmussenLab/taxconverter and then taxometer to get the taxonomy file https://vamb.readthedocs.io/en/latest/ I hope this will help you, saludos

May 08 '25 20:05 Enkabloza

Hi @Enkabloza, Interesting. Thanks for your quick feedback. let me go through this and see how it works.

May 08 '25 21:05 gilmahu

Hello @Enkabloza ... just went through the process with the two tools and at the end i realized that the taxonomy file provided by taxconverter (which came from reads classification in metabuli) has to be consistent with the abundance data and the contig names from vamb. did you encounter such a situation and how did you deal with it?

May 14 '25 21:05 gilmahu

I will definitely bring a solution for the Phyloseq compatibility. Meanwhile, please try classifiedRefiner for the original issue of this page: classify output to full taxonomy

May 29 '25 01:05 jaebeom-kim