
Possible to represent relative abundance of taxonomy rather than OTUs?

Open wwood opened this issue 11 months ago • 1 comments

Hi,

I'm considering adding an option to SingleM to allow BIOM format as an output.

The problem is that I'm not sure what the canonical way to do this is. The --taxonomic-profile output of SingleM currently produces a sparse 3-column TSV like this:

sample	coverage	taxonomy
ERR1914274	3.16	Root; d__Bacteria
ERR1914274	0.06	Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria

So there are no "OTUs" as such (at least for this type of output) - it is just the estimated genome coverage of each lineage (which does not include the coverage of descendant lineages).

I considered using the taxa as the observation IDs, but that left me with 2 questions:

  1. Should the coverage of a taxon include the coverage of its descendants? i.e. in the above, should the entry for Root; d__Bacteria be 3.16 or 3.16+0.06=3.22?
  2. Relatedly, should the implied coverage of missing taxa be recorded? i.e. in the above, should there be an observation recorded for Root; d__Bacteria; p__Pseudomonadota?
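To make the two options concrete, here is a minimal sketch of the "inclusive" alternative, where each taxon's coverage includes its descendants and implied intermediate taxa get their own entries. The function name `roll_up` and the tuple-based lineage representation are illustrative assumptions, not part of SingleM or biom-format.

```python
from collections import defaultdict


def roll_up(own_coverage):
    """Convert per-lineage coverage that excludes descendants into
    coverage that includes them.

    own_coverage maps a lineage tuple, e.g. ("Root", "d__Bacteria"),
    to the coverage assigned directly to that lineage. Every prefix of
    a lineage receives its coverage, so intermediate taxa that were
    absent from the input appear in the result.
    """
    total = defaultdict(float)
    for lineage, coverage in own_coverage.items():
        for depth in range(1, len(lineage) + 1):
            total[lineage[:depth]] += coverage
    return dict(total)


# The two rows from the example profile above (sample ERR1914274).
own = {
    ("Root", "d__Bacteria"): 3.16,
    ("Root", "d__Bacteria", "p__Pseudomonadota", "c__Gammaproteobacteria"): 0.06,
}

rolled = roll_up(own)
for lineage, coverage in sorted(rolled.items()):
    print("; ".join(lineage), round(coverage, 2))
```

Under this option, Root; d__Bacteria carries 3.22 and Root; d__Bacteria; p__Pseudomonadota appears with 0.06, whereas the exclusive option keeps the two input rows unchanged. Either form could be used as the observation values; the choice only needs to be stated so that users know whether summing across observations double-counts.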

Representing abundances of taxa is a pretty common usage (e.g. kraken), but it is complicated by the hierarchical nature of the observations. Bonus points if the schema of the taxonomy can somehow be represented, i.e. is there some way to record that the taxonomies are derived from GTDB R214?

Thanks, ben

wwood avatar Mar 24 '24 07:03 wwood

Hey @wwood! Been a bit, hope you're well :)

It's really up to you how to structure this, and which representation of the data will be most meaningful for users of SingleM. BIOM as a format doesn't care if the entries are hierarchical or not. If you want to encode the taxonomy, it could be done via group metadata -- just represent it as a Newick string. I'm not aware of packages actually using the group metadata though. And QIIME 2 ignores sample and observation metadata anyway, as in that framework those entities fall under the semantic types Metadata and FeatureData[Taxonomy] respectively.
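As a rough illustration of the Newick suggestion, the sketch below turns a set of lineages into a Newick string with labelled internal nodes. The helper `lineages_to_newick` is hypothetical, and whether a given downstream parser accepts quoted labels and single-child nodes is an assumption; attaching the string to a BIOM table's group metadata is not shown.

```python
def lineages_to_newick(lineages):
    """Build a Newick string from lineage tuples such as
    ("Root", "d__Bacteria"). Labels are single-quoted because unquoted
    underscores are read as spaces in Newick."""
    # Nested dicts: each key is a taxon name, each value its children.
    tree = {}
    for lineage in lineages:
        node = tree
        for name in lineage:
            node = node.setdefault(name, {})

    def render(name, children):
        # Escape embedded single quotes by doubling them.
        label = "'" + name.replace("'", "''") + "'"
        if not children:
            return label
        inner = ",".join(render(n, c) for n, c in sorted(children.items()))
        return "(" + inner + ")" + label

    roots = sorted(tree.items())
    if len(roots) == 1:
        return render(*roots[0]) + ";"
    # Several top-level names: wrap them in an unlabelled root.
    return "(" + ",".join(render(n, c) for n, c in roots) + ");"


newick = lineages_to_newick([
    ("Root", "d__Bacteria"),
    ("Root", "d__Bacteria", "p__Pseudomonadota", "c__Gammaproteobacteria"),
])
print(newick)
```

Because every prefix of a lineage becomes a node, taxa that are only implied by the profile (here p__Pseudomonadota) still appear in the tree.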

wasade avatar Mar 24 '24 20:03 wasade

Hi @wwood, I'm closing this issue as I'm unsure how to address the concerns. Please reopen if needed.

wasade avatar May 02 '24 15:05 wasade

Hi @wasade, sorry for the lack of response here. Your reply makes total sense, though I might be lazy and wait for other tools to support it, so there's an established structure to work with.

Congratulations on gg2 btw, we are definitely making use of it, trying to bridge the amplicon-genome gap. A taxonomy update would be most welcome if there is one coming?

wwood avatar May 03 '24 01:05 wwood

Hi @wwood, no worries and thanks! A taxonomy update is in the works. It would be nice to sync up sometime, any chance you could ping me at [email protected]?

wasade avatar May 03 '24 19:05 wasade