qurro Split up taxonomy annotations into multiple columns, one for each level

Open fedarko opened this issue 4 years ago • 4 comments

We're doing this for Empress (https://github.com/biocore/empress/issues/130), and I'm realizing this might be useful to have in Qurro as well. This would mean converting the feature metadata from something like

Feature ID	Taxonomy
asdf	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
ghjk	k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__; s__

into something like

Feature ID	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7
asdf	k__Bacteria	p__Bacteroidetes	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Bacteroides	s__
ghjk	k__Bacteria	p__Proteobacteria	c__Gammaproteobacteria	o__Pasteurellales	f__Pasteurellaceae	g__	s__

... We should be able to do this entirely on the Python side of things. The advantage is that it would allow searching by just genus, etc., which avoids problems where the same string shows up at different levels (e.g. "proteobacteria" being present in both p__Proteobacteria and c__Gammaproteobacteria, not that it makes a huge difference for the above example).
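To make the searching point concrete, here is a rough sketch of the split in plain Python. It assumes taxonomy strings are semicolon-delimited, as in the example above; `split_taxonomy` and `search_level` are hypothetical helper names for illustration, not existing Qurro or Empress functions.

```python
def split_taxonomy(taxonomy, sep=";"):
    """Split one taxonomy string into a list of whitespace-stripped levels."""
    return [level.strip() for level in taxonomy.split(sep)]


# The two example features from the table above.
feature_taxonomy = {
    "asdf": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
            "f__Bacteroidaceae; g__Bacteroides; s__",
    "ghjk": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
            "o__Pasteurellales; f__Pasteurellaceae; g__; s__",
}

# One "Level N" column per taxonomic level, keyed by feature ID.
split_rows = {
    fid: {
        f"Level {i}": name
        for i, name in enumerate(split_taxonomy(tax), start=1)
    }
    for fid, tax in feature_taxonomy.items()
}


def search_level(rows, level, query):
    """Return feature IDs whose value in one level column contains query."""
    q = query.lower()
    return [
        fid for fid, levels in rows.items()
        if q in levels.get(level, "").lower()
    ]
```

With the columns split, `search_level(split_rows, "Level 6", "bacteroides")` only looks at the genus column, so a match in the phylum or class column can't sneak in.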

There are of course some problems with this, for example what happens when features have different numbers of "levels" (which is the case in the MetaPhlAn2 (?) taxonomy information for the Byrd dataset -- e.g. there are Viruses with 4 levels and Bacteria with 7 in the same dataset). But these problems should be surmountable; for this particular one we could, say, "pad" missing levels with nulls or whatever.
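The padding idea could look something like the sketch below. It assumes semicolon-delimited strings and uses `None` as the null; `split_and_pad` is a hypothetical name, and the 4-level virus entry is made-up placeholder data, not taken from the Byrd dataset.

```python
def split_and_pad(feature_taxonomy, sep=";", fill=None):
    """Split taxonomy strings into levels and pad short rows with fill.

    Returns (header, rows): header is ["Level 1", ..., "Level N"], where N
    is the largest number of levels seen; rows maps each feature ID to a
    list of exactly N values.
    """
    split = {
        fid: [level.strip() for level in tax.split(sep)]
        for fid, tax in feature_taxonomy.items()
    }
    n_levels = max((len(levels) for levels in split.values()), default=0)
    header = [f"Level {i}" for i in range(1, n_levels + 1)]
    rows = {
        fid: levels + [fill] * (n_levels - len(levels))
        for fid, levels in split.items()
    }
    return header, rows


# Illustrative input: one 7-level feature and one placeholder 4-level one.
example = {
    "asdf": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
            "f__Bacteroidaceae; g__Bacteroides; s__",
    "virus1": "k__Viruses; p__X; c__Y; o__Z",  # made-up names
}
header, rows = split_and_pad(example)
```

Here `rows["virus1"]` ends up with three trailing `None` values, so every feature has the same seven columns. Whether nulls, empty strings, or something like "Unspecified" is the right filler is a separate design question.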

I'm putting this on the back burner for now while we do this in Empress, but at some point in the future it may be nice to port that work back over here.

Edit: also, now that I think of it, having this for biplots in Emperor could be really nice?

May 25 '20 06:05 fedarko
