empress Project phylogeny up tree if provided

Open kwcantrell opened this issue 3 years ago • 20 comments

It would be nice to have empress automatically label internal nodes with a taxonomy if a user provides a taxonomy .qza file. I thought we had already implemented this feature, but after a discussion with Imran, I realized it is not currently implemented.

Dec 19 '20 02:12 kwcantrell

I'm not sure I understand how this would work as taxonomy != phylogeny.

Dec 19 '20 16:12 gibsramen

Thanks @gibsramen. I meant phylogeny, not taxonomy.

Dec 19 '20 16:12 kwcantrell

In this context, I think taxonomy is actually what you want -- what does it mean to project phylogeny up a phylogeny?

Taxonomic labels may or may not correspond to the estimated phylogenetic relationships, but in the case where there's no discordance (or in the case where discordance < some threshold), it is often nice to be able to inherit some external set of taxonomic labels using a phylogeny. Is that what you're meaning here?

Dec 19 '20 17:12 tanaes

Agree with @tanaes

For a concrete example, here is empress displaying taxonomy labels for a tip: [screenshot]

And this is what it shows when a non-tip node is selected: [screenshot]

Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from it within the phylogeny?

If so, it could make sense to label internal nodes with the lowest common taxonomic ancestor of the nodes that descend from them in the phylogeny. This should be relatively easy to compute if the fields that represent the levels of a taxon are relatively standardized within feature metadata. (Forgive me, I am not super familiar with these q2-types at the moment.)

But it could look something like this (note: this is pseudo-code):

for node in postOrderTraversal(tree):
    if not isLeaf(node):
        for level in node.taxonomy:
            # allSame is true only when every child has the same value at this level
            allSame, value = allChildrenHaveSameValue(node, level)
            if allSame:
                node.taxonomy[level] = value

I would imagine this is similar to what is currently being done for collapsing clades.
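A runnable version of the pseudo-code above might look like the following. This is only a minimal sketch: the `Node` class, the `LEVELS` list, and `project_taxonomy` are hypothetical names invented for illustration, not part of empress. One design choice differs from the pseudo-code: the loop stops at the first level on which the children disagree, so a lower rank is never labeled beneath an unlabeled higher rank.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical level names, ordered from root to tip.
LEVELS = ["kingdom", "phylum", "class"]


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    # Maps level name -> label; a missing level means "unknown / no consensus".
    taxonomy: Dict[str, Optional[str]] = field(default_factory=dict)


def postorder(node):
    """Yield every node in the subtree, children before parents."""
    for child in node.children:
        yield from postorder(child)
    yield node


def project_taxonomy(root, levels=LEVELS):
    """Label each internal node with the levels on which all of its children agree."""
    for node in postorder(root):
        if not node.children:
            continue  # tips keep the taxonomy they were given
        for level in levels:
            values = {child.taxonomy.get(level) for child in node.children}
            # Keep the label only if every child has the same, non-missing value.
            if len(values) == 1 and None not in values:
                node.taxonomy[level] = values.pop()
            else:
                break  # do not label lower ranks beneath a disagreement


t1 = Node("t1", taxonomy={"kingdom": "k__Bacteria", "phylum": "p__Firmicutes", "class": "c__Clostridia"})
t2 = Node("t2", taxonomy={"kingdom": "k__Bacteria", "phylum": "p__Firmicutes", "class": "c__Bacilli"})
t3 = Node("t3", taxonomy={"kingdom": "k__Bacteria", "phylum": "p__Bacteroidetes", "class": "c__Bacteroidia"})
inner = Node("n1", children=[t1, t2])
root = Node("root", children=[inner, t3])
project_taxonomy(root)
```

Here `inner` ends up labeled down to phylum (its tips disagree on class), and `root` is labeled only at kingdom.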

This could also extend to projecting other metadata fields. In general, we would need to be careful of places where it would not make sense to project the field up the tree (like confidence scores).

Dec 19 '20 18:12 gwarmstrong

Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from it within the phylogeny?

If so, it could make sense to label internal nodes with the lowest common taxonomic ancestor of the nodes that descend from them in the phylogeny. This should be relatively easy to compute if the fields that represent the levels of a taxon are relatively standardized within feature metadata. (Forgive me, I am not super familiar with these q2-types at the moment.)

@gwarmstrong that is the idea. Basically, label internal nodes with the lowest common taxonomy of their tips.

Dec 19 '20 18:12 kwcantrell

Related to what @tanaes mentioned, is there literature on what a good threshold would be in this case? We could maybe add an input users could specify, but dunno what a good default would be.
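To make the idea of a user-specified threshold concrete, here is a minimal sketch. `majority_label` and `min_fraction` are hypothetical names, and the 0.9 used below is an arbitrary illustration, not a recommended default.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_label(tip_labels: Iterable[Optional[str]], min_fraction: float = 1.0) -> Optional[str]:
    """Return the most common label among descendant tips if it reaches `min_fraction`.

    `min_fraction=1.0` reproduces the strict "all tips must agree" rule.
    Tips without a label (None) count against the consensus.
    """
    labels = list(tip_labels)
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None  # no tips, or no labeled tips
    label, count = counts.most_common(1)[0]
    return label if count / len(labels) >= min_fraction else None


tips = ["g__Clostridium"] * 9 + ["g__Dorea"]
strict = majority_label(tips)                     # all tips must agree
relaxed = majority_label(tips, min_fraction=0.9)  # tolerate 10% discordance
```

With nine of ten tips agreeing, the strict rule leaves the node unlabeled while the relaxed one assigns `g__Clostridium`.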

Dec 19 '20 18:12 gibsramen

I guess I am not familiar enough with how the taxonomy is calculated to properly comment on this. But I would assume that the taxonomic level of internal nodes would match the lowest shared taxonomic level of its tips.

Dec 19 '20 19:12 kwcantrell

Worth taking a look at Tax2Tree from our very own @wasade!

Dec 19 '20 19:12 tanaes

For 16S taxonomy classification, taking a peek at Tax2Tree as @tanaes mentioned, as well as https://github.com/qiime2/q2-feature-classifier, should yield some answers. IIRC, aligning the ASV/OTU/etc. sequence against a reference, or using some other method such as Naive Bayes to estimate the probability that a sequence is from a specific taxon, are different ways one can classify taxonomy.

For metagenomics, you could take a look at woltka, kraken2, metaphlan2 just for starters on the myriad ways that taxonomy is calculated, all with their own metrics on what constitutes a "good" hit for taxonomy.

However, Empress is sequence/technology agnostic. So anything that estimates the taxonomy of some internal node using the sequence features is probably off the table (and should be, because this makes generalizing across 16S and metagenomics more difficult, or even across methods within a given sequencing technology).

I think the most general thing we could do here is expose the same feature projections used for feature metadata clade collapsing.

Dec 19 '20 19:12 gwarmstrong

q2-feature-classifier won't place internal node labels. tax2tree will place labels on internal nodes and handles contention in placements. Its inputs are a phylogeny and a file containing tip -> lineage strings, and it is agnostic to 16S/WGS. A visual example of the algorithm can be seen here

Dec 21 '20 16:12 wasade

...it will be more robust than the feature metadata clade collapsing. LCA does not work well for this, and getting the nesting of taxonomy ranks correct on placement can get tedious.

Dec 21 '20 16:12 wasade

So it seems like tax2tree differs from LCA in that tax2tree (feel free to confirm/deny):

- Does not require all descendants of an internal node to share the same label at a given level (e.g., an internal node could be assigned to g__Clostridium, even if it has some small proportion of descendants from g__Dorea).
- Can use newly labeled internal nodes to label unlabeled descendants (known as back-filling).

I think this raises some important points:

- Does empress already allow for internal node labels to be provided by feature metadata? e.g., if a user has a method for labeling internal nodes, can they supply these labels?
  - If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?
- For any candidate method, if it supplements the information provided by the user, how do we send a message back to the user that helps them differentiate between information they provided, and the inferred information?

Dec 21 '20 18:12 gwarmstrong

It computes an F-measure based on the observed names that descend from a node, relative to the full tree. It can replicate names if needed, as is necessary for polyphyletic groups like Clostridia, or can just place a name singularly based on the maximum F-score.
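As a rough illustration of scoring candidate nodes by F-measure, here is a sketch using the standard F1 formula. The precision and recall definitions, the function names, and the counts are all assumptions made for this example; this is not tax2tree's actual code.

```python
def f_measure(name_under_node: int, labeled_under_node: int, name_in_tree: int) -> float:
    """F1 score for placing a name on a node.

    precision: of the labeled tips below the node, the fraction carrying the name
    recall:    of all tips in the tree carrying the name, the fraction below the node
    """
    if name_under_node == 0:
        return 0.0
    precision = name_under_node / labeled_under_node
    recall = name_under_node / name_in_tree
    return 2 * precision * recall / (precision + recall)


def best_node(candidates: dict, name_in_tree: int) -> str:
    """Pick the candidate node with the maximum F-score.

    `candidates` maps node id -> (tips with the name below it, labeled tips below it).
    """
    return max(
        candidates,
        key=lambda node: f_measure(candidates[node][0], candidates[node][1], name_in_tree),
    )


# Hypothetical counts for a name carried by 10 tips in the whole tree.
candidates = {
    "small_clade": (6, 6),   # pure, but misses 4 of the 10 tips
    "mid_clade": (9, 10),    # one foreign tip, captures 9 of the 10
    "big_clade": (10, 40),   # captures every tip, but is mostly other names
}
choice = best_node(candidates, name_in_tree=10)
```

In this toy case the score favors `mid_clade`, which tolerates one discordant tip rather than demanding a perfectly pure clade.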

Backfilling is different from labeling unlabeled descendants. An unlabeled descendant's lineage is based on the observed taxa names in the path from tip -> root. Importantly, the re-labeling may change the original descendant's lineage, and this is a good thing as taxonomy != phylogeny and, particularly for reference databases, the lineages applied to input records may be incorrect.

Backfilling is used to recover gaps that may arise. For example, if you have an internal node labeled "c__Clostridia", and between it and the root, there is "d__Bacteria" but no phylum name, then we have a gap in the taxonomy. It does not make sense to have a domain and class name without a phylum name. The input lineage information can be used to reconcile this, assuming the input taxonomy is rational. In this example, we can safely infer that "p__Firmicutes" exists in that path as "c__Clostridia" are nested within "p__Firmicutes" (...unless the input taxonomy suggests otherwise...). However, we cannot determine what the correct node for "p__Firmicutes" is; as such, the most conservative placement is chosen, which is the node already containing the "c__Clostridia" label.
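The Clostridia example can be sketched as follows. This is a simplified illustration under stated assumptions, not tax2tree's implementation: `backfill`, `parent_of`, and the `RANKS` list are hypothetical, and names are assumed to carry a rank prefix such as `c__`.

```python
RANKS = ["d", "p", "c", "o", "f", "g", "s"]  # rank prefixes, root to tip


def backfill(node_names, ancestor_names, parent_of):
    """Fill rank gaps between a node's names and the names on its path to the root.

    node_names:     names already placed on the node, e.g. ["c__Clostridia"]
    ancestor_names: names placed on nodes between this node and the root
    parent_of:      child name -> parent name, derived from the input lineages
    Missing names are added to the node itself (the most conservative placement).
    """
    if not node_names:
        return []
    seen = set(ancestor_names) | set(node_names)
    filled = list(node_names)
    # Start from the highest-ranked name on the node and walk up the input taxonomy.
    current = min(node_names, key=lambda n: RANKS.index(n.split("__")[0]))
    while current in parent_of:
        current = parent_of[current]
        if current in seen:
            break  # reached a name already on the path, so the gap is closed
        filled.insert(0, current)
        seen.add(current)
    return filled


parent_of = {"c__Clostridia": "p__Firmicutes", "p__Firmicutes": "d__Bacteria"}
result = backfill(["c__Clostridia"], ["d__Bacteria"], parent_of)
```

The node labeled `c__Clostridia` also receives `p__Firmicutes`, because the input lineages say the class is nested in that phylum and no node on the path to the root carries a phylum name.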

Dec 21 '20 18:12 wasade

Does empress already allow for internal node labels to be provided by feature metadata? e.g., if a user has a method for labeling internal nodes, can they supply these labels?

Yes, the feature metadata input to Empress can refer to internal nodes or to tips. More frequently it refers exclusively to tips.

https://github.com/biocore/empress/blob/d0a46edfcc1341a036c14776a291f583a090d7eb/empress/core.py#L214

If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?

Good call. An implementation of any solution to this problem should account for existing metadata and only offer this "convenience" method when there is no internal node metadata. For example, you can picture a situation where the "bare" internal node view is shown to the user with an option to "infer metadata from descendants". Clicking on a control like that should then infer the metadata, style the resulting values in a different color, and show a warning that explains why they might want to exercise caution.
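One way to keep user-supplied and inferred values distinguishable is to carry a source tag alongside each value. The sketch below is purely illustrative: `with_inferred` and the dictionary layout are hypothetical and do not reflect how empress stores feature metadata.

```python
def with_inferred(user_supplied: dict, inferred: dict) -> dict:
    """Use inferred metadata only for nodes the user gave no metadata for.

    Both arguments map node id -> {field: value}. The result maps
    node id -> {field: (value, source)}, where source is "user" or "inferred",
    so a front end could style inferred values differently.
    """
    merged = {}
    for node_id in user_supplied.keys() | inferred.keys():
        if user_supplied.get(node_id):
            source, fields = "user", user_supplied[node_id]
        else:
            source, fields = "inferred", inferred.get(node_id, {})
        merged[node_id] = {name: (value, source) for name, value in fields.items()}
    return merged


user_supplied = {"n1": {"Level 2": "p__Firmicutes"}}
inferred = {
    "n1": {"Level 2": "p__Bacteroidetes"},  # conflicts with the user's value
    "n2": {"Level 1": "k__Bacteria"},
}
merged = with_inferred(user_supplied, inferred)
```

Node `n1` keeps the user's value and the conflicting inference is dropped, while `n2` is filled in and tagged as inferred.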

Thanks for chiming in everyone, this is very helpful! 🌳

Dec 21 '20 22:12 ElDeveloper

Just popping in (agreed with @ElDeveloper, this is an awesome discussion :D) --

If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?

The default in EMPress' feature metadata coloring / clade collapsing is only respecting the feature metadata provided for the tips. It's possible to color by internal nodes' feature metadata, but doing this turns off the "propagation" of shared feature metadata up the tree, ensuring that conflicts are handled explicitly.

Default (don't use internal node feature metadata, and do "propagation"):

Allow coloring by internal node feature metadata, but disable "propagation":

Whatever solution(s) we end up going with for this, I agree that we should liberally show warnings that inferring things in this way is just an approximation and not the ground truth.

As a sidenote: this discussion brings up the mildly wonky point that, currently, EMPress treats each feature metadata field (including the various levels of taxonomy) as its own independent thing, ignoring other metadata fields. This means that, for example, if you color by Level 7 (species) in a 16S dataset using the default QIIME color map, you'll probably see a lot of clades of the tree colored as red due to all of the tips in the clade sharing a species classification of s__, even if they're from different genera/families/etc:

[screenshot]

Addressing this would definitely be possible, by for example representing the values in each Level N string as the full taxonomy to that point (e.g. setting Level 7 to k__Bacteria; p__Firmicutes; c__Somecoolclass; o__Ogeezimrunningoutoftaxonomynamesiknow; f__Isanyonereadingthis; g__Himom; s__ instead of just s__) -- in some ways this is similar to a point @antgonza raised a few weeks ago in #422.
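A minimal sketch of that idea, assuming lineages are semicolon-delimited strings (`cumulative_levels` is a hypothetical helper, not an existing empress function):

```python
def cumulative_levels(lineage: str, sep: str = ";") -> list:
    """Split a lineage string into one cumulative prefix per level."""
    parts = [part.strip() for part in lineage.split(sep) if part.strip()]
    return ["; ".join(parts[: i + 1]) for i in range(len(parts))]


a = cumulative_levels(
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; "
    "f__Lachnospiraceae; g__Dorea; s__"
)
b = cumulative_levels(
    "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
    "f__Bacteroidaceae; g__Bacteroides; s__"
)
```

Both lineages end in a bare `s__`, but their Level 7 values (`a[-1]` and `b[-1]`) now differ because each carries the full path, while their Level 1 values still match.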

Dec 22 '20 02:12 fedarko

Right, s__ is effectively null, so s__ != s__

Dec 22 '20 02:12 wasade

It can also be a problem with "real" names, unfortunately -- @lisa55asil brought this up in the context of Qurro a while back, there's fun stuff like P. gingivalis and H. gingivalis...

Dec 22 '20 02:12 fedarko

Yes, you definitely want to use full taxonomy strings (or equivalent) in this scenario!

Well-defined taxonomies, like the NCBI taxonomy, have identifiers assigned to each unique taxon level name that are probably what you want to use for this purpose. Having the capacity to handle an explicit external taxonomy in this way will probably enable all sorts of other useful applications.

Dec 22 '20 13:12 tanaes

The species names should use genus / species to account for these scenarios. It should not be a problem for other portions of the taxonomy, unless the taxonomy is malformed. It would be crazy for c__Clostridia to associate with both p__Firmicutes and p__Bacteroidetes, for example. tax2tree tests and requires that the input taxonomy is actually a tree, so this scenario should already be protected against.
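A check of this kind can be sketched by verifying that every name has exactly one parent. This is an illustrative stand-in, not tax2tree's validation code; `find_conflicts` is a hypothetical helper and lineages are assumed to be semicolon-delimited.

```python
def find_conflicts(lineages, sep=";"):
    """Return names that appear under more than one parent.

    An empty result means every name has a single parent, i.e. the taxonomy
    forms a tree. Empty placeholders such as "s__" end the lineage.
    """
    parents = {}
    for lineage in lineages:
        names = [n.strip() for n in lineage.split(sep) if n.strip()]
        parent = None  # the top-level name has no parent
        for name in names:
            if name.endswith("__"):
                break  # nothing at or below an empty rank is informative
            parents.setdefault(name, set()).add(parent)
            parent = name
    return {name: ps for name, ps in parents.items() if len(ps) > 1}


good = [
    "d__Bacteria; p__Firmicutes; c__Clostridia",
    "d__Bacteria; p__Firmicutes; c__Bacilli",
]
bad = good + ["d__Bacteria; p__Bacteroidetes; c__Clostridia"]
good_conflicts = find_conflicts(good)
bad_conflicts = find_conflicts(bad)
```

The malformed set is flagged because `c__Clostridia` appears under two different phyla. A bare species epithet shared by two genera would be flagged in the same way, which is why genus / species names are needed at that rank.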

Dec 22 '20 16:12 wasade

...sorry, it's been a few years since looking at the code, the verification that the taxonomy is hierarchical may come from t2t validate

Dec 22 '20 16:12 wasade
