augur icon indicating copy to clipboard operation
augur copied to clipboard

Include option to prune outgroup in augur tree or augur refine

Open trvrb opened this issue 5 years ago • 10 comments

If we're rooting based an outgroup, it should be common to then remove this outgroup / root when displaying tree and doing analyses. I believe a good approach could be something like the following:

augur refine --root MH481611 --remove-outgroup

So, we explicitly specify an outgroup to root on using existing --root functionality. We then make a new argument --remove-outgroup that prunes an outgroup if it exists. This means that there is a single tip subtending one of the branches from the root node of the tree. If there doesn't exist a single tip, then --remove-outgroup does nothing.

Alternatively, some of the outgroup rooting logic could be pushed into augur tree, which is a lot slimmer than refine. Then refine could be just run with --keep-root when handed a tree rooted by augur tree that's already had outgroup pruned.

trvrb avatar Aug 01 '19 00:08 trvrb

Alternatively, some of the outgroup rooting logic could be pushed into augur tree, which is a lot slimmer than refine. Then refine could be just run with --keep-root when handed a tree rooted by augur tree that's already had outgroup pruned.

Would this be better as a very divergent root wouldn't affect treetime clock inference? Would then have to ensure treetime didn't reroot!

jameshadfield avatar Aug 01 '19 01:08 jameshadfield

This is true. Divergent outgroup could affect clock.

trvrb avatar Aug 01 '19 01:08 trvrb

Leaving the following code snippets here which we're using to perform this function prior to building it into augur:

augur refine ... --root {params.root} ...
rule prune_outgroup:
    message: "Pruning the outgroup from the tree"
    input:
        tree = rules.refine.output.tree
    output:
        tree = "results/tree_pruned.nwk"
    params:
        root = files.root_name
    run:
        from Bio import Phylo
        T = Phylo.read(input[0], "newick")
        outgroup = [c for c in T.find_clades() if str(c.name) == params[0]][0]
        T.prune(outgroup)
        Phylo.write(T, output[0], "newick")

jameshadfield avatar Oct 22 '19 22:10 jameshadfield

I think this is worth attending to at some point. There are a bunch of SARS-CoV-2 analyses that now want to work with just a more recent clade rather than a full tree. Here, these trees should still be rooted off of Wuhan-Hu-1/2019, but we'd like Wuhan-Hu-1/2019 removed during the augur refine step. But not urgent.

trvrb avatar Jan 22 '21 06:01 trvrb

Bumping this issue as a solution to the incorrect rooting problem we are seeing in seasonal influenza builds. (see Slack thread).

joverlee521 avatar Sep 19 '22 21:09 joverlee521

+1 for a flag that removes the outgroup. As I mentioned in a related PR, we may want to rename --remove-outgroup to --remove-root, to make the flag name consistent with existing --root and --keep-root flags.

huddlej avatar Sep 22 '22 19:09 huddlej

Another bump, and a comment 😉 I'd suggest this be done inside augur refine so that augur tree can more easily be swapped out with other tree builders.

In augur refine the rooting of trees is done within Treetime (both via tt.clock_filter() and tt.run()) so it will require changes there. and I haven't dug into this but we would want the (removed) root to not affect clock calculations.

If we're not inferring a timetree then it's much simpler as we can prune the root straight after rerooting, and this change would be part of augur refine not treetime.

jameshadfield avatar Sep 22 '22 21:09 jameshadfield

@joverlee521, @j23414, and I looked into this a bit more today and came to the same conclusion that supporting a --remove-outgroup flag in augur refine would require modification inside TreeTime's run function. As @jameshadfield notes above, the root gets used when the refine function calls the clock filter and then again in the call to TreeTime's run function. If we want augur refine to root the tree on a specific sequence and then prune it from the final output without affecting clock calculations, there seem to be a lot of TreeTime internals to keep track of. This could be fine, since we have the TreeTime experts on the team. Maybe @rneher or @anna-parker could say how easy this change would be?

In our discussion today, we wondered why rooting (and pruning of the root) isn't part of augur tree, since a rooted divergence tree can be a useful output on its own even if you don't plan to build a time tree and the logic for rooting a divergence tree is much simpler. Different tree builders handle rooting differently. IQ-TREE will root on the first sequence in the alignment by default and allows you to specify an outgroup sequence name. RAxML produces unrooted trees or optionally runs an algorithm to find the best root. FastTree uses an arbitrary root. What if we provided --root and --remove-root arguments to augur tree such that when a root is given, we run BioPython's root_with_outgroup (as we do in augur refine) after the tree is built. If the user requests that the root is removed, we remove it and write out the resulting tree. As Trevor points out in the original issue, the logic required to handle rooting in augur tree is much simpler than digging into TreeTime internals. With this approach, we can run augur refine with --keep-root after pruning the root previously and not have to worry about how the clock calculations will be affected.

I see an argument against the augur tree interface I'm suggesting here in that it creates two different rooting interfaces (one in tree and one in refine) instead of defining a single place in Augur where this rooting happens. I also see the argument that we want people to be able to switch out their tree builders without losing this rooting functionality.

huddlej avatar Sep 22 '22 22:09 huddlej

Hi- @huddlej @joverlee521 and @j23414 I hope I have understood this thread correctly - at the moment TreeTime supports rooting the tree on an outgroup - the outgroup needs to be given to the --reroot argument to reroot on the outgroup (or a list of nodes to be used as an outgroup). It is easy for me to add an argument to TreeTime to remove this outgroup at the end of calculations. In the case of refine this would mean the outgroup would be a part of the tree for all calculations and then only removed at the end - do I understand correctly this is what is desired?

anna-parker avatar Sep 26 '22 14:09 anna-parker

In the case of refine this would mean the outgroup would be a part of the tree for all calculations and then only removed at the end - do I understand correctly this is what is desired?

Ideally this would not be the case -- the outgroup shouldn't bias the clock calculations (as it's often really divergent).

The easiest solutions seems to be to prune the root in augur tree and then use augur refine --keep-root. It's not as clean, but it'll work well. And as @huddlej points out, it's often desirable to get a newick tree sans root.

jameshadfield avatar Sep 26 '22 21:09 jameshadfield

we could add such a --remove-root --root taxonXY either to refine or to tree. If we want to keep the rooting within refine, we would have to

  • check whether rooting is with an outgroup, if so, reroot.
  • check whether to remove the outgroup, if so, remove
  • run refine() specifying that the root is to be kept

this would also remove some duplication here: https://github.com/nextstrain/augur/blob/master/augur/refine.py#L255-L266

rneher avatar Nov 16 '22 20:11 rneher

@rneher's point above makes it clearer to me how we could do this without touching TreeTime internals. Thinking through the user experience of asking augur refine to --remove-root, though, it won't be obvious to users which root gets removed and when because the --root argument provides so many different options (root with one outgroup, root with two outgroup sequences that form a clade, root with oldest, etc.). If we include --remove-root in refine, it will only make sense for the specific case when the user provides a single outgroup and this removal needs to happen before TreeTime runs. We could handle the cases to remove the root for other rooting arguments, but this removal needs to happen after TreeTime runs (assigning the root internally). Some examples to think through this are:

# Minimal root with outgroup followed by removal of root.
# Rooting and removing of root happen before TreeTime runs.
augur refine \
  --tree tree_raw.nwk \
  --root "Wuhan/Hu-1/2019"
  --remove-root \
  --output-tree tree.nwk

# Root with two outgroup sequences followed by removal of root.
# Rooting happens when TreeTime runs either by clock filter or by TreeTime.run().
# The root affects the clock filter and time tree calculations.
# Removal of root has to happen after TreeTime runs and we know deterministically which tips are removed.
augur refine \
  --tree tree_raw.nwk \
  --root "Wuhan/Hu-1/2019" "Wuhan-Hu-1/2019"
  --remove-root \
  --output-tree tree.nwk

# Root with "best" root based on root-to-tip regression followed by removal of root.
# Rooting happens when TreeTime runs either by clock filter or by TreeTime.run().
# The root affects the clock filter and time tree calculations.
# Removal of root has to happen after TreeTime runs and we don't know which tip is removed
# since it is calculated by TreeTime at runtime.
augur refine \
  --tree tree_raw.nwk \
  --root best
  --remove-root \
  --output-tree tree.nwk

To avoid handling each of these use cases, we could implement --remove-root in refine such that it only works when the root is a single outgroup. This implementation could confuse or frustrate users when they first encounter it, but at least root removal would be unambiguously applied to a single use case. If we used @trvrb's original flag name of --remove-outgroup, that might make the single-use functionality of that flag clearer to users and absolve us from needing to support the more complex use cases suggested by a --remove-root flag.

If we want to support removing the root for all other --root options, we'd either need to be ok with the root affecting the TreeTime calculations or pull out the intelligent rooting logic for TreeTime into an earlier section of refine where it can run prior to clock filter and time tree calculations.

The alternative suggestion of implementing the root/remove root in augur tree would look like this:

# Root the divergence tree after building it using a single outgroup we expect to exist in the alignment.
augur tree \
  --alignment aligned.fasta \
  --root "Wuhan/Hu-1/2019"
  --output tree_raw.nwk

# Root the divergence tree after building it using a single outgroup we expect to exist in the alignment.
# Remove the root after rooting.
augur tree \
  --alignment aligned.fasta \
  --root "Wuhan/Hu-1/2019"
  --remove-root \
  --output tree_raw.nwk

augur tree can't provide any more sophisticated rooting options beyond the single outgroup. Rooting and removal of the root must happen (unambiguously) after the tree is built. There are fewer use cases to support here and no special casing for --remove-root behavior. This implementation does expand the scope of the tree command to include logic that we might consider "refinement", though. It also introduces another root argument that could confuse users when they have to choose between tree and refine for rooting.

After working through these examples, I think implementing the --remove-outgroup flag in refine as @trvrb originally suggested does seem to provide the most consistent user experience of the options explored above. (I made this all more complicated by suggesting --remove-root previously.) The proposed implementation would then look like:

# Minimal root with _outgroup_ followed by removal of the outgroup specified as the root.
# Rooting and removing of root happen before TreeTime runs.
augur refine \
  --tree tree_raw.nwk \
  --root "Wuhan/Hu-1/2019"
  --remove-outgroup \
  --output-tree tree.nwk

# Not supported: Root with "best" root based on root-to-tip regression followed by removal of outgroup.
# This doesn't work because the user doesn't provide an _outgroup_ to root with.
augur refine \
  --tree tree_raw.nwk \
  --root best
  --remove-outgroup \
  --output-tree tree.nwk

We don't actually use the term "outgroup" anywhere in the augur refine help text, so we would need to reword the --root description to clarify that "outgroup" refers specifically to a single tip in the tree.

huddlej avatar Jan 26 '23 23:01 huddlej

Thanks for this really detailed take John and working through with examples! I agree with you. In particular:

I think moving rooting/removing-root to tree seems like an ok idea in theory, but I can just imagine the confusion that will ensue when people use this in tree and then have some setting in refine which reroots the tree (for example by best) and then people have no idea why their rooting isn't working as expected... so I'd agree that keeping all the "rooting" in one place will make this clearer.

I also agree about "remove root" being only really useful in cases where we unambiguously know what the user wants removing. For using best I can't even imagine a legitimate use-case here - it's not unusual for best to end up in the middle of your tree (or at least significantly within your tree) - which half would then be removed, and would anyone actually want this? It seems like you'd have to know your data intimately well (and be very secure in how the runs go, every time!) to be using this approach confidently & intentionally, in which case there is probably a better way of doing it! 😆

The only remaining case I'd perhaps be on the fence about is in specifying two sequences where the monophyletic MRCA is the "root" (your example 2). If you've selected an outgroup well and you know what you're doing, then it'll remove exactly what you expect, and this could be useful (sometimes supplying 2-3 outgroup seqs can give better results than just 1 seq, in my experience). However, if you are using slightly less 'clear' outgroups you might end up deleting much more than you intended to (especially if you don't check that your tree came out how you think before diving into refine) - but that may be an acceptable trade-off! (And I wasn't 100% clear if you intended this to be supported or not in your final examples.)

I think in the end I'd probably vote for those two use-cases to be supported (one seq supplied, or two and remove the whole clade), but not supporting more ambiguous options.

emmahodcroft avatar Jan 27 '23 08:01 emmahodcroft