augur
augur copied to clipboard
Include option to prune outgroup in augur tree or augur refine
If we're rooting based an outgroup, it should be common to then remove this outgroup / root when displaying tree and doing analyses. I believe a good approach could be something like the following:
augur refine --root MH481611 --remove-outgroup
So, we explicitly specify an outgroup to root on using existing --root
functionality. We then make a new argument --remove-outgroup
that prunes an outgroup if it exists. This means that there is a single tip subtending one of the branches from the root node of the tree. If there doesn't exist a single tip, then --remove-outgroup
does nothing.
Alternatively, some of the outgroup rooting logic could be pushed into augur tree
, which is a lot slimmer than refine
. Then refine
could be just run with --keep-root
when handed a tree rooted by augur tree
that's already had outgroup pruned.
Alternatively, some of the outgroup rooting logic could be pushed into augur tree, which is a lot slimmer than refine. Then refine could be just run with --keep-root when handed a tree rooted by augur tree that's already had outgroup pruned.
Would this be better as a very divergent root wouldn't affect treetime clock inference? Would then have to ensure treetime didn't reroot!
This is true. Divergent outgroup could affect clock.
Leaving the following code snippets here which we're using to perform this function prior to building it into augur:
augur refine ... --root {params.root} ...
rule prune_outgroup:
message: "Pruning the outgroup from the tree"
input:
tree = rules.refine.output.tree
output:
tree = "results/tree_pruned.nwk"
params:
root = files.root_name
run:
from Bio import Phylo
T = Phylo.read(input[0], "newick")
outgroup = [c for c in T.find_clades() if str(c.name) == params[0]][0]
T.prune(outgroup)
Phylo.write(T, output[0], "newick")
I think this is worth attending to at some point. There are a bunch of SARS-CoV-2 analyses that now want to work with just a more recent clade rather than a full tree. Here, these trees should still be rooted off of Wuhan-Hu-1/2019, but we'd like Wuhan-Hu-1/2019 removed during the augur refine
step. But not urgent.
Bumping this issue as a solution to the incorrect rooting problem we are seeing in seasonal influenza builds. (see Slack thread).
+1 for a flag that removes the outgroup. As I mentioned in a related PR, we may want to rename --remove-outgroup
to --remove-root
, to make the flag name consistent with existing --root
and --keep-root
flags.
Another bump, and a comment 😉 I'd suggest this be done inside augur refine
so that augur tree
can more easily be swapped out with other tree builders.
In augur refine
the rooting of trees is done within Treetime (both via tt.clock_filter()
and tt.run()
) so it will require changes there. and I haven't dug into this but we would want the (removed) root to not affect clock calculations.
If we're not inferring a timetree then it's much simpler as we can prune the root straight after rerooting, and this change would be part of augur refine
not treetime.
@joverlee521, @j23414, and I looked into this a bit more today and came to the same conclusion that supporting a --remove-outgroup
flag in augur refine
would require modification inside TreeTime's run
function. As @jameshadfield notes above, the root gets used when the refine
function calls the clock filter and then again in the call to TreeTime's run
function. If we want augur refine
to root the tree on a specific sequence and then prune it from the final output without affecting clock calculations, there seem to be a lot of TreeTime internals to keep track of. This could be fine, since we have the TreeTime experts on the team. Maybe @rneher or @anna-parker could say how easy this change would be?
In our discussion today, we wondered why rooting (and pruning of the root) isn't part of augur tree
, since a rooted divergence tree can be a useful output on its own even if you don't plan to build a time tree and the logic for rooting a divergence tree is much simpler. Different tree builders handle rooting differently. IQ-TREE will root on the first sequence in the alignment by default and allows you to specify an outgroup sequence name. RAxML produces unrooted trees or optionally runs an algorithm to find the best root. FastTree uses an arbitrary root. What if we provided --root
and --remove-root
arguments to augur tree
such that when a root is given, we run BioPython's root_with_outgroup
(as we do in augur refine) after the tree is built. If the user requests that the root is removed, we remove it and write out the resulting tree. As Trevor points out in the original issue, the logic required to handle rooting in augur tree
is much simpler than digging into TreeTime internals. With this approach, we can run augur refine
with --keep-root
after pruning the root previously and not have to worry about how the clock calculations will be affected.
I see an argument against the augur tree
interface I'm suggesting here in that it creates two different rooting interfaces (one in tree
and one in refine
) instead of defining a single place in Augur where this rooting happens. I also see the argument that we want people to be able to switch out their tree builders without losing this rooting functionality.
Hi- @huddlej @joverlee521 and @j23414 I hope I have understood this thread correctly - at the moment TreeTime supports rooting the tree on an outgroup - the outgroup needs to be given to the --reroot
argument to reroot on the outgroup (or a list of nodes to be used as an outgroup). It is easy for me to add an argument to TreeTime to remove this outgroup at the end of calculations. In the case of refine this would mean the outgroup would be a part of the tree for all calculations and then only removed at the end - do I understand correctly this is what is desired?
In the case of refine this would mean the outgroup would be a part of the tree for all calculations and then only removed at the end - do I understand correctly this is what is desired?
Ideally this would not be the case -- the outgroup shouldn't bias the clock calculations (as it's often really divergent).
The easiest solutions seems to be to prune the root in augur tree
and then use augur refine --keep-root
. It's not as clean, but it'll work well. And as @huddlej points out, it's often desirable to get a newick tree sans root.
we could add such a --remove-root --root taxonXY
either to refine
or to tree
. If we want to keep the rooting within refine
, we would have to
- check whether rooting is with an outgroup, if so, reroot.
- check whether to remove the outgroup, if so, remove
- run
refine()
specifying that the root is to be kept
this would also remove some duplication here: https://github.com/nextstrain/augur/blob/master/augur/refine.py#L255-L266
@rneher's point above makes it clearer to me how we could do this without touching TreeTime internals. Thinking through the user experience of asking augur refine
to --remove-root
, though, it won't be obvious to users which root gets removed and when because the --root
argument provides so many different options (root with one outgroup, root with two outgroup sequences that form a clade, root with oldest, etc.). If we include --remove-root
in refine
, it will only make sense for the specific case when the user provides a single outgroup and this removal needs to happen before TreeTime runs. We could handle the cases to remove the root for other rooting arguments, but this removal needs to happen after TreeTime runs (assigning the root internally). Some examples to think through this are:
# Minimal root with outgroup followed by removal of root.
# Rooting and removing of root happen before TreeTime runs.
augur refine \
--tree tree_raw.nwk \
--root "Wuhan/Hu-1/2019"
--remove-root \
--output-tree tree.nwk
# Root with two outgroup sequences followed by removal of root.
# Rooting happens when TreeTime runs either by clock filter or by TreeTime.run().
# The root affects the clock filter and time tree calculations.
# Removal of root has to happen after TreeTime runs and we know deterministically which tips are removed.
augur refine \
--tree tree_raw.nwk \
--root "Wuhan/Hu-1/2019" "Wuhan-Hu-1/2019"
--remove-root \
--output-tree tree.nwk
# Root with "best" root based on root-to-tip regression followed by removal of root.
# Rooting happens when TreeTime runs either by clock filter or by TreeTime.run().
# The root affects the clock filter and time tree calculations.
# Removal of root has to happen after TreeTime runs and we don't know which tip is removed
# since it is calculated by TreeTime at runtime.
augur refine \
--tree tree_raw.nwk \
--root best
--remove-root \
--output-tree tree.nwk
To avoid handling each of these use cases, we could implement --remove-root
in refine
such that it only works when the root is a single outgroup. This implementation could confuse or frustrate users when they first encounter it, but at least root removal would be unambiguously applied to a single use case. If we used @trvrb's original flag name of --remove-outgroup
, that might make the single-use functionality of that flag clearer to users and absolve us from needing to support the more complex use cases suggested by a --remove-root
flag.
If we want to support removing the root for all other --root
options, we'd either need to be ok with the root affecting the TreeTime calculations or pull out the intelligent rooting logic for TreeTime into an earlier section of refine
where it can run prior to clock filter and time tree calculations.
The alternative suggestion of implementing the root/remove root in augur tree
would look like this:
# Root the divergence tree after building it using a single outgroup we expect to exist in the alignment.
augur tree \
--alignment aligned.fasta \
--root "Wuhan/Hu-1/2019"
--output tree_raw.nwk
# Root the divergence tree after building it using a single outgroup we expect to exist in the alignment.
# Remove the root after rooting.
augur tree \
--alignment aligned.fasta \
--root "Wuhan/Hu-1/2019"
--remove-root \
--output tree_raw.nwk
augur tree
can't provide any more sophisticated rooting options beyond the single outgroup. Rooting and removal of the root must happen (unambiguously) after the tree is built. There are fewer use cases to support here and no special casing for --remove-root
behavior. This implementation does expand the scope of the tree
command to include logic that we might consider "refinement", though. It also introduces another root
argument that could confuse users when they have to choose between tree
and refine
for rooting.
After working through these examples, I think implementing the --remove-outgroup
flag in refine
as @trvrb originally suggested does seem to provide the most consistent user experience of the options explored above. (I made this all more complicated by suggesting --remove-root
previously.) The proposed implementation would then look like:
# Minimal root with _outgroup_ followed by removal of the outgroup specified as the root.
# Rooting and removing of root happen before TreeTime runs.
augur refine \
--tree tree_raw.nwk \
--root "Wuhan/Hu-1/2019"
--remove-outgroup \
--output-tree tree.nwk
# Not supported: Root with "best" root based on root-to-tip regression followed by removal of outgroup.
# This doesn't work because the user doesn't provide an _outgroup_ to root with.
augur refine \
--tree tree_raw.nwk \
--root best
--remove-outgroup \
--output-tree tree.nwk
We don't actually use the term "outgroup" anywhere in the augur refine
help text, so we would need to reword the --root
description to clarify that "outgroup" refers specifically to a single tip in the tree.
Thanks for this really detailed take John and working through with examples! I agree with you. In particular:
I think moving rooting/removing-root to tree
seems like an ok idea in theory, but I can just imagine the confusion that will ensue when people use this in tree
and then have some setting in refine
which reroots the tree (for example by best
) and then people have no idea why their rooting isn't working as expected... so I'd agree that keeping all the "rooting" in one place will make this clearer.
I also agree about "remove root" being only really useful in cases where we unambiguously know what the user wants removing. For using best
I can't even imagine a legitimate use-case here - it's not unusual for best
to end up in the middle of your tree (or at least significantly within your tree) - which half would then be removed, and would anyone actually want this? It seems like you'd have to know your data intimately well (and be very secure in how the runs go, every time!) to be using this approach confidently & intentionally, in which case there is probably a better way of doing it! 😆
The only remaining case I'd perhaps be on the fence about is in specifying two sequences where the monophyletic MRCA is the "root" (your example 2). If you've selected an outgroup well and you know what you're doing, then it'll remove exactly what you expect, and this could be useful (sometimes supplying 2-3 outgroup seqs can give better results than just 1 seq, in my experience). However, if you are using slightly less 'clear' outgroups you might end up deleting much more than you intended to (especially if you don't check that your tree came out how you think before diving into refine
) - but that may be an acceptable trade-off! (And I wasn't 100% clear if you intended this to be supported or not in your final examples.)
I think in the end I'd probably vote for those two use-cases to be supported (one seq supplied, or two and remove the whole clade), but not supporting more ambiguous options.