Allow user to exclude specific strains from refine's clock filter
Context
When augur refine runs with the --clock-filter-iqd argument, it can prune older strains from the tree that the user may have force-included during subsampling. For example, a recent seasonal-flu build for B/Vic failed when an important reference strain, B/Austria/1359417/2021, was pruned by refine's clock filter and a downstream titer model that relied on that strain couldn't run as expected. The only way to exclude that strain from the clock filter was to disable the clock filter entirely.
In other places where we prune outliers, like this part of the seasonal-flu workflow, we allow users to specify strains to exclude from the outlier set.
Description
We should provide an argument like --keep-strains or --include-strains (I'm not too attached to a specific argument name) which takes a path to a text file with one strain per line representing all strains that should be excluded from the clock filter.
Examples
In the seasonal-flu build mentioned above, the proposed augur refine command might look like this.
augur refine \
--tree builds/vic_2y_titers/ha/tree_common.nwk \
--alignment builds/vic_2y_titers/ha/aligned.fasta \
--metadata builds/vic_2y_titers/metadata.tsv \
--output-tree builds/vic_2y_titers/ha/tree.nwk \
--output-node-data builds/vic_2y_titers/ha/branch-lengths.json \
--keep-root \
--stochastic-resolve \
--timetree \
--use-fft \
--no-covariance \
--clock-rate 0.00145 \
--clock-std-dev 0.00029 \
--coalescent const \
--date-confidence \
--date-inference marginal \
--clock-filter-iqd 4 \
--keep-strains config/vic/reference_strains.txt
Adding --keep-ids as another option for naming, based on the following reasoning:
- Not --include-*. In practice, I assume the file used for this option will be the same one that is used for augur filter --include. For this reason I initially thought it would be a good idea to use the same option name, --include, but augur refine --include can be confusing: I can see potential misinterpretation as "these sequences will be considered for refinement" rather than "these sequences will be included in the output".
- --keep-* seems fine. There is already --keep-root and --keep-polytomies, which serve different purposes, but I don't see --keep-ids as conflicting with either of those.
- Not *-strain, per discussion in https://github.com/nextstrain/augur/issues/877.
- *-ids matches --metadata-id-columns.
This should be easy to implement here:
https://github.com/nextstrain/augur/blob/fe72facfce0937d65fdc7c79c236b86cd21a6b5f/augur/refine.py#L46-L53
- use augur.io.read_strains to set keep_ids
- add another condition before pruning:

    for n in leaves:
        if n.bad_branch and n.name not in keep_ids:
            tt.tree.prune(n)
            print('pruning leaf ', n.name)
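To check the logic independently of augur and TreeTime, the two steps can be sketched as a standalone script. Everything here is an assumption made for illustration: Leaf is a hypothetical stand-in for a tree tip with only the name and bad_branch attributes, read_keep_ids is a simplified stand-in for augur.io.read_strains (not its actual implementation), and the outlier/inlier strain names are made up.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class Leaf:
    """Hypothetical stand-in for a tree tip; not a TreeTime or Bio.Phylo class."""
    name: str
    bad_branch: bool = False


def read_keep_ids(path, comment_char="#"):
    """Read one ID per line, skipping blank lines and '#' comments.

    Simplified stand-in for augur.io.read_strains (assumed behavior).
    """
    keep_ids = set()
    for line in Path(path).read_text().splitlines():
        strain = line.split(comment_char, 1)[0].strip()
        if strain:
            keep_ids.add(strain)
    return keep_ids


def leaves_to_prune(leaves, keep_ids):
    """Return leaves flagged by the clock filter that the user did not ask to keep."""
    return [n for n in leaves if n.bad_branch and n.name not in keep_ids]


# Write a small keep file, mimicking config/vic/reference_strains.txt.
with tempfile.TemporaryDirectory() as tmp:
    keep_file = Path(tmp) / "reference_strains.txt"
    keep_file.write_text("# reference strains\nB/Austria/1359417/2021\n")
    keep_ids = read_keep_ids(keep_file)

leaves = [
    Leaf("B/Austria/1359417/2021", bad_branch=True),  # flagged, but kept
    Leaf("hypothetical/outlier/1", bad_branch=True),  # flagged, pruned
    Leaf("hypothetical/inlier/2"),                    # not flagged
]

for n in leaves_to_prune(leaves, keep_ids):
    print("pruning leaf ", n.name)
```

With this input only hypothetical/outlier/1 is selected for pruning. Note that a kept tip still carries its bad_branch flag in this sketch; whether kept strains should also count toward rate estimation is a separate question that it does not address.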
Thanks, @victorlin! I like --keep-ids and your proposed implementation. It will probably take longer to write the tests than the desired functionality... 😄