augur Allow user to exclude specific strains from refine's clock filter

Context

When augur refine runs with the --clock-filter-iqd argument, it can prune older strains from the tree that the user may have force-included during subsampling. For example, a recent seasonal-flu build for B/Vic failed when an important reference strain, B/Austria/1359417/2021, was pruned by refine's clock filter and a downstream titer model that relied on that strain couldn't run as expected. The only way to exclude that strain from the clock filter was to disable the clock filter entirely.

In other places where we prune outliers, like this part of the seasonal-flu workflow, we allow users to specify strains to exclude from the outlier set.

Description

We should provide an argument like --keep-strains or --include-strains (I'm not too attached to a specific argument name) which takes a path to a text file with one strain per line representing all strains that should be excluded from the clock filter.
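To make the expected input concrete, here is a minimal sketch of reading such a file into a set of names. The helper name `read_ids` is hypothetical, and the handling of blank lines and `#` comments is an assumption rather than a settled requirement.

```python
import tempfile
from pathlib import Path


def read_ids(path, comment_char="#"):
    """Hypothetical helper: return the set of names listed one per line in `path`."""
    ids = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop anything after the comment character, then surrounding whitespace.
            name = line.split(comment_char)[0].strip()
            if name:
                ids.add(name)
    return ids


# Write a small example file; the strain name comes from the build described above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reference_strains.txt"
    path.write_text(
        "# strains to protect from the clock filter\nB/Austria/1359417/2021\n\n",
        encoding="utf-8",
    )
    keep_ids = read_ids(path)

print(keep_ids)
```

Returning a set keeps the later membership check cheap, however many strains the file lists.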

Examples

In the seasonal-flu build mentioned above, the proposed augur refine command might look like this:

augur refine \
    --tree builds/vic_2y_titers/ha/tree_common.nwk \
    --alignment builds/vic_2y_titers/ha/aligned.fasta \
    --metadata builds/vic_2y_titers/metadata.tsv \
    --output-tree builds/vic_2y_titers/ha/tree.nwk \
    --output-node-data builds/vic_2y_titers/ha/branch-lengths.json \
    --keep-root \
    --stochastic-resolve \
    --timetree \
    --use-fft \
    --no-covariance \
    --clock-rate 0.00145 \
    --clock-std-dev 0.00029 \
    --coalescent const \
    --date-confidence \
    --date-inference marginal \
    --clock-filter-iqd 4 \
    --keep-strains config/vic/reference_strains.txt

Feb 27 '25 23:02 huddlej

Adding --keep-ids as another option for naming, based on the following reasoning:

Not --include-*. In practice, I assume the file used for this option will be the same that is used for augur filter --include. For this reason I initially thought it would be a good idea to use the same option name --include, but augur refine --include can be confusing – I can see potential misinterpretation as "these sequences will be considered for refinement" rather than "these sequences will be included in the output".
--keep-* seems fine. There is already --keep-root and --keep-polytomies which serve different purposes, but I don't see --keep-ids as conflicting with either of those.
Not *-strain per discussion in https://github.com/nextstrain/augur/issues/877.
*-ids matches --metadata-id-columns.

Mar 14 '25 00:03 victorlin

This should be easy to implement here:

https://github.com/nextstrain/augur/blob/fe72facfce0937d65fdc7c79c236b86cd21a6b5f/augur/refine.py#L46-L53

Use augur.io.read_strains to set keep_ids.

Add another condition before pruning, so that a leaf flagged as an outlier is only pruned when it is not in keep_ids:

for n in leaves:
    if n.bad_branch and n.name not in keep_ids:
        tt.tree.prune(n)
        print('pruning leaf ', n.name)
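The effect of the extra condition can be checked without a real tree. The sketch below uses a stand-in `Leaf` class and a hypothetical `leaves_to_prune` function, neither of which is part of augur. The only assumption carried over from the snippet above is that each leaf has a `name` and a `bad_branch` flag. Apart from B/Austria/1359417/2021, the strain names are made up.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """Stand-in for a tree leaf with only the two attributes the filter reads."""
    name: str
    bad_branch: bool


def leaves_to_prune(leaves, keep_ids):
    """Return the flagged outliers that the user has not asked to keep."""
    return [n for n in leaves if n.bad_branch and n.name not in keep_ids]


leaves = [
    Leaf("B/Austria/1359417/2021", bad_branch=True),  # outlier, but protected
    Leaf("hypothetical/outlier/1", bad_branch=True),  # outlier, not protected
    Leaf("hypothetical/normal/1", bad_branch=False),  # not an outlier
]
keep_ids = {"B/Austria/1359417/2021"}

pruned = [n.name for n in leaves_to_prune(leaves, keep_ids)]
print(pruned)
```

Only the unprotected outlier ends up in the prune list. The protected reference strain stays in the tree even though it is flagged.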

Mar 14 '25 00:03 victorlin

Thanks, @victorlin! I like --keep-ids and your proposed implementation. It will probably take longer to write the tests than the desired functionality... 😄

Mar 18 '25 21:03 huddlej