ncov
Allow users to disable clock filter in augur refine rule
Context
In the refine step of the ncov workflow, the --clock-filter-iqd argument is hardcoded such that TreeTime will always apply its clock filter to strains that fall beyond the given number of interquartile distances. This default behavior can cause strains to be pruned from the final tree even if the user has explicitly requested those strains to be included in the analysis with the --include flag in the filter commands. As one user summarizes:
it would be generally useful to have a specific toggle for [clock filter pruning], as very often you are mainly interested in retaining a subset of viral genomes and looking at the phylogenetic relationships in global context. You can specify these genomes in the “include.txt” file in the config subdirectory but it was unexpected for us to see a few being dropped at the refinement level.
Description
Ideally, we would have a way for users to easily toggle the clock filter argument on or off. One current way to effectively disable the clock filter is to increase the n-IQD value passed to the argument to such a large number that it never filters anything. A better solution would be to allow users to explicitly disable the clock filter.
Possible solutions
Some possible solutions would be to allow the user to set the clock filter parameter to False or None or an empty string as in the example YAML entries below. Then we would replace the current clock filter parameter in the workflow with a function that returns the appropriate clock filter argument when the user has requested it and an empty string when it has been disabled. We use this same approach for enabling/disabling specific filter types in the subsampling logic.
refine:
  # Current default value, should enable clock filter argument.
  clock_filter_iqd: 4
  # Extreme value, uses the clock filter argument but effectively does not filter most strains.
  clock_filter_iqd: 100
  # Disable clock filter by specifying a boolean value.
  clock_filter_iqd: False
  # Disable clock filter by specifying an empty value.
  clock_filter_iqd:
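To make the intended meaning of these entries concrete, here is a minimal Python sketch of how the workflow might interpret the parsed values. It assumes the YAML loader turns `False` into the boolean `False` and an empty value into `None`; the function name `normalize_clock_filter_iqd` is hypothetical and is not part of the ncov workflow or augur.

```python
def normalize_clock_filter_iqd(value):
    """Return a positive number when the clock filter is enabled, else None.

    Hypothetical helper: False, None, and "" all mean "disabled".
    """
    if value is None or value is False or value == "":
        return None
    # True is also an int in Python, so reject it explicitly as ambiguous.
    if isinstance(value, bool):
        raise ValueError("clock_filter_iqd must be a number, False, or empty")
    if isinstance(value, (int, float)) and value > 0:
        return value
    raise ValueError(f"unsupported clock_filter_iqd value: {value!r}")


# Each alternative from the YAML example, as it would look after parsing.
for parsed in (4, 100, False, None, ""):
    print(repr(parsed), "->", normalize_clock_filter_iqd(parsed))
```

Rejecting `True` and non-positive numbers is a design choice of this sketch, not a requirement stated in the issue.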
Another possible solution would be to modify the augur refine command to take a list of strains to always include despite the clock filter. This approach would require us to pass an include.txt type of input to the refine rule and make a new augur release.
Thanks for summarizing, John. I'm torn here. It seems like the best command line interface is to either include --clock-filter-iqd 4 or just leave out --clock-filter-iqd entirely. But I see why this doesn't work with the way parameters.yml is specified. I'm okay with --clock-filter-iqd False or --clock-filter-iqd 100.
With the boolean-valued approach, I was actually imagining omitting the argument from the shell completely when the value is False. This would work the same way we optionally include a "priority" argument for subsampling:
- Replace the hardcoded argument with a reference to a string parameter
- Load the parameter's value from a function instead of a direct lookup into the config
- Use information about the build (e.g., build config attributes) to conditionally return the complete argument string or an empty string
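The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual code: the function name `get_clock_filter_argument` is hypothetical, the config is passed in explicitly so the example runs on its own, and the command is shortened to a single fixed flag. In a real Snakemake rule, a params function would receive `wildcards` and read the global `config` instead.

```python
def get_clock_filter_argument(config):
    """Return the complete `--clock-filter-iqd N` argument, or "" when disabled.

    Hypothetical helper; assumes the setting lives at config["refine"]["clock_filter_iqd"].
    """
    refine_config = config.get("refine") or {}
    value = refine_config.get("clock_filter_iqd")
    # False, None (empty YAML value), and "" all disable the clock filter.
    if value is None or value is False or value == "":
        return ""
    return f"--clock-filter-iqd {value}"


def build_refine_command(config):
    """Assemble a shortened refine command, dropping the argument when it is empty."""
    parts = ["augur refine", "--no-covariance", get_clock_filter_argument(config)]
    return " ".join(part for part in parts if part)


print(build_refine_command({"refine": {"clock_filter_iqd": 4}}))
print(build_refine_command({"refine": {"clock_filter_iqd": False}}))
```

Returning the whole argument string (flag plus value) rather than only the value is what lets the disabled case vanish from the shell command entirely.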
When the user sets the clock filter parameter to a non-False value (e.g., 4), they should get a command like this:
augur refine \
--tree results/washington/tree_raw.nwk \
[... snip ...]
--no-covariance \
--clock-filter-iqd 4 2>&1 | tee logs/refine.txt
When the user sets the clock filter parameter value to False, they should get the following command:
augur refine \
--tree results/washington/tree_raw.nwk \
[... snip ...]
--no-covariance 2>&1 | tee logs/refine.txt
With the boolean-valued approach, I was actually imagining omitting the argument from the shell completely when the value is False.
👍
I think this is a good approach, John. I like the idea of switching, under the hood, to passing no argument at all. Fair enough to surface this if people will find it useful!