ncov
Keep a small number of BA.2 samples on the tree instead of having them filtered out by clock_filter
Context
On a general, all-lineage tree, we had only 1 focal BA.2 sample, which pulled in just a few contextual BA.2 samples. Most BA.2 samples ended up in the excluded_by_diagnostics.txt file, and my best understanding is that they were removed by the clock_filter set here. With too few (<100) BA.2 samples, their offset fell back to the default of 2 (per here). Below, from the GenBank open data Nextstrain prepares (thank you!), I calculated the deviation, and it's near 25 for BA.2 (21L), so the combination of a default offset = 2 and clock_filter = 20 will nicely exclude many of them XD
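To make the arithmetic concrete, here is a minimal sketch. It assumes the filter flags a sample when the absolute difference between its clock deviation and the clade offset exceeds clock_filter; the function name `is_excluded` is hypothetical and the real diagnostic script may compute this differently.

```python
def is_excluded(deviation, clade_offset, clock_filter):
    """Hypothetical check: flag a sample whose offset-corrected
    clock deviation is larger than the clock filter."""
    return abs(deviation - clade_offset) > clock_filter


# BA.2 (21L) deviation of about 25 with the fallback offset of 2:
# |25 - 2| = 23, which is above the filter of 20.
print(is_excluded(deviation=25, clade_offset=2, clock_filter=20))

# Same sample if the offset reflected the clade's real deviation:
# |25 - 25| = 0, well within the filter.
print(is_excluded(deviation=25, clade_offset=25, clock_filter=20))
```

Under that assumption, the first call flags the sample and the second keeps it, which matches what we saw in excluded_by_diagnostics.txt.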
Description
Have a way to keep lineages with a large number of real mutations in the tree more robustly.
Possible solution
For our build, I (lacking enough Python skills) changed the code to use a much smaller number of samples to calculate the offset. A more robust solution may be to feed in a fixed offset for each clade, calculated from a large dataset, rather than relying on the samples in each particular tree.
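A rough sketch of the fixed-offset idea, not a patch against the actual script. The names `FIXED_CLADE_OFFSETS` and `offset_for_clade` are hypothetical, and the 21L value is only the approximate deviation I measured above, used here for illustration.

```python
DEFAULT_OFFSET = 2.0  # fallback offset mentioned in this issue

# Hypothetical lookup table of per-clade offsets, precomputed once
# from a large dataset instead of from the samples in each tree.
FIXED_CLADE_OFFSETS = {
    "21L": 25.0,  # BA.2, approximate value from the open GenBank data
}


def offset_for_clade(clade, fixed_offsets=FIXED_CLADE_OFFSETS,
                     default=DEFAULT_OFFSET):
    """Return the precomputed offset for a clade, or the default
    when the clade is not in the table."""
    return fixed_offsets.get(clade, default)


print(offset_for_clade("21L"))    # uses the table entry
print(offset_for_clade("other"))  # falls back to the default
```

With a table like this, the offset no longer depends on how many samples of a clade happen to be in a given build.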
Somewhat related to https://github.com/nextstrain/ncov/issues/852.
The enhancement request is easy to satisfy by factoring out the min_clade_member_count_threshold_for_offset of 100 as an optional parameter (with a default of 100 for backwards compatibility). If that number is made smaller, the clock filter becomes noisier for small clades, but that is sometimes (at the user's risk) better than throwing out all clade members because they don't reach 100.
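Roughly what the parameterized version could look like, as a sketch only. The function `clade_offsets`, its input shape, and the use of the median as the per-clade offset are assumptions for illustration, not the current code; only the parameter name and the defaults of 100 and 2 come from this thread.

```python
from collections import defaultdict
from statistics import median


def clade_offsets(samples,
                  min_clade_member_count_threshold_for_offset=100,
                  default_offset=2.0):
    """Compute one clock-deviation offset per clade.

    `samples` is an iterable of (clade, deviation) pairs. Clades with
    fewer members than the threshold get `default_offset`; larger
    clades get the median of their deviations (an assumed estimator).
    """
    deviations_by_clade = defaultdict(list)
    for clade, deviation in samples:
        deviations_by_clade[clade].append(deviation)

    threshold = min_clade_member_count_threshold_for_offset
    return {
        clade: median(devs) if len(devs) >= threshold else default_offset
        for clade, devs in deviations_by_clade.items()
    }


# Three BA.2 samples: below the default threshold, so they fall back.
few_ba2 = [("21L", 24), ("21L", 25), ("21L", 26)]
print(clade_offsets(few_ba2))
# Lowering the threshold (at the user's risk) uses the clade's own data.
print(clade_offsets(few_ba2, min_clade_member_count_threshold_for_offset=3))
```

Keeping 100 as the default leaves existing builds unchanged, while small builds can opt into a lower threshold and accept the noisier estimate.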
Within Nextstrain we (usually) don't have this problem because new clades usually have more than 100 members in GISAID/GenBank right from the beginning. The 100 may thus not be very carefully chosen.
Updated: We're wrapping up this sprint and will leave this issue to the original assignee ;)
Original:
We're planning to tackle #852 this sprint. @corneliusroemer do you want us to bring along this one (to surface min_clade_member_count_threshold_for_offset as a parameter per your suggestion) as well? Since together they control the clock filter (is that right??)