ncov
Allow a list of sequence exclusion files, instead of just one
The files.exclude parameter to the workflow only allows a single sequence exclusion file per build, which means that if I want to use a custom exclusion file in my custom build, I have to either:
- Forgo all the excellent work that's gone into defaults/exclude.txt; or
- Manually combine that file with my own excludes, and merge upstream changes thereafter.
Proposed feature: the workflow should allow the files.exclude parameter to be a list, in which case it would simply pass all of the listed files to augur filter --exclude. The configuration would look like this, for example:
files:
  auspice_config: "puerto-rico_profiles/puerto-rico_open/puerto-rico_auspice_config.json"
  description: "puerto-rico_profiles/puerto-rico_open/puerto-rico_description.md"
  exclude:
    - "defaults/exclude.txt"
    - "puerto-rico_profiles/puerto-rico_open/exclude.txt"
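One way the workflow could accept either form is to normalize the configured value to a list before building the command line. The following is only a minimal sketch, not the actual ncov code: as_list and exclude_argument are hypothetical helper names, and the one assumption about augur is that augur filter --exclude accepts several space-separated file paths.

```python
import shlex


def as_list(value):
    """Normalize a config value: None -> [], a single string -> [value]."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def exclude_argument(files_config):
    """Build the --exclude argument from a (hypothetical) `files` config dict."""
    paths = as_list(files_config.get("exclude"))
    if not paths:
        return ""
    # Quote each path so unusual characters survive the shell.
    return "--exclude " + " ".join(shlex.quote(p) for p in paths)


# The existing single-file form keeps working unchanged.
single = {"exclude": "defaults/exclude.txt"}
# The proposed list form passes every file through.
multiple = {
    "exclude": [
        "defaults/exclude.txt",
        "puerto-rico_profiles/puerto-rico_open/exclude.txt",
    ]
}

print(exclude_argument(single))
print(exclude_argument(multiple))
```

Accepting both a string and a list keeps existing build configurations valid, so the change would be backwards compatible.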
I prototyped this in my own custom workflows and verified that it is indeed using both exclude files:
- https://github.com/sacundim/covid-19-puerto-rico-nextstrain/commit/af24bf24314db4b860a52ca5327bd3a268bf97c9
Some logs that verify that it worked:
augur filter \
--metadata nextstrain-data/files/ncov/open/metadata.tsv.gz \
--include defaults/include.txt \
--exclude defaults/exclude.txt puerto-rico_profiles/puerto-rico_open/exclude.txt \
--min-date 6M \
--query 'country != '"'"'USA'"'"'' \
--group-by region year month \
--subsample-max-sequences 800 \
--output-strains results/puerto-rico/sample-global_late.txt \
2>&1 | tee logs/subsample_puerto-rico_global_late.txt
Sampling at 2 per group.
5176893 strains were dropped during filtering
641 of these were dropped because they were in defaults/exclude.txt
2639423 of these were filtered out by the query: "(country == 'USA' & division != 'Puerto Rico')"
1867961 of these were dropped because they were earlier than 2022.03 or missing a date
614 of these were dropped because they were in puerto-rico_profiles/puerto-rico_open/exclude.txt
355 were dropped during grouping due to ambiguous month information
2 were dropped during grouping due to ambiguous year information
2 strains were added back because they were in defaults/include.txt
667899 of these were dropped because of subsampling criteria
656 strains passed all filters
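As a sanity check on the log above, the per-reason counts can be added up. This sketch just copies the numbers from the log. The dictionary keys are informal labels of my own, and treating the reported total as net of the two strains added back is an assumption, though it is the reading under which the numbers reconcile.

```python
# Counts copied from the augur filter log above.
dropped_by_reason = {
    "defaults/exclude.txt": 641,
    "query": 2639423,
    "min-date or missing date": 1867961,
    "puerto-rico_profiles/puerto-rico_open/exclude.txt": 614,
    "ambiguous month": 355,
    "ambiguous year": 2,
    "subsampling": 667899,
}
added_back = 2  # strains restored via defaults/include.txt
reported_total = 5176893

# Assumption: the reported total is the sum of drops minus strains added back.
net_dropped = sum(dropped_by_reason.values()) - added_back
print(net_dropped == reported_total)
```

Both exclude files appear as separate reasons with nonzero counts (641 and 614), which is consistent with augur having read each of them.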