ncov

Allow a list of sequence exclusion files, instead of just one

Open • sacundim opened this issue 2 years ago • 0 comments

The files.exclude parameter to the workflow accepts only a single sequence exclusion file per build, so if I want to use a custom exclusion file in my custom build, I have to either:

  1. Forgo all the excellent work that's gone into the defaults/exclude.txt; or
  2. Manually combine that file with my own excludes, and merge upstream changes thereafter.

Proposed feature: the workflow should allow the files.exclude parameter to be a list, in which case it passes all of the listed files to augur filter --exclude. For example, the configuration would look like this:

files:
  auspice_config: "puerto-rico_profiles/puerto-rico_open/puerto-rico_auspice_config.json"
  description: "puerto-rico_profiles/puerto-rico_open/puerto-rico_description.md"
  exclude:
    - "defaults/exclude.txt"
    - "puerto-rico_profiles/puerto-rico_open/exclude.txt"

I prototyped this in my own custom workflows and verified that it is indeed using both exclude files:

  • https://github.com/sacundim/covid-19-puerto-rico-nextstrain/commit/af24bf24314db4b860a52ca5327bd3a268bf97c9
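For illustration, and independent of the commit linked above, accepting either a single path or a list could be sketched roughly as follows. The helper name `exclude_files_arg` is hypothetical and not part of the ncov workflow; the sketch only assumes that `augur filter --exclude` takes several space-separated paths, as in the command logged in this issue.

```python
import shlex
from typing import Sequence, Union


def exclude_files_arg(exclude: Union[str, Sequence[str]]) -> str:
    """Return a shell-safe value for `augur filter --exclude`.

    Hypothetical helper: accepts either a single path (current behaviour)
    or a list of paths (proposed behaviour).
    """
    # Treat a bare string as a one-element list so existing configs keep working.
    paths = [exclude] if isinstance(exclude, str) else list(exclude)
    if not paths:
        raise ValueError("files.exclude must name at least one file")
    # Quote each path separately so a space inside a path does not split it.
    return " ".join(shlex.quote(p) for p in paths)


# Example config mirroring the proposed YAML, written as a plain dict.
config = {
    "files": {
        "exclude": [
            "defaults/exclude.txt",
            "puerto-rico_profiles/puerto-rico_open/exclude.txt",
        ]
    }
}

print("--exclude", exclude_files_arg(config["files"]["exclude"]))
```

Normalizing to a list in one place would keep single-file configurations backward compatible, since a plain string still yields a single argument.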

Some logs from a run of the prototype that verify that it worked:

augur filter \
            --metadata nextstrain-data/files/ncov/open/metadata.tsv.gz \
            --include defaults/include.txt \
            --exclude defaults/exclude.txt puerto-rico_profiles/puerto-rico_open/exclude.txt \
            --min-date 6M \
            --query 'country != '"'"'USA'"'"'' \
            --group-by region year month \
            --subsample-max-sequences 800 \
            --output-strains results/puerto-rico/sample-global_late.txt \
    2>&1 | tee logs/subsample_puerto-rico_global_late.txt
 
Sampling at 2 per group.
5176893 strains were dropped during filtering
641 of these were dropped because they were in defaults/exclude.txt
2639423 of these were filtered out by the query: "(country == 'USA' & division != 'Puerto Rico')"
1867961 of these were dropped because they were earlier than 2022.03 or missing a date
614 of these were dropped because they were in puerto-rico_profiles/puerto-rico_open/exclude.txt
355 were dropped during grouping due to ambiguous month information
2 were dropped during grouping due to ambiguous year information
2 strains were added back because they were in defaults/include.txt
667899 of these were dropped because of subsampling criteria
656 strains passed all filters

sacundim, Jul 11 '22 06:07