ncov icon indicating copy to clipboard operation
ncov copied to clipboard

Implement subsampling via a script

Open jameshadfield opened this issue 4 years ago • 0 comments

Description of proposed changes

This implements subsampling via a single script, rather than a series of snakemake commands. The behavior is intended to be the same, and the the subsampling definitions (in builds.yaml) have not changed. This script was initially developed as augur subsample in https://github.com/nextstrain/augur/pull/762, however we are choosing to move development here to allow us to improve the proximity/priority calculations here. The eventual aim is for this functionality to be part of augur.

Verbosity The subsample script contains overly verbose print statements. Some of these should be removed before merge, but they may be helpful for review.

Subsampling syntax We introduce a new rule, extract_subsampling_scheme, which takes the subsampling definition (in builds.yaml), applies wildcard expansion, and then transforms this into a syntax more in line with how augur filter is called. This rule replaces a number of functions formerly in the snakefiles. As an example:

## builds.yaml
subsampling:
  custom-scheme:
    sampleName:
      exclude: "--exclude-where 'aus!=yes'"
      group_by: year month
      seq_per_group: 5

# output from rule extract_subsampling_scheme
sampleName:
  exclude-where:
  - aus!=yes
  group-by:
  - year
  - month
  sequences-per-group: 5

Going forward, I suggest slowly updaating our nCoV subsampling definitions to the latter format. This syntax is more in line with the rest of the augur ecosystem and avoids the sticking point where some values need to specify the --argument and some don't (this has tripped up people). It is also easier (I think) to reason with YAML arrays than strings which are then coerced into arrays.

DAG simplification

By essentially bringing the complexity into the subsample script (eventually to be augur subsample), we have a simpler snakemake DAG, with far fewer rules, which should be easier to port to other workflow languages. The DAG for subsampling now only contains two steps per build: extract_subsampling_scheme (see above) and subsample.

image

DAG for Nextstrain Open (GenBank) builds. Top: current master, bottom: this PR

Future improvements

  • Instead of using augur filter’s run() function, we could refactor that function slightly to have a function which returns a strain list for inclusion, as well as logging data etc. Currently the run() function writes data to disk which we immediately read in.
  • Similarly, this work doesn't change the behavior of the priorities.py and get_distance_to_focal_set.py scripts. These can be refactored to return data rather than writing to disk.
  • Include / exclude files are defined by arguments to augur subsample, however they could be set within the scheme definition, which would allow per-sample differences. (Breaking change.)
  • Allow a log file to be written explaining why strains were not included (or why they were!) which leverages the new logging functionality of augur filter
  • Expand priorities schema definition so that sample definitions define ignore_seqs, which is currently hardcoded to be "Wuhan/Hu-1/2019". Similarly for proximity / priority config parameters. (Breaking change.)
  • For samples which can be computed in parallel (e.g. those which don’t need priorities), in principle we could refactor the code in augur filter to allow independent sets of (sets of) filters to be applied each time we iterate over a chunk of metadata. This should speed things up quite a bit, but would come at increased code complexity. I don’t think this is immediately necessary.
  • Proximity calculations can be slow for large datasets. We can speed this up by prefiltering based on certain settings - for instance, if we are computing a sample with a --min-date and proximity to another sample, we can do this in two stages so that the proximity calculations are run against a pre-filtered set (via `--min-date).

Related issue(s)

Related to https://github.com/nextstrain/augur/issues/635

Testing

Test builds triggered (via GitHub). I will update this section when they complete. Update: they failed. Updated PR, and AWS info below updated.

## open
 nextstrain build --aws-batch --attach 8008fccf-24de-4d64-86c8-d5e0a887cc54   .
https://nextstrain.org/staging/ncov/open/trial/15426a2dbfbb0739560c431c1db1da7d0f60184b/{global, ...}

AWS batch console link

Release checklist

This PR should not introduce any backwards incompatible changes

  • [ ] Update docs/change_log.md in this pull request to document these changes by the date they were added.

jameshadfield avatar Aug 23 '21 08:08 jameshadfield