augur subsample command
Tasks
@victorlin to fill this out
Links
- 2021-08-17 take 1
- 2021-08-23 take 2 (moved to ncov)
- 2022-01-12 discussion during Nextstrain meeting
- 2022-11-03 Re-visiting the augur subsample proposal
- 2024-04-12 take 3
Original issue
A common use case is versatile sub-sampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an augur filter rule, a script to calculate priorities, and snakemake wizardry to allow versatile, declarative subsampling schemes to be simply and intuitively defined.
This allows a simple-to-reason-with YAML file to produce a very bespoke subsampling scheme.

The question arises: how do we do this for a different pathogen?
As the SARS-CoV-2 example leverages snakemake, one solution would be to abstract that logic into an importable snakemake rule. The alternative approach would be a new augur command, augur subsample, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the snakemake-controlled augur filter commands with a single augur subsample command. The YAML file would look similar or identical to the one used by the current snakemake implementation. The subcommand would leverage the functions used by augur filter as well as the priorities script from nCoV.
Thoughts?
Examples
subsampling.yaml:
schemes:
  switzerland:
    # Focal samples for country
    country:
      group_by: "division year month"
      max_sequences: 1500
      exclude: "--exclude-where 'country!={country}'"
    # Contextual samples from country's region
    region:
      group_by: "country year month"
      seq_per_group: 20
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world,
    # excluding the current region to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"
augur subsample --include <TXT> --sequences <FASTA> \
--metadata <TSV> --schemes <YAML> --output <FASTA>
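To make the idea concrete, here is a rough Python sketch of how one scheme from a file like the one above could be expanded into one augur filter call per sample. It is only an illustration under stated assumptions, not the proposed implementation: the function name `build_filter_commands`, the per-sample output file names, and the `params` dictionary that fills in `{country}` and `{region}` are all hypothetical, and the `priorities` step is skipped because it depends on the separate nCoV priorities script. The scheme is written as a Python dict to keep the sketch dependency-free; in practice it would be loaded from the YAML file.

```python
import shlex


def build_filter_commands(scheme, params, metadata, sequences):
    """Expand one subsampling scheme into one `augur filter` argv per sample.

    Hypothetical helper: `scheme` maps sample names to settings, and
    `params` supplies values for placeholders such as {country}.
    """
    commands = {}
    for name, sample in scheme.items():
        argv = [
            "augur", "filter",
            "--metadata", metadata,
            "--sequences", sequences,
            "--output-strains", f"{name}.txt",  # assumed naming convention
        ]
        if "group_by" in sample:
            argv += ["--group-by", *sample["group_by"].split()]
        if "seq_per_group" in sample:
            argv += ["--sequences-per-group", str(sample["seq_per_group"])]
        if "max_sequences" in sample:
            argv += ["--subsample-max-sequences", str(sample["max_sequences"])]
        if "exclude" in sample:
            # Fill in the placeholders, then split the string like a shell would.
            argv += shlex.split(sample["exclude"].format(**params))
        # "priorities" is intentionally ignored in this sketch.
        commands[name] = argv
    return commands


scheme = {
    "country": {
        "group_by": "division year month",
        "max_sequences": 1500,
        "exclude": "--exclude-where 'country!={country}'",
    },
    "region": {
        "group_by": "country year month",
        "seq_per_group": 20,
        "exclude": "--exclude-where 'country={country}' 'region!={region}'",
    },
}

commands = build_filter_commands(
    scheme,
    params={"country": "Switzerland", "region": "Europe"},
    metadata="metadata.tsv",
    sequences="sequences.fasta",
)

for name, argv in commands.items():
    print(f"{name}: {shlex.join(argv)}")
```

The per-sample strain lists would then still need to be merged into a single output, which is the part a single augur subsample command would hide from the user.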
After our recent conversations internally and with @dpark01 about reducing the complexity of the ncov workflow and improving the portability of the existing workflow with other workflow languages and/or platforms, I'm bumping this here as a higher priority issue and moving it from the "backlog" to the "next up".
Here is my current hack--would love to replace all that with augur subsample
It would be nice if a command like this could emit as output a numeric count of the selected samples in each deme.
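As a rough illustration of what such a report could look like, the following Python sketch counts selected strains per deme from a metadata table. Everything here is an assumption made for the example: the helper name `count_by_deme`, the use of a `country` column as the deme, the `strain` column as the identifier, and the tiny made-up metadata table.

```python
import csv
import io
from collections import Counter


def count_by_deme(metadata_file, selected_strains, deme_column="country"):
    """Count how many of the selected strains fall into each deme.

    Hypothetical helper: expects tab-separated metadata with a "strain"
    column and a column naming the deme.
    """
    counts = Counter()
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        if row["strain"] in selected_strains:
            counts[row[deme_column]] += 1
    return counts


# Made-up metadata, for illustration only.
metadata = io.StringIO(
    "strain\tcountry\tregion\n"
    "A\tSwitzerland\tEurope\n"
    "B\tSwitzerland\tEurope\n"
    "C\tGermany\tEurope\n"
    "D\tGermany\tEurope\n"
    "E\tKenya\tAfrica\n"
)

# Strains that a subsampling step is assumed to have selected.
selected = {"A", "B", "D"}

counts = count_by_deme(metadata, selected)
for deme, n in sorted(counts.items()):
    print(f"{deme}\t{n}")
```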
PR #762 begins an implementation of augur subsample.
Update: we've had internal discussions considering this again with a different YAML schema and the addition of weighted sampling (#1318).
augur subsample was released in version 31.5.0. Tracking rollout to pathogen repos in https://github.com/nextstrain/public/issues/26.