mpox
workflow: give input data files unique names
Description of proposed changes
I'm running into a problem when running multiple config files on different input data (e.g. hmpxv1 vs. mpxv, or Nextstrain vs. LAPIS). Since the input data is hard-coded to `data/sequences.fasta` and `data/metadata.tsv`, it is difficult to run different inputs without conflict.
One option could be to add `{build_name}` into the input filenames to make them unique. This is an example of the changes I've made:
```snakemake
rule download:
    message: "Downloading sequences and metadata from data.nextstrain.org"
    output:
        sequences = "data/{build_name}_sequences.fasta.xz",
        metadata = "data/{build_name}_metadata.tsv.gz"
    ...
```
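As an aside, here is a plain-Python illustration (not part of the workflow itself) of why this works: Snakemake fills the `{build_name}` wildcard by simple string substitution, so each build reads and writes a disjoint set of files.

```python
# Templates matching the proposed rule outputs above.
TEMPLATES = [
    "data/{build_name}_sequences.fasta.xz",
    "data/{build_name}_metadata.tsv.gz",
]

def build_paths(build_name):
    """Paths the download rule would produce for one build."""
    return [t.format(build_name=build_name) for t in TEMPLATES]

print(build_paths("hmpxv1"))
# -> ['data/hmpxv1_sequences.fasta.xz', 'data/hmpxv1_metadata.tsv.gz']

# The hmpxv1 and mpxv builds touch disjoint files, so they can run side by side:
assert set(build_paths("hmpxv1")).isdisjoint(build_paths("mpxv"))
```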
Testing
Running the following commands will produce distinct outputs in `data` and `results`:
```sh
snakemake -c 1 results/mpxv/filtered.fasta --configfile config/config_mpxv.yaml
snakemake -c 1 results/hmpxv1/filtered.fasta --configfile config/config_hmpxv1.yaml
```
- data:
  - hmpxv1_metadata.tsv
  - hmpxv1_metadata.tsv.gz
  - hmpxv1_sequences.fasta
  - hmpxv1_sequences.fasta.xz
  - mpxv_metadata.tsv
  - mpxv_metadata.tsv.gz
  - mpxv_sequences.fasta
  - mpxv_sequences.fasta.xz
- results:
  - hmpxv1/
    - hmpxv1_metadata.tsv
  - mpxv/
    - mpxv_metadata.tsv
To compare different data sources, I add the data source into the build name. For example:
```yaml
# config_hmpxv1_nextstrain.yaml
build_name: "hmpxv1_nextstrain"
auspice_name: "monkeypox_hmpxv1_nextstrain"
```

```yaml
# config_hmpxv1_lapis.yaml
build_name: "hmpxv1_lapis"
auspice_name: "monkeypox_hmpxv1_lapis"
```
I quite like this approach, since it mirrors the output structure of https://github.com/nextstrain/ncov. But I would love to know more about how you're implementing multiple "builds", without invoking the full input/build logic from the ncov pipeline. Thanks!
Can you share the workflow in which you're having issues with input files? I think `data/` is supposed to contain all sequences. I could imagine naming them `lapis_sequences.fasta`, `gisaid_sequences.fasta`, etc., but naming them by build would be confusing - maybe I don't understand the problem you're having.
That actually clarifies things a lot, thanks! Is my understanding of the current workflow correct:

- `data/sequences.fasta` should contain all possible sequences, which might include sequences from LAPIS, GISAID, local assemblies, etc.
- A build is specified with `config_{build_name}.yaml` and customized with `filter` options, for example:

  ```yaml
  ## filter
  min_date: 2017
  min_length: 10000
  filters: "--exclude-where clade!=hMPXV-1"
  ```

- If I just wanted to make a `lapis` + `local` sequences build, maybe I could add a `data_source` column in `data/metadata.tsv`, and then do something like:

  ```yaml
  min_date: 2017
  min_length: 10000
  filters: --query "(data_source == 'lapis') | (data_source == 'local')"
  ```