
workflow: give input data files unique names

Open ktmeaton opened this issue 2 years ago • 2 comments

Description of proposed changes

I'm running into a problem when running multiple config files on different input data (e.g. hmpxv1 vs. mpxv, or Nextstrain vs. LAPIS). Because the input data paths are hard-coded to data/sequences.fasta and data/metadata.tsv, it is difficult to run different inputs without conflict.

One option could be to add {build_name} to the input filenames to make them unique. This is an example of the changes I've made:

rule download:
    message: "Downloading sequences and metadata from data.nextstrain.org"
    output:
        sequences = "data/{build_name}_sequences.fasta.xz",
        metadata = "data/{build_name}_metadata.tsv.gz"
    ...
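
The effect of the {build_name} wildcard on file naming can be sketched in plain Python. This is only an illustration, not workflow code: the input_paths helper is hypothetical, and the two path templates are copied from the rule above.

```python
# Hypothetical helper (not part of the mpox workflow): expands the path
# templates from the rule above for a given build name.
TEMPLATES = {
    "sequences": "data/{build_name}_sequences.fasta.xz",
    "metadata": "data/{build_name}_metadata.tsv.gz",
}


def input_paths(build_name: str) -> dict:
    """Return the download targets for one build."""
    return {key: tpl.format(build_name=build_name) for key, tpl in TEMPLATES.items()}


mpxv = input_paths("mpxv")
hmpxv1 = input_paths("hmpxv1")

# No path is shared between the two builds, so their downloads cannot clash.
shared = set(mpxv.values()) & set(hmpxv1.values())
print(mpxv["sequences"])   # data/mpxv_sequences.fasta.xz
print(hmpxv1["metadata"])  # data/hmpxv1_metadata.tsv.gz
print(shared)              # set()
```

With the hard-coded names, both builds would map to the same two paths, which is the conflict described above.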

Testing

Running the following commands will produce distinct outputs in data and results:

snakemake -c 1 results/mpxv/filtered.fasta --configfile config/config_mpxv.yaml
snakemake -c 1 results/hmpxv1/filtered.fasta --configfile config/config_hmpxv1.yaml
  • data:
    • hmpxv1_metadata.tsv
    • hmpxv1_metadata.tsv.gz
    • hmpxv1_sequences.fasta
    • hmpxv1_sequences.fasta.xz
    • mpxv_metadata.tsv
    • mpxv_metadata.tsv.gz
    • mpxv_sequences.fasta
    • mpxv_sequences.fasta.xz
  • results:
    • hmpxv1/
    • hmpxv1_metadata.tsv
    • mpxv/
    • mpxv_metadata.tsv

To compare different data sources, I add the data source to the build name. For example:

# config_hmpxv1_nextstrain.yaml
build_name: "hmpxv1_nextstrain"
auspice_name: "monkeypox_hmpxv1_nextstrain"

# config_hmpxv1_lapis.yaml
build_name: "hmpxv1_lapis"
auspice_name: "monkeypox_hmpxv1_lapis"

I quite like this approach, since it mirrors the output structure of https://github.com/nextstrain/ncov. But I would love to know more about how you're implementing multiple "builds", without invoking the full input/build logic from the ncov pipeline. Thanks!

ktmeaton, Jun 28 '22 18:06

Can you share the workflow in which you're having issues with input files? I think data/ is supposed to contain all sequences. I could imagine naming them by source, e.g. lapis_sequences.fasta and gisaid_sequences.fasta, but naming them by build would be confusing. Maybe I don't understand the problem you're having.

corneliusroemer, Jun 28 '22 18:06

That actually clarifies things a lot, thanks! Is the following understanding of the current workflow correct?

  • data/sequences.fasta should contain all possible sequences.
    • This might include sequences from LAPIS, GISAID, local assemblies, etc.
  • A build is specified with config_{build_name}.yaml and customized with filter options, for example:
    ## filter
    min_date: 2017
    min_length: 10000
    filters: "--exclude-where clade!=hMPXV-1"
    
  • If I just wanted to make a lapis+local sequences build, maybe I could make a data_source column in data/metadata.tsv, and then do something like:
    min_date: 2017
    min_length: 10000
    filters: --query "(data_source == 'lapis') | (data_source == 'local')"
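
As a rough illustration of what that query would select (assuming the filters string is handed to augur filter, whose --query option takes a pandas-style expression), here is the same row selection written with only the Python standard library. The data_source column is the hypothetical column proposed above, and the strain names and values are made up.

```python
import csv
import io

# Made-up metadata for illustration; data_source is the hypothetical
# column proposed above, not an existing column in data/metadata.tsv.
METADATA_TSV = (
    "strain\tdata_source\n"
    "sample_a\tlapis\n"
    "sample_b\tgisaid\n"
    "sample_c\tlocal\n"
)

# Equivalent of: (data_source == 'lapis') | (data_source == 'local')
KEEP = {"lapis", "local"}

reader = csv.DictReader(io.StringIO(METADATA_TSV), delimiter="\t")
kept = [row["strain"] for row in reader if row["data_source"] in KEEP]
print(kept)  # ['sample_a', 'sample_c']
```

Only rows whose data_source is lapis or local survive, so a single shared data/metadata.tsv could serve several builds that differ only in their filter.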
    

ktmeaton, Jun 28 '22 22:06