
feat: Add configurable protocol support to ensembl reference download

Open pettyalex opened this issue 1 year ago • 8 comments

This PR adds support for a configurable protocol (ftp, http, or https) to the Ensembl reference data download and VEP cache download wrappers.

I was evaluating snakemake-wrappers and widely used workflows as tools for my group to potentially use, and immediately ran into our firewall rules. By default we can make outgoing HTTP and HTTPS requests, but not FTP. I can request a change to our firewall, but it would also be helpful if the rules that download reference files could download them over HTTP, as ftp.ensembl.org has been available over HTTP for a very long time.
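To illustrate the idea, here is a minimal sketch of how a download URL could be assembled from a user-selected protocol. It is not the code in this PR: the function name `build_ensembl_url`, the `ALLOWED_PROTOCOLS` tuple, and the example path are all hypothetical, and only the host `ftp.ensembl.org` comes from the description above.

```python
from urllib.parse import urlunsplit

# Hypothetical whitelist; the PR description names ftp, http, and https.
ALLOWED_PROTOCOLS = ("ftp", "http", "https")


def build_ensembl_url(protocol: str, path: str, host: str = "ftp.ensembl.org") -> str:
    """Return a download URL for `path` on `host` using the chosen protocol.

    `path` is whatever server-side path the caller wants; this sketch does
    not know or validate the real Ensembl directory layout.
    """
    protocol = protocol.lower()
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(
            f"unsupported protocol {protocol!r}; expected one of {ALLOWED_PROTOCOLS}"
        )
    # Normalise to exactly one leading slash before joining.
    return urlunsplit((protocol, host, "/" + path.lstrip("/"), "", ""))


if __name__ == "__main__":
    # Illustrative path only, not a real file on the server.
    print(build_ensembl_url("https", "pub/example/file.txt"))
```

In a wrapper, the protocol would presumably arrive through the rule's `params:`; validating it up front makes a typo fail fast instead of producing an unreachable URL.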

If you'd prefer, I can create an issue to track this. I also intend to update test cases on these wrappers, and if you are happy with this pattern I could apply it across all wrappers in this repository that are currently hard-coded to make ftp requests.

QC

  • [x] I confirm that:

For all wrappers added by this PR,

  • there is a test case which covers any introduced changes,
  • input: and output: file paths in the resulting rule can be changed arbitrarily,
  • either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
  • rule names in the test case are in snake_case and somehow tell what the rule is about or match the tool's purpose or name (e.g., map_reads for a step that maps reads),
  • all environment.yaml specifications follow the respective best practices,
  • the environment.yaml pinning has been updated by running snakedeploy pin-conda-envs environment.yaml on a linux machine,
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
  • all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
  • stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
  • the meta.yaml contains a link to the documentation of the respective tool or command,
  • Snakefiles pass the linting (snakemake --lint),
  • Snakefiles are formatted with snakefmt,
  • Python wrapper scripts are formatted with black.
  • Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have highest priority, and defaults channels are usually not needed because most packages are in conda-forge nowadays).
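
The temporary-file item in the checklist above can be sketched as follows. This is only an illustration of Python's default `tempfile` behaviour, which creates scratch space under the directory returned by `tempfile.gettempdir()`; the function name `write_scratch` and the file name `download.part` are hypothetical and do not come from any wrapper.

```python
import tempfile
from pathlib import Path


def write_scratch(data: bytes) -> tuple[int, Path]:
    """Write `data` to a scratch file and return (bytes written, scratch dir).

    TemporaryDirectory() with no arguments creates a uniquely named directory
    under tempfile.gettempdir() and removes it when the block exits.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch = Path(tmpdir) / "download.part"  # hypothetical file name
        scratch.write_bytes(data)
        size = scratch.stat().st_size
    # The directory has been cleaned up by the time we return.
    return size, Path(tmpdir)


if __name__ == "__main__":
    size, used_dir = write_scratch(b"example payload")
    print(size, used_dir.exists())
```

Relying on the default location means the workflow respects whatever temporary directory the user or cluster has configured, for example through the `TMPDIR` environment variable.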

Summary by CodeRabbit

  • New Features

    • Enhanced data retrieval workflows now support dynamic protocol selection, enabling secure downloads (HTTPS/HTTP) for annotation, genome, variation, and cache data.
    • Added a workflow step for processing multiple chromosomes in a single operation.
  • Tests

    • Expanded the test suite to validate and ensure reliability of the new secure protocol-based workflows.

pettyalex, Feb 15 '24 22:02