
Can't download VEP cache from behind proxy [BUG]

Open · pappewaio opened this issue 3 years ago · 1 comment

Check Documentation

I have checked the following places for your error:

Description of the bug

The HPC system I use sits behind a proxy, which makes it hard to connect to external sources, for example when downloading the caches for SnpEff and VEP with the ./download_cache.nf script.

I successfully got SnpEff to work by whitelisting an environment variable in a local config file and then exporting it:

# Append the whitelist setting to a nextflow.conf in my local execution
# directory (i.e., not in the sarek root)
echo "singularity.envWhitelist = 'SINGULARITYENV_JAVA_TOOL_OPTIONS'" >> nextflow.conf

# Then export the variable
export SINGULARITYENV_JAVA_TOOL_OPTIONS="-Dhttp.proxyHost=xxxx -Dhttp.proxyPort=xxxx -Dhttps.proxyHost=xxxx -Dhttps.proxyPort=xxxx"
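To keep the four `-D` flags in sync, the export above can also be composed from two variables. A minimal sketch, where `proxy.example.org` and `3128` are placeholders for your site's actual proxy host and port:

```shell
# Placeholders: replace with your site's proxy host and port.
PROXY_HOST=proxy.example.org
PROXY_PORT=3128

# Build the Java proxy options once so http and https settings stay consistent.
export SINGULARITYENV_JAVA_TOOL_OPTIONS="-Dhttp.proxyHost=${PROXY_HOST} -Dhttp.proxyPort=${PROXY_PORT} -Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT}"
echo "$SINGULARITYENV_JAVA_TOOL_OPTIONS"
```

Singularity strips the `SINGULARITYENV_` prefix, so the Java process inside the container sees the value as `JAVA_TOOL_OPTIONS`.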

But for VEP it does not appear possible to forward the right flags: https://github.com/bcbio/bcbio-nextgen/issues/818

The solution

I have a change to suggest, both so that I do not have to fork and modify the ./download_cache process, and to potentially make life easier for others with proxy problems.

Why not add an option to the script that points it at a local directory to search instead of the remote cache at Ensembl? The process would then look something like this:

  vep_install \
    -a cf \
    -c . \
    -s ${species} \
    -y ${genome} \
    -u ${local_vep_cache_dir} \
    --CACHE_VERSION ${vep_cache_version} \
    --CONVERT \
    --NO_HTSLIB --NO_TEST --NO_BIOPERL --NO_UPDATE
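For this to work, `${local_vep_cache_dir}` would need to contain a tarball with the name `vep_install` derives from its arguments. A minimal sketch of that derivation, assuming the naming pattern follows the Ensembl FTP layout (`${species}_vep_${version}_${assembly}.tar.gz`):

```shell
# Same parameters as passed to vep_install above.
species=homo_sapiens
genome=GRCh38
vep_cache_version=104

# Naming pattern assumed from the Ensembl FTP indexed_vep_cache layout.
tarball="${species}_vep_${vep_cache_version}_${genome}.tar.gz"
echo "$tarball"   # → homo_sapiens_vep_104_GRCh38.tar.gz
```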

Then running ./download_cache would look like this:

# Download the cache for VEP GRCh38 (VEP version 104)
mkdir -p local_vep_cache_dir_tmp
wget -P local_vep_cache_dir_tmp ftp://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/homo_sapiens_vep_104_GRCh38.tar.gz

nextflow \
  run repos/sarek/download_cache.nf \
    -with-singularity simgs/sarek.2.7.1.sif \
    --vep_cache annotation_cache/VEPeff_cache \
    --species homo_sapiens \
    --vep_cache_version 104 \
    --genome GRCh38 \
    --local_vep_cache_dir local_vep_cache_dir_tmp
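On the implementation side, this could be a one-line fallback inside download_cache.nf. A hypothetical sketch (the parameter name `local_vep_cache_dir` and the Ensembl URL template are assumptions for illustration, not the actual script):

```groovy
// Hypothetical: default to the remote Ensembl FTP directory unless a local
// cache directory was supplied on the command line.
params.local_vep_cache_dir = null

def vep_cache_source = params.local_vep_cache_dir ?:
    "ftp://ftp.ensembl.org/pub/release-${params.vep_cache_version}/variation/indexed_vep_cache"

// vep_cache_source would then be passed to vep_install via -u / --CACHEURL
```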

Let me know if you think this is a good solution. In the meantime, I will prepare a PR.

Nextflow Installation

  • Version: 20.10.0

Container engine

  • Image tag: nfcore/sarek:2.7 sha256:09da1f431aebe8b61da6b989ed2adf17edd03492408d403f87d26b543bd0a365

pappewaio avatar Oct 15 '21 11:10 pappewaio

From vep_install -h

Usage:
perl INSTALL.pl [arguments]

Options
=======

-h | --help        Display this message and quit

-d | --DESTDIR     Set destination directory for API install (default = './')
--CACHE_VERSION    Set data (cache, FASTA) version to install if different from --VERSION (default = 99)
-c | --CACHEDIR    Set destination directory for cache files (default = '/home/jesgaaopen/.vep/')

-a | --AUTO        Run installer without user prompts. Use "a" (API + Faidx/htslib),
                   "l" (Faidx/htslib only), "c" (cache), "f" (FASTA), "p" (plugins) to specify
                   parts to install e.g. -a ac for API and cache
-n | --NO_UPDATE   Do not check for updates to ensembl-vep
-s | --SPECIES     Comma-separated list of species to install when using --AUTO
-y | --ASSEMBLY    Assembly name to use if more than one during --AUTO
-g | --PLUGINS     Comma-separated list of plugins to install when using --AUTO
-r | --PLUGINSDIR  Set destination directory for VEP plugins files (default = '/home/jesgaaopen/.vep/Plugins/')
-q | --QUIET       Don't write any status output when using --AUTO
-p | --PREFER_BIN  Use this if the installer fails with out of memory errors
-l | --NO_HTSLIB   Don't attempt to install Faidx/htslib
--NO_BIOPERL       Don't install BioPerl

-t | --CONVERT     Convert downloaded caches to use tabix for retrieving
                   co-located variants (requires tabix)


-u | --CACHEURL    Override default cache URL; this may be a local directory or
                   a remote (e.g. FTP) address.
-f | --FASTAURL    Override default FASTA URL; this may be a local directory or
                   a remote (e.g. FTP) address. The FASTA URL/directory must have
                   gzipped FASTA files under the following structure:
                   [species]/[dna]/

pappewaio avatar Oct 15 '21 12:10 pappewaio

We are currently not providing a download script for the cache and other annotation files.

maxulysse avatar Aug 26 '22 09:08 maxulysse