wget host address error
Description of the bug
I tried out the dev branch and am encountering a wget error in process NFCORE_FETCHNGS:SRA:SRA_FASTQ_FTP
The underlying error is:
wget: unable to resolve host address 'ftp.sra.ebi.ac.uk'
I'm getting the error for a number of SRX experiment IDs that have downloaded successfully with sra-tools in the past.
I'll try to see if I can figure out the issue, but figured I'd bring it up.
Command used and terminal output
#!/bin/bash
#SBATCH --mem=8G
#SBATCH -t 6:00:00
#SBATCH -p general
#SBATCH -o var/log/fetch-%j.out
#SBATCH -e var/log/fetch-%j.err
module load nextflow
nextflow -log var/log/.fetchngs run nf-core/fetchngs -r dev \
-profile unc_longleaf \
-params-file config/fetchngs_params.yaml
Relevant files
System information
Nextflow 23.04.02; HPC with Slurm; Singularity; RHEL8; fetchngs dev branch
Could be intermittent network or server issues. ENA/SRA do see a lot of traffic.
I'm experiencing the same issue with wget using the dev branch. Were you able to get this to work?
This is caused by a problem with the Singularity container. A certain generation of containers was built with a BusyBox base that had a broken /etc/resolv.conf. I have reported this to the Galaxy folks who build the Singularity containers and will follow up once it is fixed.
I think the problem is the container.
$ module load singularity-ce/4.1.0
$ singularity shell depot.galaxyproject.org-singularity-wget-1.20.1.img
WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
Singularity> wget -t 5 -nv -c -T 60 -O ERX2235404_ERR2179103_2.fastq.gz ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/003/ERR2179103/ERR2179103_2.fastq.gz
wget: unable to resolve host address 'ftp.sra.ebi.ac.uk'
However, if I try the latest version of the container (check https://depot.galaxyproject.org/singularity/):
$ singularity pull https://depot.galaxyproject.org/singularity/wget:1.21.4
$ singularity shell wget\:1.21.4
Singularity> wget -t 5 -nv -c -T 60 -O ERX2235404_ERR2179103_2.fastq.gz ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/003/ERR2179103/ERR2179103_2.fastq.gz
Singularity> ls ERX2235404_ERR2179103_2.fastq.gz
ERX2235404_ERR2179103_2.fastq.gz
So, I guess the solution is to instruct Nextflow to fetch the latest image in modules/local/sra_fastq_ftp/main.nf:
conda "conda-forge::wget=1.20.1"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
    'https://depot.galaxyproject.org/singularity/wget:1.20.1' :
    'biocontainers/wget:1.20.1' }"
Change to (bumping conda as well for consistency, though I haven't tested that):
conda "conda-forge::wget=1.21.4"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
    'https://depot.galaxyproject.org/singularity/wget:1.21.4' :
    'biocontainers/wget:1.21.4' }"
Sorry, just realized that the suggested change has already made it to the dev branch :p
I can confirm, updating the wget container to v.1.21.4 (with fe2756912803b988a3407586c7264578b0c147f2) fixed this issue.
Hello! I am still getting this error even with the fix suggested above. The line before the wget error is a warning about my Singularity setup. Could this be part of the problem?
Command error:
WARNING: Skipping mount /usr/local/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
wget: unable to resolve host address 'ftp.sra.ebi.ac.uk'
Thanks!
Dear @LKeene
I was able to get successful results when conda is used, for example with -profile conda:
nextflow run nf-core/fetchngs -r 1.12.0 -profile conda --input ids.csv --outdir results_naga_test -c ibex.config
A bummer that users are forced to use conda instead of Singularity, at least for now.
Hello everyone!
As I mentioned in https://github.com/nf-core/fetchngs/issues/328, I hit the same wget: unable to resolve host address 'ftp.sra.ebi.ac.uk' failure on v1.12.0 and worked around it locally by applying the changes from PR #338, following @JulianFlesch's suggestion (thanks, Julian!).
Initial fix (DNS resolution):
- Run nextflow pull nf-core/fetchngs
- Inside .nextflow/assets/nf-core/fetchngs/modules/local/sra_fastq_ftp/main.nf:
  - Bump the conda and container definitions to wget=1.21.4
  - Prefix the FASTQ URLs with ftp:// so that wget sees a complete URL
Since GitHub doesn't support attaching file type .nf, I include the updated content of .nextflow/assets/nf-core/fetchngs/modules/local/sra_fastq_ftp/main.nf at the end of this message.
Additional tweaks for large batches (server throttling):
When downloading many files (e.g., 400+ IDs), ENA's FTP servers can throttle concurrent connections, causing "Error in server response. Closing." messages. To reduce these, I created a custom.config file that:
- Limits concurrent downloads (maxForks = 6) to avoid overwhelming the server
- Increases wget retries and timeouts (-t 10 -T 120 --waitretry=30 --retry-connrefused) for more resilience
- Allows more Nextflow-level retries (maxRetries = 4) for processes that fail after wget's internal retries
I then run the pipeline with: nextflow run nf-core/fetchngs ... -c custom.config -resume (e.g., nextflow run nf-core/fetchngs -r 1.12.0 -profile singularity --input ids.csv --outdir data/raw -c custom.config -resume).
This successfully completed downloading 800+ files from my full dataset. The custom.config content is also included below.
Note: I only tested with Singularity. Hopefully, this also fixes the issue in other configuration profiles (e.g., Docker).
Until 1.13.0 lands, this manual patch seems stable. Hope it helps!
Note: pulling a new pipeline release will revert the edits living under .nextflow/assets, so just reapply them if needed.
Updated main.nf file:
process SRA_FASTQ_FTP {
    tag "$meta.id"
    label 'process_low'
    label 'error_retry'

    conda "conda-forge::wget=1.21.4"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/wget:1.21.4' :
        'biocontainers/wget:1.21.4' }"

    input:
    tuple val(meta), val(fastq)

    output:
    tuple val(meta), path("*fastq.gz"), emit: fastq
    tuple val(meta), path("*md5")     , emit: md5
    path "versions.yml"               , emit: versions

    script:
    def args = task.ext.args ?: ''
    // Ensure URLs have an ftp:// protocol prefix
    def fastq0 = fastq[0].startsWith('ftp://') || fastq[0].startsWith('http://') || fastq[0].startsWith('https://') ? fastq[0] : "ftp://${fastq[0]}"
    def fastq1 = fastq.size() > 1 ? (fastq[1].startsWith('ftp://') || fastq[1].startsWith('http://') || fastq[1].startsWith('https://') ? fastq[1] : "ftp://${fastq[1]}") : ''
    if (meta.single_end) {
        """
        wget \\
            $args \\
            -O ${meta.id}.fastq.gz \\
            ${fastq0}

        # md5sum -c expects two spaces between checksum and filename
        echo "${meta.md5_1}  ${meta.id}.fastq.gz" > ${meta.id}.fastq.gz.md5
        md5sum -c ${meta.id}.fastq.gz.md5

        cat <<-END_VERSIONS > versions.yml
        "${task.process}":
            wget: \$(echo \$(wget --version | head -n 1 | sed 's/^GNU Wget //; s/ .*\$//'))
        END_VERSIONS
        """
    } else {
        """
        wget \\
            $args \\
            -O ${meta.id}_1.fastq.gz \\
            ${fastq0}

        echo "${meta.md5_1}  ${meta.id}_1.fastq.gz" > ${meta.id}_1.fastq.gz.md5
        md5sum -c ${meta.id}_1.fastq.gz.md5

        wget \\
            $args \\
            -O ${meta.id}_2.fastq.gz \\
            ${fastq1}

        echo "${meta.md5_2}  ${meta.id}_2.fastq.gz" > ${meta.id}_2.fastq.gz.md5
        md5sum -c ${meta.id}_2.fastq.gz.md5

        cat <<-END_VERSIONS > versions.yml
        "${task.process}":
            wget: \$(echo \$(wget --version | head -n 1 | sed 's/^GNU Wget //; s/ .*\$//'))
        END_VERSIONS
        """
    }
}
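The URL normalization done by fastq0/fastq1 in the script block above can be sketched as a standalone shell helper (add_protocol is a hypothetical name used only for illustration): leave URLs that already carry a protocol untouched, otherwise assume a bare ENA host path and prepend ftp://.

```shell
#!/usr/bin/env sh
# Mirror the fastq0/fastq1 normalization from main.nf:
# pass through ftp://, http://, and https:// URLs unchanged,
# prefix everything else with ftp://.
add_protocol() {
    case "$1" in
        ftp://*|http://*|https://*) printf '%s\n' "$1" ;;
        *)                          printf 'ftp://%s\n' "$1" ;;
    esac
}

add_protocol "ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/003/ERR2179103/ERR2179103_2.fastq.gz"
# -> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/003/ERR2179103/ERR2179103_2.fastq.gz
```

This is exactly why the bare ENA paths broke older wget builds: without a scheme, wget's host resolution depends on how the URL is parsed, so making the scheme explicit is the robust fix.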
custom.config file (for large batches):
/*
* custom.config
* Use with: nextflow run nf-core/fetchngs ... -c custom.config -resume
*/
process {
    // Let Nextflow retry failing processes up to 4 times instead of 2
    withLabel: error_retry {
        errorStrategy = 'retry'
        maxRetries    = 4
    }

    // Tweak the SRA_FASTQ_FTP step (wget downloads)
    withName: 'NFCORE_FETCHNGS:SRA:SRA_FASTQ_FTP' {
        // Limit concurrent downloads to avoid overwhelming the server
        maxForks = 6

        // More forgiving wget flags
        ext.args = '-t 10 -nv -c -T 120 --waitretry=30 --retry-connrefused'
    }
}