sarek
[BUG] BaseRecalibrator errors
Check Documentation
I have checked the following places for your error:
Description of the bug
The pipeline stops at BaseRecalibrator step.
Steps to reproduce
Steps to reproduce the behaviour:
- Command line:
nextflow run nf-core/sarek -r 2.7.1 --cpus 100 --max_cpus 100 --max_memory 120.GB --input wt_etop.vs.wt_unt.tsv -profile singularity -resume
- See error:
-[nf-core/sarek] Pipeline completed with errors-
Error executing process > 'BaseRecalibrator (lib1-lib1-chr22_18339130-18433513)'
Caused by:
Process `BaseRecalibrator (lib1-lib1-chr22_18339130-18433513)` terminated with an error exit status (135)
Command executed:
gatk --java-options -Xmx7g BaseRecalibrator -I lib1.md.bam -O chr22_18339130-18433513_lib1.recal.table --tmp-dir . -R Homo_sapiens_assembly38.fasta -L chr22_18339130-18433513.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --verbosity INFO
Command exit status:
135
Command output:
(empty)
Command error:
.command.sh: line 2: 66505 Bus error gatk --java-options -Xmx7g BaseRecalibrator -I lib1.md.bam -O chr22_18339130-18433513_lib1.recal.table --tmp-dir . -R Homo_sapiens_assembly38.fasta -L chr22_18339130-18433513.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --verbosity INFO
Work dir:
/home/user/sarek_test/src/work/40/723898390643e1286d518afd5e6d68
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
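For context, exit statuses above 128 encode termination by a signal (128 + signal number), so exit status 135 corresponds to signal 7, which on Linux is SIGBUS and matches the "Bus error" message above. A quick check in a shell:

```shell
# Exit status 135 = 128 + 7, i.e. the process was killed by signal 7
echo $((135 - 128))   # prints 7

# Signal 7 on Linux is SIGBUS ("Bus error")
kill -l 7             # prints BUS
```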
Expected behaviour
It works without errors on Ubuntu 20.04 with Singularity 3.6.4 on a laptop with an i5 processor.
Log files
Have you provided the following extra information/files:
- [x] The command used to run the pipeline
- [x] The `.nextflow.log` file
System
- Hardware: HPC, a compute node equipped with an AMD Ryzen processor.
- Executor: an interactive shell through Slurm (`srun -c 120 --mem 240G -t 12:00:00 --pty /bin/bash`)
- OS: CentOS
- Version: 7.9.2009
Nextflow Installation
- Version: 21.04.1.5556
Container engine
- Engine: Singularity
- version: 3.7.0
- Image tag: nfcore/sarek:2.7.1
Additional context
Singularity and Java are installed through Spack.
@nf-core/core any idea on this 66505 Bus error?
The bus error would just be java crashing.
@jacorvar do I understand correctly that this is the local executor in a single-node Slurm job where you reserve 240 GB of RAM (and ask for 120 cores, I assume this is e.g. EPYC)?
Normally crashes like these are related to memory not being readily available at the system level, but I assume you would have enough if that's the case.
Does it crash like this for this step consistently or just every now and then?
The bus error would just be java crashing.
@jacorvar do I understand correctly that this is the local executor in a single-node Slurm job where you reserve 240 GB of RAM (and ask for 120 cores, I assume this is e.g. EPYC)?
Exactly.
Normally crashes like these are related to memory not being readily available at the system level, but I assume you would have enough if that's the case.
Does it crash like this for this step consistently or just every now and then?
Well, I've just executed the pipeline again and it complains this time about a different issue:
Error executing process > 'BaseRecalibrator (lib1-lib1-chr13_18408107-86202979)'
Caused by:
Process `BaseRecalibrator (lib1-lib1-chr13_18408107-86202979)` terminated with an error exit status (249)
Command executed:
gatk --java-options -Xmx7g BaseRecalibrator -I lib1.md.bam -O chr13_18408107-86202979_lib1.recal.table --tmp-dir . -R Homo_sapiens_assembly38.fasta -L chr13_18408107-86202979.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --verbosity INFO
Command exit status:
249
Command output:
(empty)
Command error:
Using GATK jar /opt/conda/envs/nf-core-sarek-2.7.1/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx7g -jar /opt/conda/envs/nf-core-sarek-2.7.1/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar BaseRecalibrator -I lib1.md.bam -O chr13_18408107-86202979_lib1.recal.table --tmp-dir . -R Homo_sapiens_assembly38.fasta -L chr13_18408107-86202979.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --verbosity INFO
Ok, IIRC the nextflow local executor does not try to keep track of/enforce resource usage/availability. I'd recommend running with the slurm executor.
Alternatively, you might get these runs to go through; in that case I'd recommend using -resume as well to benefit from what has already completed.
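For reference, a minimal custom config that switches the pipeline from the local executor to Slurm might look like the sketch below; the queue name and `queueSize` value are assumptions to adapt to your cluster:

```groovy
// slurm.config -- a sketch; queue name and queueSize are assumptions
process {
    executor = 'slurm'
    queue    = 'normal'   // replace with your partition name
}

executor {
    queueSize = 50        // cap on jobs submitted to Slurm at once
}
```

It could then be supplied with `-c slurm.config` on the `nextflow run` command line, alongside `-resume`.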
Are there no other alternatives? I tried other pipelines from nf-core and got no errors on my current HPC. Besides, it's only the BaseRecalibrator step which complains, while everything works like a charm on my Ubuntu laptop.
Regarding the executor, should it matter if I specify the number of CPUs (--cpus and --max_cpus) the pipeline is allowed to use?
Are there no other alternatives? I tried other pipelines from nf-core and got no errors on my current HPC. Besides, it's only the `BaseRecalibrator` step which complains, while everything works like a charm on my Ubuntu laptop.
We're currently refactoring the pipeline, and if I remember correctly the GATK best practices are being updated as well, so this will change too.
Regarding the executor, should it matter if I specify the number of CPUs (`--cpus` and `--max_cpus`) the pipeline is allowed to use?
It should matter only if it goes overboard. If there is such a resource availability issue, you could try restricting resources for this step within a custom config file using selectors for this process. Adding a `maxForks` directive to ensure that not too many concurrent jobs are launched could also be an idea, especially if you're using the local executor.
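As a sketch of that suggestion: the process name below is taken from the error messages in this thread, and the `cpus`/`memory`/`maxForks` values are placeholders to tune, not recommendations:

```groovy
// restrict-recal.config -- a sketch; resource values are placeholders
process {
    withName: 'BaseRecalibrator' {
        cpus     = 2
        memory   = '8 GB'
        maxForks = 10   // run at most 10 BaseRecalibrator tasks concurrently
    }
}
```

Supplied with `-c restrict-recal.config` on the command line, this limits both the per-task resources and the number of concurrent `BaseRecalibrator` tasks.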
After using the slurm executor and the `--no_intervals` option there are fewer errors, specifically at the `prepare_recalibration` step when running it individually.
However, I've noticed sarek does not cache the output from BaseRecalibrator at all. Should I use the option --save_bam_mapped to enable this caching?
Should I use the option `--save_bam_mapped` to enable this caching?
--save_bam_mapped should only save BAMs produced after mapping, so it will not enable any caching here.
Is there anything I can do to enable caching at that step? I'm running the following inside a script with slurm:
export NXF_OPTS='-Xms1g -Xmx4g'
export NXF_EXECUTOR=slurm
ulimit -u 4126507
ulimit -c unlimited
nextflow -log nf.log run -w work nf-core/sarek -r 2.7.1 -profile singularity --cpus 350 --max_cpus 16 --max_memory '128.GB' --outdir results_test --tools freebayes,haplotypecaller,manta,mpileup,strelka,tiddit --input results_test/Preprocessing/TSV/duplicates_marked_no_table.tsv --step prepare_recalibration --no_intervals -resume
Could it be related to this: https://github.com/Sage-Bionetworks-Workflows/sarek/pull/2