[BUG] BaseRecalibrator errors

Open jacorvar opened this issue 4 years ago • 9 comments

nextflow.log

Check Documentation

I have checked the following places for your error:

Description of the bug

The pipeline stops at BaseRecalibrator step.

Steps to reproduce

Steps to reproduce the behaviour:

  1. Command line: nextflow run nf-core/sarek -r 2.7.1 --cpus 100 --max_cpus 100 --max_memory 120.GB --input wt_etop.vs.wt_unt.tsv -profile singularity -resume
  2. See error:
-[nf-core/sarek] Pipeline completed with errors-
Error executing process > 'BaseRecalibrator (lib1-lib1-chr22_18339130-18433513)'

Caused by:
  Process `BaseRecalibrator (lib1-lib1-chr22_18339130-18433513)` terminated with an error exit status (135)

Command executed:

  gatk --java-options -Xmx7g         BaseRecalibrator         -I lib1.md.bam         -O chr22_18339130-18433513_lib1.recal.table         --tmp-dir .         -R Homo_sapiens_assembly38.fasta         -L chr22_18339130-18433513.bed         --known-sites dbsnp_146.hg38.vcf.gz         --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz         --verbosity INFO

Command exit status:
  135

Command output:
  (empty)

Command error:
  .command.sh: line 2: 66505 Bus error               gatk --java-options -Xmx7g BaseRecalibrator -I lib1.md.bam -O chr22_18339130-18433513_lib1.recal.table --tmp-dir . -R Homo_sapiens_assembly38.fasta -L chr22_18339130-18433513.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --verbosity INFO

Work dir:
  /home/user/sarek_test/src/work/40/723898390643e1286d518afd5e6d68

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Expected behaviour

It works without errors in Ubuntu 20.04 with singularity 3.6.4 on a laptop with an i5 processor.

Log files

Have you provided the following extra information/files:

  • [ x ] The command used to run the pipeline
  • [ x ] The .nextflow.log file

System

  • Hardware: HPC, a compute node equipped with an AMD Ryzen processor.
  • Executor: an interactive shell through slurm (srun -c 120 --mem 240G -t 12:00:00 --pty /bin/bash).
  • OS: CentOS
  • Version: 7.9.2009

Nextflow Installation

  • Version: 21.04.1.5556

Container engine

  • Engine: Singularity
  • version: 3.7.0
  • Image tag: nfcore/sarek:2.7.1

Additional context

Singularity and Java are installed through Spack.

jacorvar avatar Sep 28 '21 21:09 jacorvar

@nf-core/core any idea on this 66505 Bus error?

maxulysse avatar Sep 29 '21 09:09 maxulysse

The bus error would just be java crashing.
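
For reference, the exit status 135 in the report is consistent with a bus error: shells report a process killed by signal N as 128 + N, and SIGBUS is signal 7 on Linux. A minimal sketch of that arithmetic (the variable names are only illustrative, and signal numbers can differ on other platforms):

```shell
# Exit status reported by Nextflow for the failed BaseRecalibrator task.
exit_status=135

# By shell convention, a status above 128 means "killed by signal (status - 128)".
signal_number=$((exit_status - 128))
echo "signal number: ${signal_number}"

# kill -l maps a signal number to its name on the current platform.
signal_name=$(kill -l "${signal_number}")
echo "signal name: ${signal_name}"
```

This only identifies the signal; it does not say why the JVM received it.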

@jacorvar do I understand correctly that this is the local executor in a single-node slurm job where you reserve 240 GB of RAM (and ask for 120 cores, I assume this is e.g. EPYC)?

Normally crashes like these are related to memory not being readily available at the system level, but I assume you would have enough with that reservation.

Does it crash like this for this step consistently or just every now and then?

pontus avatar Sep 29 '21 09:09 pontus

> The bus error would just be java crashing.
>
> @jacorvar do I understand correctly that this is the local executor in a single-node slurm job where you reserve 240 GB of RAM (and ask for 120 cores, I assume this is e.g. EPYC)?

Exactly.

> Normally crashes like these are related to memory not being readily available at the system level, but I assume you would have enough with that reservation.
>
> Does it crash like this for this step consistently or just every now and then?

Well, I've just executed the pipeline again and it complains this time about a different issue:

Error executing process > 'BaseRecalibrator (lib1-lib1-chr13_18408107-86202979)'

Caused by:
  Process `BaseRecalibrator (lib1-lib1-chr13_18408107-86202979)` terminated with an error exit status (249)

Command executed:

  gatk --java-options -Xmx7g         BaseRecalibrator         -I lib1.md.bam         -O chr13_18408107-86202979_lib1.recal.table         --tmp-dir .         -R Homo_sapiens_assembly38.fasta         -L chr13_18408107-86202979.bed         --known-sites dbsnp_146.hg38.vcf.gz         --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz         --verbosity INFO

Command exit status:
  249

Command output:
  (empty)

Command error:
  Using GATK jar /opt/conda/envs/nf-core-sarek-2.7.1/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx7g -jar /opt/conda/envs/nf-core-sarek-2.7.1/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar BaseRecalibrator -I lib1.md.bam -O chr13_18408107-86202979_lib1.recal.table --tmp-dir . -R Homo_sapiens_assembly38.fasta -L chr13_18408107-86202979.bed --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz --verbosity INFO

nextflow.log

jacorvar avatar Sep 29 '21 11:09 jacorvar

Ok, IIRC the nextflow local executor does not try to keep track of/enforce resource usage/availability. I'd recommend running with the slurm executor.
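
A minimal sketch of switching to the slurm executor through a small extra config file. The file name `slurm.config` and the `queueSize` value are placeholders, not something taken from this thread; `process.executor` and `executor.queueSize` are standard Nextflow config settings.

```shell
# Write a small Nextflow config that submits each task as its own slurm job.
cat > slurm.config <<'EOF'
process {
    executor = 'slurm'
}
executor {
    // Placeholder: upper bound on jobs queued at once; adjust to site policy.
    queueSize = 20
}
EOF

# Then launch the pipeline with the extra config, for example:
# nextflow run nf-core/sarek -r 2.7.1 -profile singularity -c slurm.config --input wt_etop.vs.wt_unt.tsv -resume
```

With this in place each task gets its own slurm allocation, instead of all tasks sharing the single interactive `srun` reservation.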

Alternatively, you might get these runs to go through on another attempt; in that case I'd recommend using -resume as well, to benefit from what has already completed.

pontus avatar Sep 29 '21 12:09 pontus

Are there no other alternatives? I tried other pipelines from nf-core and got no errors on my current HPC. Besides, it's only the BaseRecalibrator step which complains, while everything works like a charm on my Ubuntu laptop.

Regarding the executor, should it matter if I specify the amount of cpus (--cpus and --max_cpus) the pipeline is allowed to use?

jacorvar avatar Oct 04 '21 11:10 jacorvar

> Are there no other alternatives? I tried other pipelines from nf-core and got no errors on my current HPC. Besides, it's only the BaseRecalibrator step which complains, while everything works like a charm on my Ubuntu laptop.

We're currently refactoring the pipeline, and if I remember well the GATK best practices are updating as well, so this will change too.

> Regarding the executor, should it matter if I specify the amount of cpus (--cpus and --max_cpus) the pipeline is allowed to use?

It should matter only if it goes overboard. If there is such a resource availability issue, you could try restricting resources for this step in a custom config file, using a selector for this process. Adding a maxForks directive to ensure that not too many concurrent jobs are launched could also help, especially if you're using the local executor.
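
A minimal sketch of such a custom config, assuming the process is selected by the name shown in the error message (`BaseRecalibrator`). The file name `recal.config` and the value 4 are placeholders to tune for the node.

```shell
# Write a custom config that caps how many BaseRecalibrator tasks run at once.
cat > recal.config <<'EOF'
process {
    withName: 'BaseRecalibrator' {
        // Placeholder value: at most 4 instances of this process in parallel.
        maxForks = 4
    }
}
EOF

# Pass it to the run with -c, for example:
# nextflow run nf-core/sarek -r 2.7.1 -profile singularity -c recal.config --input wt_etop.vs.wt_unt.tsv -resume
```

The `cpus` and `memory` directives can be set inside the same `withName` block if the step also needs its per-task resources restricted.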

maxulysse avatar Oct 04 '21 12:10 maxulysse

After using the slurm executor and the --no_intervals option, there are fewer errors, specifically at the prepare_recalibration step when running it individually. However, I've noticed sarek does not cache the output from BaseRecalibrator at all. Should I use the option --save_bam_mapped to enable this caching?

jacorvar avatar Jun 07 '22 14:06 jacorvar

> Should I use the option --save_bam_mapped to enable this caching?

--save_bam_mapped should only save BAMs produced after mapping, so it will not enable any caching here.

maxulysse avatar Jun 07 '22 14:06 maxulysse

Is there anything I can do to enable caching at that step? I'm running the following inside a script with slurm:

export NXF_OPTS='-Xms1g -Xmx4g'
export NXF_EXECUTOR=slurm
ulimit -u 4126507
ulimit -c unlimited

nextflow -log nf.log run -w work nf-core/sarek -r 2.7.1 -profile singularity --cpus 350 --max_cpus 16 --max_memory '128.GB' --outdir results_test --tools freebayes,haplotypecaller,manta,mpileup,strelka,tiddit --input results_test/Preprocessing/TSV/duplicates_marked_no_table.tsv --step prepare_recalibration --no_intervals -resume

Could it be related to this: https://github.com/Sage-Bionetworks-Workflows/sarek/pull/2

jacorvar avatar Jun 07 '22 15:06 jacorvar