nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Nextflow Java error

Open magnusmanske opened this issue 2 years ago • 3 comments

Bug report

A multi-day-running pipeline (nextflow 22.04.0 build 5697) has failed suddenly, with nextflow itself crashing. Reproduced with a separate, identical run, though it failed earlier the second time. Nextflow 21.04.1-5556 runs through on the exact same pipeline/data. (I have asked this on the Slack channel but it fizzed out)

Expected behavior and actual behavior

Expected behaviour: Pipeline runs all the way through

Actual behaviour: nextflow crashes after a while

Steps to reproduce the problem

Both code and data are private for the moment. Data is rather large, the pipeline runs for days on our compute farm. I do not have the time to reduce it to a test case.

Program output

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x000014789c86e5b7, pid=992, tid=9047
#
# JRE version: OpenJDK Runtime Environment (11.0+28) (build 11+28)
# Java VM: OpenJDK 64-Bit Server VM (11+28, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#

Environment

  • Nextflow version: 22.04.0 build 5697
  • Java version: [?]
  • Operating system: Linux
  • Bash version: (use the command $SHELL --version)

Additional context

This was started from Nextflow tower, which mandated the upgrade to version 22. Running manually with version 21 now.

magnusmanske avatar May 11 '22 13:05 magnusmanske

End of the log file:

Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/RCN26728.bamstats
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/e7/89c5316b401b0d9206fa943fdcf1a4/RCN26728.bamstats
May-06 13:24:48.184 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 652797; id: 80295; name: all_in_one:samtools_stats (2697); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/1e/ff01d5ba18088a91adcbbc1555d5bf started: 1651839726669; exited: 2022-05-06T12:24:35Z; ]
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/1e/ff01d5ba18088a91adcbbc1555d5bf/RCN22777.bamstats
May-06 13:24:48.221 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 652801; id: 80303; name: all_in_one:samtools_stats (2698); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/94/9303fb9817271ee0a422e406509f68 started: 1651839726671; exited: 2022-05-06T12:24:30Z; ]
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/94/9303fb9817271ee0a422e406509f68/SPT34570.bamstats
May-06 13:24:48.231 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:merge_bams (6812) > jobId: 652960; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/9b/6c9343c7f7f29598ffb60a55236fb8
May-06 13:24:48.231 [Task submitter] INFO  nextflow.Session - [9b/6c9343] Submitted process > all_in_one:merge_bams (6812)
[9b/6c9343] Submitted process > all_in_one:merge_bams (6812)
May-06 13:24:48.266 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 652811; id: 80315; name: all_in_one:samtools_stats (2700); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/9c/a33e07aee9783a2d7d4be64515ba45 started: 1651839726676; exited: 2022-05-06T12:24:36Z; ]
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/9c/a33e07aee9783a2d7d4be64515ba45/SPT36517.bamstats
May-06 13:24:48.335 [Actor Thread 4817] INFO  nextflow.Nextflow - Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/RCN22777.bamstats
Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/RCN22777.bamstats
May-06 13:24:48.348 [Actor Thread 4817] INFO  nextflow.Nextflow - Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT34570.bamstats
Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT34570.bamstats
May-06 13:24:48.354 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:part_until_merge:cleanup_bqsr (6813) > jobId: 652961; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/ba/b2ae93e925ceaeacf3577d4ebf4cca
May-06 13:24:48.354 [Task submitter] INFO  nextflow.Session - [ba/b2ae93] Submitted process > all_in_one:part_until_merge:cleanup_bqsr (6813)
[ba/b2ae93] Submitted process > all_in_one:part_until_merge:cleanup_bqsr (6813)
May-06 13:24:48.359 [Actor Thread 4817] INFO  nextflow.Nextflow - Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT36517.bamstats
Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT36517.bamstats
May-06 13:24:48.371 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 652840; id: 80331; name: all_in_one:samtools_stats (2701); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/cd/9ce135cbc42225852b21b6cf03ac37 started: 1651839771132; exited: 2022-05-06T12:24:24Z; ]
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/cd/9ce135cbc42225852b21b6cf03ac37/SPT34610.bamstats
May-06 13:24:48.390 [Actor Thread 4817] INFO  nextflow.Nextflow - Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT34610.bamstats
Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT34610.bamstats
May-06 13:24:48.487 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:merge_bams (6813) > jobId: 652962; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/69/7e57f62f9f796893bf5398a00c9b63
May-06 13:24:48.487 [Task submitter] INFO  nextflow.Session - [69/7e57f6] Submitted process > all_in_one:merge_bams (6813)
[69/7e57f6] Submitted process > all_in_one:merge_bams (6813)
May-06 13:24:48.609 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:cleanup_merge (2718) > jobId: 652963; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/87/b2be3d36dca8da81567356eeafd72d
May-06 13:24:48.610 [Task submitter] INFO  nextflow.Session - [87/b2be3d] Submitted process > all_in_one:cleanup_merge (2718)
[87/b2be3d] Submitted process > all_in_one:cleanup_merge (2718)
May-06 13:24:48.737 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:bam_index (2718) > jobId: 652964; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/f6/7299d953f741210c73c96f64d1df85
May-06 13:24:48.737 [Task submitter] INFO  nextflow.Session - [f6/7299d9] Submitted process > all_in_one:bam_index (2718)
[f6/7299d9] Submitted process > all_in_one:bam_index (2718)
May-06 13:24:48.858 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:gatk_haplotype_caller_gatk4 (2718) > jobId: 652965; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/27/0a4e4884a301cb1e2c1b489ff91a7d
May-06 13:24:48.858 [Task submitter] INFO  nextflow.Session - [27/0a4e48] Submitted process > all_in_one:gatk_haplotype_caller_gatk4 (2718)
[27/0a4e48] Submitted process > all_in_one:gatk_haplotype_caller_gatk4 (2718)
May-06 13:24:48.976 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process all_in_one:callable_loci_gatk3 (2718) > jobId: 652966; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/cc/30eb949cbd9afddc5b41429db8f54a
May-06 13:24:48.976 [Task submitter] INFO  nextflow.Session - [cc/30eb94] Submitted process > all_in_one:callable_loci_gatk3 (2718)
[cc/30eb94] Submitted process > all_in_one:callable_loci_gatk3 (2718)
May-06 13:24:50.292 [Actor Thread 4025] INFO  nextflow.Nextflow - Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT45007.bam
Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT45007.bam
May-06 13:24:52.570 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 647483; id: 79537; name: upload_bam_sample_to_s3 (2589); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/ff/aa28ab638d9dd1ddbfaa04d9f7522c started: 1651839605414; exited: 2022-05-06T12:24:35Z; ]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x000014789c86e5b7, pid=992, tid=9047
#
# JRE version: OpenJDK Runtime Environment (11.0+28) (build 11+28)
# Java VM: OpenJDK 64-Bit Server VM (11+28, mixed mode, tiered, compressed oops, serial gc, linux-amd64)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
May-06 13:24:52.652 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 647501; id: 79557; name: all_in_one:bam_index (2592); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/65/d2f8c709ab2019dbb1a231cb88be58 started: 1651839609637; exited: 2022-05-06T12:24:51Z; ]
# An error report file with more information is saved as:
# /lustre/scratch118/infgen/gsu/prod/nextflow-tower/hs_err_pid992.log
May-06 13:24:52.654 [Actor Thread 4836] INFO  nextflow.Nextflow - Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT38657.bam.bai
Copying to /lustre/scratch118/infgen/gsu/prod/nextflow-tower/output_pf7_1_rerun/SPT38657.bam.bai
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/65/d2f8c709ab2019dbb1a231cb88be58/SPT38657.bam.bai
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
May-06 13:24:52.783 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 647614; id: 79665; name: all_in_one:bam_index (2606); status: COMPLETED; exit: 0; error: -; workDir: /lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/6e/7dfa8132b204270e0385bfd1837258 started: 1651839672207; exited: 2022-05-06T12:24:44Z; ]
/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/6e/7dfa8132b204270e0385bfd1837258/RCN25264.bam.bai
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
/tmp/1651582662.309097.shell: line 110:   992 Aborted                 nextflow run https\://gitlab.internal.sanger.ac.uk/sanger-pathogens/pipelines/pf7.x -name infallible_blackwell -params-file https\://nf-tower.cellgeni.sanger.ac.uk/api/ephemeral/QRsJvw0oNU9sBJAwzKB9Uw.json -with-tower https\://nf-tower.cellgeni.sanger.ac.uk/api -r bqsr_gatk3 -profile sanger_lsf -latest > >(tee -a $NXF_OUT_FILE) 2>&1

------------------------------------------------------------
Sender: LSF System <lsfadmin@node-11-5-3>
Subject: Job 309097: <nf-workflow-3U1nxgEkL6IrmG> in cluster <farm5> Exited

Job <nf-workflow-3U1nxgEkL6IrmG> was submitted from host <farm5-head1> by user <malpipe> in cluster <farm5> at Tue May  3 13:57:42 2022
Job was executed on host(s) <node-11-5-3>, in queue <basement>, as user <malpipe> in cluster <farm5> at Tue May  3 13:57:42 2022
</nfs/users/nfs_m/malpipe> was used as the home directory.
</lustre/scratch118/infgen/gsu/prod/nextflow-tower> was used as the working directory.
Started at Tue May  3 13:57:42 2022
Terminated at Fri May  6 13:24:56 2022
Results reported at Fri May  6 13:24:56 2022

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -o /lustre/scratch118/infgen/gsu/prod/nextflow-tower/nf-3U1nxgEkL6IrmG.log
#BSUB -q basement
#BSUB -J nf-workflow-3U1nxgEkL6IrmG
#BSUB -R "select[mem>8000] rusage[mem=8000]"
#BSUB -M 8000
set -e
set -o pipefail

# Input variables:
#
# - NXF_UUID: nextflow session id generated by tower
# - NXF_LOG_FILE: nextflow log file name
# - NXF_OUT_FILE: nextflow output file name
# - NXF_IGNORE_RESUME_HISTORY: do not stop for missing nextflow history file
# - NXF_CONFIG_BASE64: nextflow config file encoded as base64 string
# - NXF_SCM_BASE64: nextflow scm file encoded as base64 string
# - NXF_DEBUG: enable debugging mode
# - TOWER_ACCESS_TOKEN: Tower access token
# - TOWER_WORKFLOW_ID: Workflow ID generated by Tower
# - TOWER_CONFIG_BASE64: tower config file encoded as base64 string
# - TOWER_CONFIG_FILE: tower config file name

export NXF_IGNORE_RESUME_HISTORY=true
export NXF_WORK=/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71
export NXF_EXIT_FILE=nf-3U1nxgEkL6IrmG.exit
export TOWER_WORKSPACE_ID=202472793496868
export NXF_CONFIG_FILE=nf-3U1nxgEkL6IrmG.config
export NXF_OUT_FILE=nf-3U1nxgEkL6IrmG.txt
export NXF_ASSETS=/lustre/scratch118/infgen/gsu/prod/nextflow-tower/work_pf71/.nextflow/pipelines/4c49fad3
export NXF_UUID=20433c60-1bc5-494c-a71a-7c5d7570d630
export NXF_TML_FILE=timeline-3U1nxgEkL6IrmG.html
export TOWER_WORKFLOW_ID=3U1nxgEkL6IrmG
export NXF_DEFAULT_DSL=1
export NXF_ANSI_LOG=false
export NXF_PLUGINS_DEFAULT=nf-tower
export NXF_PRERUN_BASE64=bW9kdWxlIGxvYWQgbmV4dGZsb3cvMjIuMDQuMC01Njk3IElTRy9zaW5ndWxhcml0eS8zLjYuNApleHBvcnQgSFRUUF9QUk9YWT0iaHR0cDovL3d3d2NhY2hlLnNhbmdlci5hYy51azozMTI4IgpleHBvcnQgSFRUUFNfUFJPWFk9JEhUVFBfUFJPWFkKZXhwb3J0IFNJTkdVTEFSSVRZX0RPQ0tFUl9VU0VSTkFNRT1tYWxwaXBlLXRlc3QKZXhwb3J0IFNJTkdVTEFSSVRZX0RPQ0tFUl9QQVNTV09SRD03bVZ5cWtDd3ItNWNKWTFZYXhxMQpleHBvcnQgTlhGX09QVFM9Ii1YbXg4RyIKZXhwb3J0IFRPV0VSX0FDQ0VTU19UT0tFTj1leUpoYkdjaU9pSklVekkxTmlKOS5leUp6ZFdJaU9pSTRJaXdpYm1KbUlqb3hOalV4TlRneU5qVTBMQ0p5YjJ4bGN5STZXeUoxYzJWeUlsMHNJbWx6Y3lJNkluUnZkMlZ5TFdKaFkydGxibVFpTENKbGVIQWlPakUyTlRFMU9EWXlOVFFzSW1saGRDSTZNVFkxTVRVNE1qWTFOSDAuSkdSUkFBb1hkQzZVWUUybDBRMFRUaTk4c1h0SGRjSFhtMURlM3dfanNrRQpleHBvcnQgVE9XRVJfUkVGUkVTSF9UT0tFTj1leUpoYkdjaU9pSklVekkxTmlKOS5ZV015TnpBeU5HRXROekEzTXkwMFlqWTRMVGc0TlRBdE9HRmtZMk0zWkdFek56UTUuUGhLLWFISGoxZkthekhyS2xoSnpoR3pSdk5xQnNWVWViR3RSZHpYU1hsVQpleHBvcnQgTlhGX1NDTV9GSUxFPWh0dHBzOi8vbmYtdG93ZXIuY2VsbGdlbmkuc2FuZ2VyLmFjLnVrL2FwaS9lcGhlbWVyYWwvNHh6S0c0Sk4xOTVyLVdRWGVMVU9mQQo=
export NXF_LOG_FILE=nf-3U1nxgEkL6IrmG.log
export NXF_CONFIG_BASE64=dGltZWxpbmUuZW5hYmxlZCA9IHRydWUKdGltZWxpbmUuZmlsZSA9ICIkTlhGX1RNTF9GSUxFIgpwcm9jZXNzLmV4ZWN1dG9yID0gJ2xzZicKcHJvY2Vzcy5xdWV1ZSA9ICdsb25nJwp3b3JrRGlyID0gJy9sdXN0cmUvc2NyYXRjaDExOC9pbmZnZW4vZ3N1L3Byb2QvbmV4dGZsb3ctdG93ZXIvd29ya19wZjcxJwo=

[[ $NXF_DEBUG ]] && (env | sort) && set -x
cache_path=".nextflow/cache/$NXF_UUID"

function save_exit() {
   [[ $NXF_EXIT_FILE ]] && printf $1 > $NXF_EXIT_FILE
}

function pre_run() {
    if [[ $NXF_PRERUN_BASE64 ]]; then
      source /dev/stdin <<<"$(cat <(echo $NXF_PRERUN_BASE64 | base64 -d))" > >(tee -a $NXF_OUT_FILE) 2>&1

(... more ...)
------------------------------------------------------------

Exited with exit code 134.

Resource usage summary:

    CPU time :                                   46755.00 sec.
    Max Memory :                                 2304 MB
    Average Memory :                             1524.16 MB
    Total Requested Memory :                     8000.00 MB
    Delta Memory :                               5696.00 MB
    Max Swap :                                   -
    Max Processes :                              18
    Max Threads :                                67
    Run time :                                   257240 sec.
    Turnaround time :                            257234 sec.

The output (if any) is above this job summary.

magnusmanske avatar May 11 '22 13:05 magnusmanske

I am seeing this error also running a pipeline in nf tower with version 22.04.0 build 5697

Sanger-ad7 avatar May 13 '22 09:05 Sanger-ad7

Also getting this one now with version 21.04.1.5556

magnusmanske avatar May 16 '22 08:05 magnusmanske

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 16 '23 01:01 stale[bot]