
Nextflow - SLURM job interface disconnect

Open nickhsmith opened this issue 2 years ago • 16 comments

Bug report

I am noticing a problem with the Nextflow-SLURM interface when processing a large number of jobs. For example, I am running a workflow that has spawned ~10,000 tasks (50 samples * 150 intervals/sample, plus downstream tasks) using the nf-core sarek workflow. However, I notice that my tasks are hanging and, as a result, nothing new is being submitted to SLURM despite the earlier jobs being completed.

Expected behavior and actual behavior

Nextflow generates a task. Nextflow submits the task to SLURM. SLURM runs the job. SLURM finishes the job (creating .exitcode with value 0). Nextflow recognizes this and submits the next task.

This last step of recognizing a finished job seems to be failing.
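
For context, the handoff in that last step relies on the small .exitcode file that each grid job writes into its task work directory; the head process simply polls for it. A rough sketch of the idea (simplified, not the actual .command.run wrapper that Nextflow generates):

bash .command.sh > .command.out 2> .command.err   # run the task script
echo $? > .exitcode                                # the head process polls for this file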

Steps to reproduce the problem

More tangibly, I have used queueSize to limit my workflow to 100 jobs at a time. Some of those jobs finish (.exitcode exists with the value 0 in the local work dir, e.g. work/76/69b9b4...), BUT these jobs are still considered RUNNING when I look at the .nextflow.log file (see last photo).
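
For reference, a minimal sketch (assuming the default ./work directory layout) to spot-check which task directories already contain a successful exit code:

# List task work directories whose .exitcode already contains 0
find work -maxdepth 3 -name .exitcode -print0 |
  while IFS= read -r -d '' f; do
    [ "$(cat "$f")" = "0" ] && echo "completed: $(dirname "$f")"
  done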

Looking through the log file, I see that SLURM job ID 636541 is submitted at 15:17 on March 2. It finishes and creates the .exitcode file (with a successful 0 exit) at 18:10. SLURM considers the job completed at this point.

slurm finished job

But as of yesterday, March 3, there is no completed status for this job, and as a result the remaining ~7000 jobs in the task scheduler are not started:

Mar-03 05:26:48.603 [Task submitter] DEBUG n.processor.TaskPollingMonitor - %% executor slurm > tasks in the submission queue: 6714 -- tasks to be submitted are shown below

working dir nextflow log

Any help would be appreciated. Is there a reason the completed jobs aren't being recognized as such, or something I can do to force a re-recognition? Thanks in advance.

Environment

nextflow info
  Version: 21.10.6 build 5660
  Created: 21-12-2021 16:55 UTC (17:55 CEST)
  System: Linux 4.18.0-348.12.2.el8_5.x86_64
  Runtime: Groovy 3.0.9 on OpenJDK 64-Bit Server VM 10.0.2+13
  Encoding: UTF-8 (UTF-8)

shell: bash

nickhsmith avatar Mar 04 '22 08:03 nickhsmith

I am still dealing with this problem, this time with the nf-core/rnaseq pipeline, and the lack of throughput is quite frustrating. @csawye01 and I are both having the same issue. Is there anywhere I can look, or anything I can dig into, to understand why this isn't working? Attached is my .nextflow.log file.

SLURM thinks the job is done


NEXTFLOW thinks the job is running


NEXTFLOW log (edited)

nextflow_log.txt

EDIT

I tried running this pipeline with the same configuration without SLURM and it worked without a problem. But I really need it to work with SLURM in my shared compute environment.

nickhsmith avatar Mar 14 '22 11:03 nickhsmith

@pditommaso I was wondering if there has been any exploration of this problem? Based on other issues, it seems like there may be some problems with the platform/slurm tag, but I haven't seen much discussion of them.

Is there anything I can do to help push for this fix? Otherwise, the lack of throughput is severely limiting.

Please let me know; I know it would also help other users such as @csawye01.

nickhsmith avatar Mar 21 '22 08:03 nickhsmith

Hard to say anything without more detailed debug info. Please execute it again, adding the -trace nextflow.executor option as shown below, and upload the .nextflow.log file when done:

nextflow -trace nextflow.executor run [your pipeline params]

pditommaso avatar Mar 21 '22 10:03 pditommaso

Thank you for the comment! I will upload the log file when I can. Thanks again!

nickhsmith avatar Mar 21 '22 10:03 nickhsmith

Attached is the zipped log file. I have edited the PATH names, but otherwise, this is the result after 2 days (~1 day of hanging behavior)

nextflow_trace.log.gz

nickhsmith avatar Mar 23 '22 10:03 nickhsmith

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 20 '22 20:09 stale[bot]

Hi all, I ran into the same problem as well. In case you consider it helpful, I would be happy to provide further information to help the debugging process.

ligrob avatar Nov 09 '22 13:11 ligrob

@ligrob I would love to know what you went through, and how you got around it! Thanks for responding!

nickhsmith avatar Nov 10 '22 08:11 nickhsmith

@ligrob I have the exact same problem as @nickhsmith. I did notice that I could force Nextflow to resume by making a new SSH connection to the machine. If you could tell me the workaround, that would help me a lot. Thanks!

sanderthierens avatar Mar 10 '23 13:03 sanderthierens

I have encountered the same problem with a nf-core/rnaseq pipeline and slurm. A thread dump shows where Nextflow hangs:

"Task monitor" 38 prio=5 os_prio=0 cpu=25638.16ms elapsed=76825.33s allocated=1170M defined_classes=248 tid=0x00007f05fd2e7740 nid=0x2af5ff runnable  [0x00007f05d20c9000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcherImpl.read0(java.base@17.0.7/Native Method)
    at sun.nio.ch.FileDispatcherImpl.read(java.base@17.0.7/FileDispatcherImpl.java:48)
...
    at nextflow.executor.GridTaskHandler.readExitStatus(GridTaskHandler.groovy:371)
    at nextflow.executor.GridTaskHandler.checkIfCompleted(GridTaskHandler.groovy:452)
    at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:611)
    at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:534)

It seems that Nextflow no longer recognizes completed Slurm jobs when it tries to read the .exitcode file. The Java call to read this file gets stuck and does not return. Nextflow resumes in my case when I call ls, lsof or stat on this file. As a workaround, I have placed this snippet in my shell script before calling nextflow run:

# Periodically walk the work dir with lsof (every 600 s) so the shared
# filesystem refreshes its metadata and Nextflow can see new .exitcode files.
WORK_DIR="./work"
nohup lsof +D ${WORK_DIR} -r 600 &> /dev/null &
LSOF_PID=$!
trap "kill $LSOF_PID" EXIT   # stop the background lsof when the script exits

nextflow run ... \
  -work-dir ${WORK_DIR} \
  ...

Environment

  nextflow version: 23.04.1 build 5866
  Ubuntu 22.04
  Ceph storage platform
  openjdk version "17.0.7" 2023-04-18

hmehlan avatar Jul 14 '23 14:07 hmehlan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 15 '23 06:12 stale[bot]

I seem to have the same problem, and it is very difficult to debug because it only occurs when submitting a lot of processes and after the pipeline has been running successfully for 4+ hours. I've tried nextflow version 23.10.1.5891 and nextflow version 24.02.0-edge.5907, so the latest updates haven't fixed this yet. Nextflow successfully submits and monitors hundreds of jobs to SLURM, and builds up its own queue because I set queueSize = 50 to prevent overloading SLURM. It looks like after a long time, once Nextflow has missed enough jobs that have actually completed, it won't submit new jobs because it thinks the SLURM queue is full. So the pipeline works well for 4+ hours, continuously submitting new processes so the queue always holds 50, and then at some point it finishes processes without starting new ones, until only the master process is left in my SLURM queue.
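
As a sanity check (assuming the nf- job-name prefix visible in the sacct output later in this comment), the number of jobs SLURM itself currently sees for this user can be compared against queueSize:

# Count this user's Nextflow-submitted jobs as SLURM sees them right now
squeue -u "$USER" -h -o "%j %T" | grep -c '^nf-'

If that count is far below 50 while Nextflow still refuses to submit, it supports the idea that the internal bookkeeping, not SLURM, is what is full.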


Apr-10 23:19:53.341 [Task submitter] DEBUG n.processor.TaskPollingMonitor - %% executor slurm > tasks in the submission queue: 368 -- tasks to be submitted are shown below
~> TaskHandler[jobId: null; id: 418; name: fastqc_trimmed_reads (QKS3-nextseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/69/54dad65c13535f215df445e6b10c94 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 430; name: fastqc_trimmed_reads (2022187367-miniseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/c8/1669049c5a1e2070b618d0c115cf06 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 448; name: unicycler (2022187367-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/5a/c02d66e8ea2293624de2c324a52741 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 681; name: prokka (2022187361-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/2a/c7c8be53daf59e3b1936c88bb38c3e started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 683; name: fastqc_trimmed_reads (2022187369-miniseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/de/c8f00c57c95834c374b11c2796b663 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 900; name: fastqc_trimmed_reads (2022187352-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/f8/75da6626806692461015a62da7539a started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 901; name: unicycler (2022187352-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/a5/f0fe0e1734fb36d1081d2eeca19a76 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 908; name: abricate (2022187356-nextseq.fna); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/4c/3a316a1e3d5ad595312096fe3dc005 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 909; name: quast (2022187356-nextseq.fna); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/78/7d9eb68755327dbfdcc08db232f963 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 910; name: gtdbtk_classify_wf (2022187356-nextseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/7d/bdf54b108e7f3af7510e61dce2a1bb started: -; exited: -; ]
.. remaining tasks omitted.
Apr-10 23:22:27.459 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
[process] multiqc_raw_fastqc
  status=ACTIVE
  port 0: (value) OPEN  ; channel: fastqcs
  port 1: (cntrl) -     ; channel: $

[process] multiqc_trimmed_fastqc
  status=ACTIVE
  port 0: (value) OPEN  ; channel: fastqcs
  port 1: (value) bound ; channel: trim_logs
  port 2: (cntrl) -     ; channel: $

[process] prokka
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

...

But when I go to the work directory of the jobs that haven't been recognized as COMPLETED, the .exitcode file exists and contains a 0. I tried @hmehlan's approach of calling ls, lsof or stat on this directory/file, but it doesn't do anything for me. The only thing that works is killing the master process and restarting it with -resume. Then it recognizes the progress and finishes the pipeline smoothly, but it's not ideal to have to restart every pipeline at unpredictable times.
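
For anyone hitting the same thing, the restart-and-resume workaround described above looks roughly like this (placeholder IDs, since the head job in my case runs as its own SLURM job):

scancel <head_job_id>      # or kill <pid> if the head process runs outside SLURM
nextflow run <pipeline> -profile <slurm_profile> -resume   # completed tasks come from the cache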

Here is an example of a process that was submitted by Nextflow but never recognized as completed by Nextflow:

Relevant snippet from .nextflow.log

$ cat .nextflow.log | grep "13900461"
Apr-10 21:07:46.869 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [SLURM] submitted process indexRemapping (2022187387-miniseq) > jobId: 13900461; workDir: /PATH/REDACTED/20240409_Validation/retry_nextflow_24.02.0/work/92/7fa32e385dcc35628c53d2aa515e00

and the content of the working directory

$ cd ./work/92/7fa32e385dcc35628c53d2aa515e00
$ ls -lha

total 10M
drwxrwsr-x 2 progal REDACTED  10M Apr 10 21:08 .
drwxrwsr-x 9 progal REDACTED 358M Apr 10 21:06 ..
-rw-rw-r-- 1 progal REDACTED    0 Apr 10 21:08 .command.begin
-rw-rw-r-- 1 progal REDACTED  34K Apr 10 21:08 .command.err
-rw-rw-r-- 1 progal REDACTED 132K Apr 10 21:08 .command.log
-rw-rw-r-- 1 progal REDACTED    0 Apr 10 21:08 .command.out
-rw-rw-r-- 1 progal REDACTED  12K Apr 10 21:07 .command.run
-rw-rw-r-- 1 progal REDACTED  345 Apr 10 21:07 .command.sh
-rw-rw-r-- 1 progal REDACTED  251 Apr 10 21:08 .command.trace
-rw-rw-r-- 1 progal REDACTED    1 Apr 10 21:08 .exitcode
-rw-rw-r-- 1 progal REDACTED 1.6K Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.amb
-rw-rw-r-- 1 progal REDACTED   55 Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.ann
-rw-rw-r-- 1 progal REDACTED 5.6M Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.bwt
-rw-rw-r-- 1 progal REDACTED 1.4M Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.pac
-rw-rw-r-- 1 progal REDACTED 2.8M Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.sa
-rw-rw-r-- 1 progal REDACTED   68 Apr 10 21:08 bwa_index_version.txt

And the SLURM log also shows it was completed

$ sacct --format=User,JobID%50,Jobname%50,partition,state,time,nnode,ncpus,elapsed,start,end | grep "13900461"
   progal                                           13900461             nf-indexRemapping_(2022187387-miniseq)   standard  COMPLETED   00:10:00        1          4   00:00:05 2024-04-10T21:08:00 2024-04-10T21:08:05
                                              13900461.batch                                              batch             COMPLETED                   1          4   00:00:05 2024-04-10T21:08:00 2024-04-10T21:08:05
                                             13900461.extern                                             extern             COMPLETED                   1          4   00:00:05 2024-04-10T21:08:00 2024-04-10T21:08:05

Did anyone find any other workaround? Perhaps some SLURM-specific argument that could be set? I tried making the queue size unlimited with queueSize = 0, but then I get a lot of errors from SLURM which crash the pipeline, unfortunately :/

pbieberstein avatar Apr 10 '24 21:04 pbieberstein

I looked into it further and noticed that there is a specific point in time after which the Nextflow master node stops recognizing completed jobs (it is not missing random jobs throughout execution). For example, I started the pipeline at 22:34 on April 12, and any job that was not finished by 01:32 on April 13 was never recognized as finished by the master node, even though the .exitcode files of the already submitted tasks eventually existed and contained 0. The master node then continued running for another 20 hours. I've re-run with multiple SLURM queue sizes, global queue checking, etc., and the stalling happens at different timepoints, sometimes after 20 minutes (~450 completed tasks), sometimes after 3 hours (~2,900 completed tasks). Does anyone have hints on how to troubleshoot this further? In the .nextflow.log, every minute the pipeline logs

Apr-16 01:46:45.892 [Task monitor] DEBUG n.executor.AbstractGridExecutor - Executor 'slurm' fetching queue global status

and recognizes COMPLETED and FAILED jobs, but once it stalls it no longer does that; it just prints the following every 5 minutes and keeps repeating for hours until I kill the master:

Apr-16 01:50:31.659 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
[process] prokka
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] busco
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (value) bound ; channel: busco_lineages_path
  port 2: (cntrl) -     ; channel: $

[process] get_busco_lineages
  status=ACTIVE
  port 0: (value) OPEN  ; channel: busco_versions
  port 1: (cntrl) -     ; channel: $

[process] busco_plot
  status=ACTIVE
  port 0: (value) OPEN  ; channel: short_summaries
  port 1: (cntrl) -     ; channel: $

[process] quast
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] multiqc_assembly
  status=ACTIVE
  port 0: (value) bound ; channel: trim_logs
  port 1: (value) OPEN  ; channel: quast_stats
  port 2: (value) OPEN  ; channel: prokka_output
  port 3: (value) OPEN  ; channel: busco_summaries
  port 4: (cntrl) -     ; channel: $

[process] gtdbtk_classify_wf
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (value) bound ; channel: gtdb_database
  port 2: (cntrl) -     ; channel: $

[process] rMLST
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (value) bound ; channel: rMLST_database
  port 2: (cntrl) -     ; channel: $

[process] call_rMLST
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (value) bound ; channel: bigs_database_rMLST
  port 2: (cntrl) -     ; channel: $

[process] make_one_contig
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] indexRemapping
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] alignRemapping
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] parse_sam_for_insertsize
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] samtoolsRemapping
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] pilon_remapping
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] coverage_pilon_corrected
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] typing_16S
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] abricate
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] summary_sample
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] merge_summaries
  status=ACTIVE
  port 0: (value) OPEN  ; channel: sample_quality
  port 1: (cntrl) -     ; channel: $

[process] links_for_transfer
  status=ACTIVE
  port 0: (queue) OPEN  ; channel: -
  port 1: (cntrl) -     ; channel: $

[process] write_software_versions
  status=ACTIVE
  port 0: (value) OPEN  ; channel: version_files
  port 1: (cntrl) -     ; channel: $

Apr-16 01:55:31.676 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
[... identical process list repeated ...]

Apr-16 02:00:31.694 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:

Does anyone have hints on what to look into further? I can't see any useful error messages anywhere in the log file.
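
One way to narrow down when the stall begins is to check when the executor last polled the SLURM queue (the "fetching queue global status" message quoted above):

grep "fetching queue" .nextflow.log | tail -n 5   # these lines stop appearing once the run stalls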

pbieberstein avatar Apr 16 '24 00:04 pbieberstein

@pbieberstein: have you tried creating a Java thread dump, on the node where the calculation that wrote the .exitcode file was running, and on the node that tries to read this file? As can be seen from my thread dump https://github.com/nextflow-io/nextflow/issues/2695#issuecomment-1635939435, the problem in my case was not in Nextflow itself but, I suspect, somewhere in the interaction of Java with the file/storage system on the various Slurm nodes. Have you tried calling "ls", "lsof" or "stat" on the .exitcode file on the Slurm node where the calculation was running, or on the Nextflow control node?
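
For completeness, one way to take such a dump of the Nextflow head process (assuming a JDK with jstack on the PATH; the pgrep pattern is only a guess at how the head process shows up in the process table):

NXF_PID=$(pgrep -f 'nextflow.cli.Launcher' | head -n 1)   # find the head JVM
jstack "$NXF_PID" > nextflow-threads.txt                   # write the thread dump
# alternatively: kill -3 "$NXF_PID"   # SIGQUIT makes the JVM print the dump to its stdout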

hmehlan avatar Apr 16 '24 07:04 hmehlan

Hi @hmehlan, many thanks for your helpful response! I called "ls" and "stat" on .exitcode and some other files in the work directory of one of the jobs that had finished but was not recognized as such, but it didn't do anything for me. I did this from a SLURM interactive session running on the same node as the control job, and also on the same node where the calculation had run. The Nextflow master process was still running while I did this, but it still didn't recognize the process as completed. The Java thread dump does look interesting, but I can't quite identify what's going on yet (I'm not so familiar with Java debugging yet); do you see something suspicious here?

2024-04-19 21:34:20
Full thread dump OpenJDK 64-Bit Server VM (21-internal-adhoc.conda.src mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x000015442c002c80, length=19, elements={
0x000015457401b350, 0x00001545741406c0, 0x0000154574142a20, 0x00001545741444b0,
0x0000154574145a70, 0x0000154574147020, 0x0000154574148b70, 0x000015457414a1f0,
0x000015457419f060, 0x00001545741a1630, 0x0000154574ec08a0, 0x00001544b0002a80,
0x00001544b0009d20, 0x00001544b0006a60, 0x00001544b0003fc0, 0x00001544a0127e90,
0x00001544980d6250, 0x00001544cc1dadf0, 0x00001544a0133ec0
}

"main" #1 [3720502] prio=5 os_prio=0 cpu=5116.78ms elapsed=18638.65s tid=0x000015457401b350 nid=3720502 waiting on condition  [0x0000154578e09000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x00000000820c89b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@21-internal/AbstractQueuedSynchronizer.java:519)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21-internal/ForkJoinPool.java:3780)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21-internal/ForkJoinPool.java:3725)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21-internal/AbstractQueuedSynchronizer.java:1707)
	at nextflow.util.Barrier$_awaitCompletion_closure3.doCall(Barrier.groovy:92)
	at nextflow.util.Barrier$_awaitCompletion_closure3.call(Barrier.groovy)
	at nextflow.extension.Bolts.withLock(Bolts.groovy:337)
	at nextflow.extension.Bolts.withLock(Bolts.groovy)
	at nextflow.util.Barrier.awaitCompletion(Barrier.groovy:89)
	at nextflow.Session.await(Session.groovy:659)
	at nextflow.script.ScriptRunner.await(ScriptRunner.groovy:255)
	at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:143)
	at nextflow.cli.CmdRun.run(CmdRun.groovy:372)
	at nextflow.cli.Launcher.run(Launcher.groovy:500)
	at nextflow.cli.Launcher.main(Launcher.groovy:672)

"Reference Handler" #9 [3720510] daemon prio=10 os_prio=0 cpu=6.12ms elapsed=18639.97s tid=0x00001545741406c0 nid=3720510 waiting on condition  [0x0000154558a48000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.ref.Reference.waitForReferencePendingList(java.base@21-internal/Native Method)
	at java.lang.ref.Reference.processPendingReferences(java.base@21-internal/Reference.java:246)
	at java.lang.ref.Reference$ReferenceHandler.run(java.base@21-internal/Reference.java:208)

"Finalizer" #10 [3720511] daemon prio=8 os_prio=0 cpu=0.43ms elapsed=18640.99s tid=0x0000154574142a20 nid=3720511 in Object.wait()  [0x0000154558947000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait0(java.base@21-internal/Native Method)
	- waiting on <0x00000000802b5590> (a java.lang.ref.NativeReferenceQueue$Lock)
	at java.lang.Object.wait(java.base@21-internal/Object.java:366)
	at java.lang.Object.wait(java.base@21-internal/Object.java:339)
	at java.lang.ref.NativeReferenceQueue.await(java.base@21-internal/NativeReferenceQueue.java:48)
	at java.lang.ref.ReferenceQueue.remove0(java.base@21-internal/ReferenceQueue.java:158)
	at java.lang.ref.NativeReferenceQueue.remove(java.base@21-internal/NativeReferenceQueue.java:89)
	- locked <0x00000000802b5590> (a java.lang.ref.NativeReferenceQueue$Lock)
	at java.lang.ref.Finalizer$FinalizerThread.run(java.base@21-internal/Finalizer.java:173)

"Signal Dispatcher" #11 [3720512] daemon prio=9 os_prio=0 cpu=0.35ms elapsed=18641.47s tid=0x00001545741444b0 nid=3720512 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Service Thread" #12 [3720513] daemon prio=9 os_prio=0 cpu=410.47ms elapsed=18641.98s tid=0x0000154574145a70 nid=3720513 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Monitor Deflation Thread" #13 [3720514] daemon prio=9 os_prio=0 cpu=1675.44ms elapsed=18642.44s tid=0x0000154574147020 nid=3720514 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #14 [3720515] daemon prio=9 os_prio=0 cpu=25514.02ms elapsed=18643.02s tid=0x0000154574148b70 nid=3720515 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"C1 CompilerThread0" #16 [3720516] daemon prio=9 os_prio=0 cpu=4900.05ms elapsed=18643.20s tid=0x000015457414a1f0 nid=3720516 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"Common-Cleaner" #17 [3720517] daemon prio=8 os_prio=0 cpu=25.19ms elapsed=18643.61s tid=0x000015457419f060 nid=3720517 runnable  [0x0000154558270000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x00000000802b56c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21-internal/LockSupport.java:269)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21-internal/AbstractQueuedSynchronizer.java:1847)
	at java.lang.ref.ReferenceQueue.await(java.base@21-internal/ReferenceQueue.java:71)
	at java.lang.ref.ReferenceQueue.remove0(java.base@21-internal/ReferenceQueue.java:143)
	at java.lang.ref.ReferenceQueue.remove(java.base@21-internal/ReferenceQueue.java:218)
	at jdk.internal.ref.CleanerImpl.run(java.base@21-internal/CleanerImpl.java:140)
	at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
	at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)
	at jdk.internal.misc.InnocuousThread.run(java.base@21-internal/InnocuousThread.java:186)

"Notification Thread" #18 [3720518] daemon prio=9 os_prio=0 cpu=0.14ms elapsed=18645.02s tid=0x00001545741a1630 nid=3720518 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"ForkJoinPool-1-worker-1" #26 [3720526] daemon prio=5 os_prio=0 cpu=17155.27ms elapsed=18642.31s tid=0x0000154574ec08a0  [0x0000154549bf9000]
   Carrying virtual thread #33
	at jdk.internal.vm.Continuation.run(java.base@21-internal/Continuation.java:251)
	at java.lang.VirtualThread.runContinuation(java.base@21-internal/VirtualThread.java:223)
	at java.lang.VirtualThread$$Lambda/0x00001544fc342348.run(java.base@21-internal/Unknown Source)
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(java.base@21-internal/ForkJoinTask.java:1423)
	at java.util.concurrent.ForkJoinTask.doExec(java.base@21-internal/ForkJoinTask.java:387)
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@21-internal/ForkJoinPool.java:1312)
	at java.util.concurrent.ForkJoinPool.scan(java.base@21-internal/ForkJoinPool.java:1843)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1808)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)

"VirtualThread-unparker" #27 [3720527] daemon prio=5 os_prio=0 cpu=6497.62ms elapsed=18643.06s tid=0x00001544b0002a80 nid=3720527 runnable  [0x0000154549af9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x0000000081db5df0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21-internal/LockSupport.java:269)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21-internal/AbstractQueuedSynchronizer.java:1758)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@21-internal/ScheduledThreadPoolExecutor.java:1182)
	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@21-internal/ScheduledThreadPoolExecutor.java:899)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@21-internal/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21-internal/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21-internal/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
	at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)
	at jdk.internal.misc.InnocuousThread.run(java.base@21-internal/InnocuousThread.java:186)

"ForkJoinPool-1-worker-2" #162 [3720661] daemon prio=5 os_prio=0 cpu=18833.40ms elapsed=18638.53s tid=0x00001544b0009d20 nid=3720661 waiting on condition  [0x00001545493f4000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
	at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1893)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)

"ForkJoinPool-1-worker-3" #164 [3720662] daemon prio=5 os_prio=0 cpu=16503.86ms elapsed=18639.72s tid=0x00001544b0006a60 nid=3720662 runnable  [0x0000154548eef000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.parkUntil(java.base@21-internal/LockSupport.java:449)
	at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1891)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)

"ForkJoinPool-1-worker-4" #165 [3720663] daemon prio=5 os_prio=0 cpu=18765.67ms elapsed=18640.00s tid=0x00001544b0003fc0 nid=3720663 waiting on condition  [0x0000154548aeb000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
	at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1893)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)

"ForkJoinPool-1-worker-5" #14867 [3762619] daemon prio=5 os_prio=0 cpu=7270.12ms elapsed=9525.28s tid=0x00001544a0127e90 nid=3762619 waiting on condition  [0x00001544f5fee000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
	at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1893)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)

"PublishDir-4135" #17107 [3766036] prio=5 os_prio=0 cpu=0.33ms elapsed=8400.33s tid=0x00001544980d6250 nid=3766036 runnable  [0x00001545494f5000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.LinuxNativeDispatcher.directCopy0(java.base@21-internal/Native Method)
	at sun.nio.fs.LinuxFileSystem.directCopy(java.base@21-internal/LinuxFileSystem.java:159)
	at sun.nio.fs.UnixFileSystem.copyFile(java.base@21-internal/UnixFileSystem.java:682)
	at sun.nio.fs.UnixFileSystem.copy(java.base@21-internal/UnixFileSystem.java:1060)
	at sun.nio.fs.UnixFileSystemProvider.copy(java.base@21-internal/UnixFileSystemProvider.java:300)
	at nextflow.file.FileHelper.copyPath(FileHelper.groovy:951)
	at nextflow.processor.PublishDir.processFileImpl(PublishDir.groovy:496)
	at nextflow.processor.PublishDir.processFile(PublishDir.groovy:385)
	at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:362)
	at java.lang.invoke.LambdaForm$DMH/0x00001544fc3f2800.invokeVirtual(java.base@21-internal/LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$MH/0x00001544fc59a000.invoke(java.base@21-internal/LambdaForm$MH)
	at java.lang.invoke.LambdaForm$MH/0x00001544fc15d000.invokeExact_MT(java.base@21-internal/LambdaForm$MH)
	at jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(java.base@21-internal/DirectMethodHandleAccessor.java:155)
	at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(java.base@21-internal/DirectMethodHandleAccessor.java:103)
	at java.lang.reflect.Method.invoke(java.base@21-internal/Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1036)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1019)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
	at nextflow.processor.PublishDir$_apply1_closure2.doCall(PublishDir.groovy:337)
	at nextflow.processor.PublishDir$_apply1_closure2.call(PublishDir.groovy)
	at groovy.lang.Closure.run(Closure.java:498)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21-internal/Executors.java:572)
	at java.util.concurrent.FutureTask.run(java.base@21-internal/FutureTask.java:317)
	at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
	at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)

"leveldb-compaction-0" #24118 [3779302] prio=5 os_prio=0 cpu=177.93ms elapsed=6090.47s tid=0x00001544cc1dadf0 nid=3779302 waiting on condition  [0x0000154548dee000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x00000000820b4348> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@21-internal/AbstractQueuedSynchronizer.java:519)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21-internal/ForkJoinPool.java:3780)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21-internal/ForkJoinPool.java:3725)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21-internal/AbstractQueuedSynchronizer.java:1707)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@21-internal/LinkedBlockingQueue.java:435)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@21-internal/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21-internal/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21-internal/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
	at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)

"ForkJoinPool-1-worker-6" #24650 [3780205] daemon prio=5 os_prio=0 cpu=3025.30ms elapsed=5971.12s tid=0x00001544a0133ec0 nid=3780205 waiting on condition  [0x00001544f50df000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
	- parking to wait for  <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
	at java.util.concurrent.locks.LockSupport.parkUntil(java.base@21-internal/LockSupport.java:449)
	at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1891)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)

"VM Thread" os_prio=0 cpu=660.25ms elapsed=18652.10s tid=0x00001545741305c0 nid=3720509 runnable

"GC Thread#0" os_prio=0 cpu=546.58ms elapsed=18652.42s tid=0x0000154574073da0 nid=3720503 runnable

"GC Thread#1" os_prio=0 cpu=531.21ms elapsed=18652.22s tid=0x00001544f0006870 nid=3720521 runnable

"GC Thread#2" os_prio=0 cpu=572.57ms elapsed=18652.89s tid=0x00001544f0007320 nid=3720522 runnable

"GC Thread#3" os_prio=0 cpu=559.83ms elapsed=18653.25s tid=0x00001544f000f4b0 nid=3720523 runnable

"G1 Main Marker" os_prio=0 cpu=0.70ms elapsed=18654.45s tid=0x00001545740853a0 nid=3720504 runnable

"G1 Conc#0" os_prio=0 cpu=180.53ms elapsed=18655.01s tid=0x0000154574086350 nid=3720505 runnable

"G1 Refine#0" os_prio=0 cpu=2204.85ms elapsed=18655.38s tid=0x00001545741033e0 nid=3720506 runnable

"G1 Service" os_prio=0 cpu=548.14ms elapsed=18655.79s tid=0x00001545741043a0 nid=3720507 runnable

"VM Periodic Task Thread" os_prio=0 cpu=7946.37ms elapsed=18655.87s tid=0x0000154574116c60 nid=3720508 waiting on condition

JNI global refs: 17, weak refs: 0

Heap
 garbage-first heap   total 324608K, used 282114K [0x0000000080000000, 0x0000000100000000)
  region size 1024K, 165 young (168960K), 12 survivors (12288K)
 Metaspace       used 43518K, committed 45504K, reserved 1114112K
  class space    used 5081K, committed 6016K, reserved 1048576K

pbieberstein avatar Apr 19 '24 20:04 pbieberstein

@pbieberstein: As far as I can see from the thread dump, nextflow is busy copying files:

   java.lang.Thread.State: RUNNABLE
	at sun.nio.fs.LinuxNativeDispatcher.directCopy0(java.base@21-internal/Native Method)
	at sun.nio.fs.LinuxFileSystem.directCopy(java.base@21-internal/LinuxFileSystem.java:159)
	at sun.nio.fs.UnixFileSystem.copyFile(java.base@21-internal/UnixFileSystem.java:682)
	at sun.nio.fs.UnixFileSystem.copy(java.base@21-internal/UnixFileSystem.java:1060) 

This should not be the problem, unless Nextflow is stuck at this point. It is best to create another thread dump after some time and compare the two. There are free Java thread dump analysers on the net, but even these gave me no further findings.
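
A simple way to do that comparison, under the same assumptions as the jstack sketch above:

jstack "$NXF_PID" > dump1.txt
sleep 300                      # wait a few minutes between dumps
jstack "$NXF_PID" > dump2.txt
diff dump1.txt dump2.txt       # a thread still RUNNABLE at the exact same frame is likely stuck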

hmehlan avatar Apr 25 '24 09:04 hmehlan

@hmehlan

The "lsof trick" seems to do the job for us. I saw that, like us, you are using cephfs. In our case, the problem arose just after the update from an older ceph version to Ceph Reef 18.2.2 (not sure it is related, though).

Thanks for your interesting fix.

sybrohee avatar Jul 03 '24 07:07 sybrohee