
SLURM jobs being set as RUNNING when the actual state of the job is failed

Open sgopalan98 opened this issue 5 months ago • 6 comments

Bug report

Expected behavior and actual behavior

Situation: When Nextflow submits jobs to a SLURM node, the node sometimes fails to start up. The SLURM job then ends up in the NF (Node Fail) state.

Expected behavior: Nextflow should report that the job failed because of a node failure.

Actual behavior: Nextflow treats the job as started (the code does not handle this case), but it never finds the .exitcode file and never marks the job as active, because the job never reaches the RUNNING state. The job eventually fails with an error once exitReadTimeout expires.

Steps to reproduce the problem

Configure a Nextflow process to run on a SLURM node that fails to start up.

Program output

I don't have the program output, but I do have the nextflow.log file with TRACE enabled. Please look for job ID 4078 at lines 13076, 13101, and 13126.

Environment

  • Nextflow version: 22.10.6
  • Java version: 17.0.9
  • Operating system: Linux
  • Bash version: GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

Additional context

The problem I think is in https://github.com/nextflow-io/nextflow/blob/a0f69025854d843e0e12bac651c86bc552642e76/modules/nextflow/src/main/groovy/nextflow/executor/AbstractGridExecutor.groovy#L371-L385 .
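To make the suspected gap concrete, here is a minimal Python sketch of the three checks involved. It is not Nextflow's actual Groovy code: the `QueueStatus` names, the state-code mapping, and the helper functions `parse_queue`, `is_started`, `is_active`, and `is_failed` are all assumptions made for illustration. The input is assumed to look like the output of something such as `squeue --noheader -o "%i %t"` (job ID followed by the compact state code). Job 4078 is the one from the log; the other two job IDs are made up.

```python
from enum import Enum


class QueueStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    HOLD = "HOLD"
    ERROR = "ERROR"
    DONE = "DONE"
    UNKNOWN = "UNKNOWN"


# Assumed mapping from SLURM compact state codes to coarse statuses.
# This is an illustrative table, not copied from Nextflow.
SLURM_STATUS = {
    "PD": QueueStatus.PENDING,
    "CF": QueueStatus.PENDING,
    "R": QueueStatus.RUNNING,
    "CG": QueueStatus.RUNNING,
    "S": QueueStatus.HOLD,
    "ST": QueueStatus.HOLD,
    "CD": QueueStatus.DONE,
    "F": QueueStatus.ERROR,
    "CA": QueueStatus.ERROR,
    "TO": QueueStatus.ERROR,
    "NF": QueueStatus.ERROR,  # node failure
}


def parse_queue(text):
    """Parse lines of '<job id> <state code>' into {job_id: QueueStatus}."""
    queue = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        job_id, code = parts
        queue[job_id] = SLURM_STATUS.get(code, QueueStatus.UNKNOWN)
    return queue


def is_started(queue, job_id):
    """A job counts as started once it is listed and no longer PENDING."""
    return job_id in queue and queue[job_id] is not QueueStatus.PENDING


def is_active(queue, job_id):
    """A job counts as active only while it is RUNNING or on HOLD."""
    return queue.get(job_id) in (QueueStatus.RUNNING, QueueStatus.HOLD)


def is_failed(queue, job_id):
    """Hypothetical extra check: the scheduler itself reports a failure."""
    return queue.get(job_id) is QueueStatus.ERROR


# Made-up queue listing; only job 4078 (NF) corresponds to the report.
sample = "4077 R\n4078 NF\n4079 PD\n"
queue = parse_queue(sample)

for job_id in ("4077", "4078", "4079"):
    print(job_id, is_started(queue, job_id), is_active(queue, job_id), is_failed(queue, job_id))
```

In this sketch the NF job is "started" but not "active", which matches the symptom described above: with only those two checks, the caller is left waiting for an .exitcode file that will never appear. An explicit failure check along the lines of `is_failed` would let the job be reported as failed right away instead of waiting for the timeout.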

This might be related to https://github.com/nextflow-io/nextflow/issues/4962 , but I am not sure, since I didn't read through the full logs.

sgopalan98, Sep 12 '24 04:09