SLURM jobs being set as RUNNING when the actual state of the job is failed
Bug report
Expected behavior and actual behavior
Situation:
When Nextflow submits SLURM jobs to a SLURM node, the node sometimes fails to start up. The SLURM job then gets the NF (Node Fail) status.
Expected behaviour: Nextflow should report that this job failed because of node failure.
Actual behaviour: Nextflow treats the job as started (because the code does not handle this case), but it never finds the .exitcode file and never marks the job as active, since the job never reaches the RUNNING state. The job eventually fails with an error caused by exitReadTimeout.
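To illustrate what I suspect is happening, here is a minimal Python sketch. It is not Nextflow's actual code: the `QueueStatus` enum, the `SLURM_STATES` table and both functions are hypothetical names made up for this example, and the table only covers a few of the compact state codes that `squeue` prints. It assumes the "started" check simply means "not pending", which would explain why an NF job is counted as started.

```python
from enum import Enum


class QueueStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    ERROR = "ERROR"


# Hypothetical subset of SLURM compact state codes (as printed by `squeue -o %t`).
SLURM_STATES = {
    "PD": QueueStatus.PENDING,  # PENDING
    "R": QueueStatus.RUNNING,   # RUNNING
    "F": QueueStatus.ERROR,     # FAILED
    "NF": QueueStatus.ERROR,    # NODE_FAIL
}


def naive_is_started(code: str) -> bool:
    """Assumed current logic: any known state other than PENDING counts as started."""
    status = SLURM_STATES.get(code)
    return status is not None and status is not QueueStatus.PENDING


def classify(code: str) -> str:
    """Possible alternative: report error states explicitly instead of 'started'."""
    status = SLURM_STATES.get(code)
    if status is None:
        return "unknown"
    if status is QueueStatus.ERROR:
        return "failed"
    if status is QueueStatus.PENDING:
        return "pending"
    return "started"


if __name__ == "__main__":
    for code in ("PD", "R", "NF"):
        print(code, naive_is_started(code), classify(code))
```

With the naive check, a job in NF state looks the same as a running job, so the only thing that ends the wait is the exit-file timeout. Handling the error state separately would let the failure be reported right away.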
Steps to reproduce the problem
Configure a Nextflow process to run on a SLURM node that fails to start up.
Program output
I don't have the output, but I have the nextflow.log file with TRACE enabled. Please look for Job ID 4078 at lines 13076, 13101 and 13126.
Environment
- Nextflow version: 22.10.6
- Java version: 17.0.9
- Operating system: Linux
- Bash version: GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)
Additional context
I think the problem is in https://github.com/nextflow-io/nextflow/blob/a0f69025854d843e0e12bac651c86bc552642e76/modules/nextflow/src/main/groovy/nextflow/executor/AbstractGridExecutor.groovy#L371-L385 .
This might be related to https://github.com/nextflow-io/nextflow/issues/4962 , but I am not sure... I didn't read through the full logs.