Grid executor reports invalid queue status in SLURM executor
Bug report
I am experiencing the very same problem described in #1045, but in my case I am working with the SLURM executor. The main problem is that for jobs lasting more than a few minutes, Nextflow decides that the job has failed. However, the job keeps running, and later on I can find the output in the work directory, with an exit code file that says '0'.
I had worked with Nextflow on this server some weeks ago and did not have this issue; it appeared this week.
Steps to reproduce the problem
Currently, this happens with all jobs that run longer than a few minutes. The first attempt fails, and sometimes the retry succeeds.
Program output
The relevant section of the .nextflow.log is:
runRecombClust (1); status: RUNNING; exit: -; error: -; workDir: /mnt/home/users/isglobal_001_bcn/carlosruiz/unidad_z/recombClust/inversionSimulation/work/32/2a304b685e017c5b1b9f6747e7970a started: 1592847037348; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270008
Current queue status:
> (empty)
Content of workDir: /mnt/home/users/isglobal_001_bcn/carlosruiz/unidad_z/recombClust/inversionSimulation/work/32/2a304b685e017c5b1b9f6747e7970a
null
jun-22 19:37:42.356 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 8147378; id: 1; name: runRecombClust (1); status: COMPLETED; exit: -; error: -; workDir: /mnt/home/users/isglobal_001_bcn/carlosruiz/unidad_z/recombClust/inversionSimulation/work/32/2a304b685e017c5b1b9f6747e7970a started: 1592847037348; exited: -; ]
jun-22 19:37:42.392 [Task monitor] INFO nextflow.processor.TaskProcessor - [32/2a304b] NOTE: Process `runRecombClust (1)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
I have updated Nextflow to 20.04 and the issue persists.
Environment
- Nextflow version: 20.04.1 build 5335
- Java version: 1.8.0_60
- Operating system: Linux
- Slurm: 17.11.4
- Bash version: GNU bash, version 4.3.42(1)-release
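For context, the exitStatusReadTimeoutMillis: 270000 value in the log above corresponds to Nextflow's executor.exitReadTimeout setting (default 270 sec): it is how long the grid executor waits for the .exitcode file after the scheduler stops listing the job. Below is a minimal sketch of raising it in nextflow.config, assuming a slow shared filesystem is the culprit; the 10 min value is an arbitrary example, and this is a workaround rather than a fix for the root cause.

process.executor = 'slurm'

executor {
    // Grace period Nextflow waits for the .exitcode file after the scheduler
    // stops reporting the job; the log above shows the default of 270 sec (270000 ms).
    exitReadTimeout = '10 min'   // arbitrary example value, not a recommendation
}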
We're experiencing the same issue with a SLURM executor. The job completes successfully but trips one or more of Nextflow's checks, so the output never gets published and the entire pipeline is terminated. The relevant portion of .nextflow.log is as follows:
Content of workDir: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff
> total 4
> drwxrwxr-x 2 mn215 mn215 4096 Jul 21 12:42 .
> drwxrwxr-x 2 mn215 mn215 4096 Jul 21 12:39 ..
> -rw-rw-r-- 1 mn215 mn215 3402 Jul 21 12:42 .command.run
> -rw-rw-r-- 1 mn215 mn215 190 Jul 21 12:42 .command.sh
Jul-21 13:17:29.372 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 13928531; id: 424; name: quantification (57); status: COMPLETED; exit: -; error: -; workDir: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff started: 1595351423188; exited: -; ]
Jul-21 13:17:29.440 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff/.command.out
Jul-21 13:17:29.442 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff/.command.err
Jul-21 13:17:29.443 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff/.command.log
Jul-21 13:17:29.445 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'quantification (57)'
Caused by:
Process `quantification (57)` terminated for an unknown reason -- Likely it has been terminated by the external system
However, direct inspection of the work directory after the pipeline terminates shows that the files nextflow couldn't find are in fact there:
(base) [as773@compute-e-16-233 fc4c2ef099116103e6183d117acdff]$ ls -la
total 6102
drwxrwxr-x 2 mn215 mn215 4096 Jul 21 13:18 .
drwxrwxr-x 2 mn215 mn215 4096 Jul 21 12:39 ..
-rw-rw-r-- 1 mn215 mn215 6241079 Jul 21 13:18 66.csv
lrwxrwxrwx 1 mn215 mn215 59 Jul 21 13:18 66.tif -> /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11/dearray/66.tif
lrwxrwxrwx 1 mn215 mn215 101 Jul 21 13:18 cellMask.tif -> /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/d2/a6febc6e2974a4169a1b28a86bbd49/66/cellMask.tif
-rw-rw-r-- 1 mn215 mn215 0 Jul 21 13:18 .command.begin
-rw-rw-r-- 1 mn215 mn215 0 Jul 21 13:18 .command.err
-rw-rw-r-- 1 mn215 mn215 906 Jul 21 13:18 .command.log
-rw-rw-r-- 1 mn215 mn215 906 Jul 21 13:18 .command.out
-rw-rw-r-- 1 mn215 mn215 3402 Jul 21 12:42 .command.run
-rw-rw-r-- 1 mn215 mn215 190 Jul 21 12:42 .command.sh
-rw-rw-r-- 1 mn215 mn215 1 Jul 21 13:18 .exitcode
lrwxrwxrwx 1 mn215 mn215 56 Jul 21 13:18 markers.csv -> /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11/markers.csv
In the above, 66.csv is the output file produced by the corresponding process. We manually verified that it has the expected content. Additionally, .exitcode is 0, implying that the process finished successfully.
One potential explanation is given in this comment: https://github.com/nextflow-io/nextflow/issues/931#issuecomment-470156921 which suggests that:
Once the exit code file is detected, the main Nextflow program will immediately (<1ms?) check for the presence of the output files. If the output files have not appeared yet, Nextflow does not wait for them, and the process will fail.
It seems that there's a related issue at https://github.com/nextflow-io/nextflow/issues/1078, but it was closed due to inactivity.
I'm running into the same issue, also with v20.04.1 + SLURM, with one of the jobs continuing to run even though it is listed as failed. The error occurs exactly after 7 minutes, as in #1235.
Last few lines of the trace from -with-trace:
...
1234 55/549cbd 9657621 quandenser_parallel_1_dinosaur (1233) COMPLETED 0 2020-07-22 10:10:48.067 8.1s 6.2s 7.4% 21.7 MB 108.5 MB 9 MB 2.6 MB
1229 0b/92f769 9657622 quandenser_parallel_1_dinosaur (1228) COMPLETED 0 2020-07-22 10:10:48.297 7.9s 6.2s 7.4% 21.6 MB 108.5 MB 9 MB 2.6 MB
1246 49/bf6c07 9657623 quandenser_parallel_1_dinosaur (1245) COMPLETED 0 2020-07-22 10:10:48.333 8s 6.2s 7.2% 21.6 MB 108.5 MB 8.7 MB 2.5 MB
1245 2d/c8f5d9 9657624 quandenser_parallel_1_dinosaur (1244) COMPLETED 0 2020-07-22 10:10:48.364 8.1s 6.2s 8.3% 21.8 MB 108.5 MB 8.7 MB 2.5 MB
1248 fc/461434 9657625 quandenser_parallel_2_maracluster_parallel_1_index (1) COMPLETED 0 2020-07-22 10:10:58.142 48.6s 35.2s 19.2% 1.2 GB 1.3 GB 564.9 KB 803.8 KB
1252 82/ec6d00 9657628 quandenser_parallel_2_maracluster_parallel_2_pvalue (4) COMPLETED 0 2020-07-22 10:11:47.815 18.9s 12.2s 15.3% 9.1 MB 55.1 MB 280.9 KB 292.9 KB
1251 3e/3753ee 9657629 quandenser_parallel_2_maracluster_parallel_2_pvalue (3) FAILED - 2020-07-22 10:11:47.874 7m 4s 7m - - - - -
1249 d0/09c159 9657630 quandenser_parallel_2_maracluster_parallel_2_pvalue (1) ABORTED - 2020-07-22 10:11:47.913 - - - - - - -
1253 df/9e4796 9657631 quandenser_parallel_2_maracluster_parallel_2_pvalue (5) ABORTED - 2020-07-22 10:11:47.955 - - - - - - -
1254 16/6246a6 9657632 quandenser_parallel_2_maracluster_parallel_2_pvalue (6) ABORTED - 2020-07-22 10:11:47.989 - - - - - - -
1250 e7/609cec 9657633 quandenser_parallel_2_maracluster_parallel_2_pvalue (2) ABORTED - 2020-07-22 10:11:48.032 - - - - - - -
1256 3b/f412aa 9657634 quandenser_parallel_2_maracluster_parallel_2_pvalue (8) ABORTED - 2020-07-22 10:11:48.113 - - - - - - -
1255 24/ee7685 9657635 quandenser_parallel_2_maracluster_parallel_2_pvalue (7) ABORTED - 2020-07-22 10:11:48.155 - - - - - - -
1257 51/ca18e5 9657636 quandenser_parallel_2_maracluster_parallel_2_pvalue (9) ABORTED - 2020-07-22 10:11:48.193 - - - - - - -
1258 e4/765c35 9657637 quandenser_parallel_2_maracluster_parallel_2_pvalue (10) ABORTED - 2020-07-22 10:11:48.235 - - - - - - -
1259 20/e221df 9657638 quandenser_parallel_2_maracluster_parallel_2_pvalue (11) ABORTED - 2020-07-22 10:11:48.275 - - - - - - -
1260 ce/cd8ba4 9657639 quandenser_parallel_2_maracluster_parallel_2_pvalue (12) ABORTED - 2020-07-22 10:11:48.336 - - - - - - -
1249 d0/09c159 9657630 quandenser_parallel_2_maracluster_parallel_2_pvalue (1) FAILED - 2020-07-22 10:11:47.913 7m 4s 7m - - - - -
In my case, the problem turned out to be that SLURM, for some unknown reason, submitted the jobs to a different queue than the one specified by the process.queue configuration option. As a result, the squeue command did not list these jobs, and Nextflow failed because it could not get their status.
Removing the process.queue setting from the configuration file (so that SLURM uses its default queue at submission and squeue lists all the jobs) fixed things for me.
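For anyone hitting the same mismatch, here is a minimal sketch of the change described above; the executor line and the commented-out queue name are placeholders, not values taken from this thread:

// nextflow.config -- sketch of the workaround described above
process {
    executor = 'slurm'
    // queue = 'somequeue'
    // Removing (or commenting out) the explicit queue lets SLURM route jobs to its
    // default partition, so they show up again in the plain squeue call that
    // Nextflow issues when polling their status.
}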
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Maybe not stale. I am having the same problem and have not figured out a solution.
-- exitStatusReadTimeoutMillis: 270000; delta: 422931
Current queue status:
> job: 871400: ERROR
> job: 871398: ERROR
> job: 871399: ERROR
> job: 871397: ERROR
> job: 871396: ERROR
> job: 871394: ERROR
> job: 871395: ERROR
> job: 871392: ERROR
> job: 871393: ERROR
> job: 871391: ERROR
This is my typical error message; I did not have such issues before, it just suddenly appeared.
N E X T F L O W version 20.10.0 build 5430 created 01-11-2020 15:14 UTC (07:14 PDT) cite doi:10.1038/nbt.3820 http://nextflow.io
That is my Nextflow version.
And my pipeline did eventually die, after so many errors. Usually my work directory contains only .command.run and .command.sh (besides . and ..); the symlink for the input file has not been set up yet. The error message is typically:
Error executing process > 'one_process (v112895)'
Caused by:
Process `one_process (v112895)` terminated for an unknown reason -- Likely it has been terminated by the external system
Look, the error status is coming from SLURM. If anyone has found a solution, please let me know.
This issue may be difficult for the software team to reproduce on their own machines, so the solution may have to come from the users.
Got it. Sorry. I will add to the same post.
It seems that my problem went away when I started in a new directory (a different NFS subdirectory). This does not solve the problem, but at least it is one thing to try. The only thing I can think of is that at the first launch something bad happened to my NFS and caused Nextflow to get the wrong information from SLURM, or, more likely, SLURM sent the wrong signal. I spent a whole day trying to relaunch from the first launch directory, reading almost every line of the log and the screen output, and the problem kept repeating. The SLURM log file jobcomp.log is also helpful. Just to be sure: my problem was due to a bad NFS, so the Nextflow developers don't need to do anything. I would say this issue is closed.
Hi @kemin711 ,
If you press the three dots "..." in the top-right corner of the post, there is an option to edit your post. My general recommendation is to consolidate all information into a single post to make it easier to follow. (It will also prevent a string of e-mails to everybody who is watching this issue, every time you want to add more information.)
Best, -Artem
I agree, GitHub issues are not a chat.
I have the same issue and I want to ask others whether a retry works.
I am considering:
process {
    withName: '.*' {
        errorStrategy = 'retry'
        maxErrors = '2'
    }
}
Update: I can confirm the above does not fix it for me.
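One note on the snippet above, in case others try the same thing: with errorStrategy 'retry' the number of re-submissions is governed by maxRetries (default 1), while maxErrors only caps the total error count for a process. A fuller retry block is sketched below purely as an illustration; as the update above confirms, retrying only masks individual failures and does not address the underlying status problem.

process {
    withName: '.*' {
        errorStrategy = 'retry'   // re-submit a failed task instead of aborting the run
        maxRetries    = 2         // how many times each failed task is re-submitted (default 1)
        maxErrors     = -1        // no per-process cap on total errors (the default)
    }
}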
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi,
It looks like I have a similar issue: submitting SLURM jobs to a given queue results in
invalid status line: squeue: error: Invalid user: ?
And the job is considered by Nextflow as
Process PrepareGenome:BWAIndexGenome (1) terminated for an unknown reason -- Likely it has been terminated by the external system
even though it’s still running.
On this cluster, not setting the queue is not an option. Any ideas on how to proceed? Full log attached. Thanks! nextflow.log
More than the queue itself, it looks like the problem is the lack of the user option when Nextflow runs the squeue command.
I've tried user.name = myuser in nextflow.config to no avail. How does System.getProperty work?
Relevant code looks to be https://github.com/nextflow-io/nextflow/blob/58748225f9a2b296779f459016537bcf466dc645/src/main/groovy/nextflow/executor/SlurmExecutor.groovy#L121
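If I read that line correctly, the username passed to squeue comes from the JVM system property user.name, which is resolved by the Java runtime at startup, not from nextflow.config, so user.name = myuser in the config has no effect. A small Groovy sketch follows; the NXF_OPTS override is only an assumption about a possible workaround, not a documented fix, and is untested here.

// Groovy sketch: 'user.name' is a standard JVM system property, filled in by the
// Java runtime from the OS account the process runs under.
println System.getProperty('user.name')   // e.g. prints "myuser"

// It can only be changed when the JVM starts, for instance through NXF_OPTS,
// the environment variable Nextflow uses to pass extra JVM options:
//   NXF_OPTS='-Duser.name=myuser' nextflow run main.nf
// (shown only to illustrate where the value comes from)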
More debugging info:
- the issue does not seem to occur on Nextflow 22.04.0.5697
- With Groovy, println(System.getProperty('user.name')) returns my username as expected
After more testing:
- building from source results in version 22.04.5.5708: works
- installing it from nix results in version 22.04.5.5709: has the issue
- running the executable manually for 22.54.5709 (nix only fetches the executable): works
So I'm a bit stumped as to why the nix package does not work... it's basically the same file.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is still an issue with the latest Nextflow.