
Grid executor reports invalid queue status in SLURM executor

yocra3 opened this issue · 20 comments

Bug report

I am experiencing the very same problem described in #1045, but in my case I am working with a SLURM executor. The main problem is that, for jobs lasting more than a few minutes, Nextflow decides that the job has failed. However, the job execution continues, and later on I can find the output in the work dir, with the exit code file saying '0'.

I had worked with Nextflow on this server a few weeks ago and did not have this issue; it has appeared this week.

Steps to reproduce the problem

Currently, this is happening with every job that runs longer than a few minutes. The first attempt fails, and sometimes the retry works.

Program output

The relevant section of the .nextflow.log is:

runRecombClust (1); status: RUNNING; exit: -; error: -; workDir: /mnt/home/users/isglobal_001_bcn/carlosruiz/unidad_z/recombClust/inversionSimulation/work/32/2a304b685e017c5b1b9f6747e7970a started: 1592847037348; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270008
Current queue status:
>   (empty)

Content of workDir: /mnt/home/users/isglobal_001_bcn/carlosruiz/unidad_z/recombClust/inversionSimulation/work/32/2a304b685e017c5b1b9f6747e7970a
null
jun-22 19:37:42.356 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 8147378; id: 1; name: runRecombClust (1); status: COMPLETED; exit: -; error: -; workDir: /mnt/home/users/isglobal_001_bcn/carlosruiz/unidad_z/recombClust/inversionSimulation/work/32/2a304b685e017c5b1b9f6747e7970a started: 1592847037348; exited: -; ]
jun-22 19:37:42.392 [Task monitor] INFO  nextflow.processor.TaskProcessor - [32/2a304b] NOTE: Process `runRecombClust (1)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)

I have updated NF to 20.04 and the issue still persists.

Environment

  • Nextflow version: 20.04.1 build 5335
  • Java version: 1.8.0_60
  • Operating system: Linux
  • Slurm: 17.11.4
  • Bash version: GNU bash, version 4.3.42(1)-release

nextflow.log

yocra3 · Jun 23 '20 15:06

We're experiencing the same issue with a SLURM executor. The job completes successfully but trips one or more of Nextflow's checks, resulting in the output never getting published and the entire pipeline being terminated. The relevant portion of .nextflow.log is as follows:

Content of workDir: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff
> total 4
> drwxrwxr-x 2 mn215 mn215 4096 Jul 21 12:42 .
> drwxrwxr-x 2 mn215 mn215 4096 Jul 21 12:39 ..
> -rw-rw-r-- 1 mn215 mn215 3402 Jul 21 12:42 .command.run
> -rw-rw-r-- 1 mn215 mn215  190 Jul 21 12:42 .command.sh

Jul-21 13:17:29.372 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 13928531; id: 424; name: quantification (57); status: COMPLETED; exit: -; error: -; workDir: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff started: 1595351423188; exited: -; ]
Jul-21 13:17:29.440 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff/.command.out
Jul-21 13:17:29.442 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff/.command.err
Jul-21 13:17:29.443 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/01/fc4c2ef099116103e6183d117acdff/.command.log
Jul-21 13:17:29.445 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'quantification (57)'

Caused by:
  Process `quantification (57)` terminated for an unknown reason -- Likely it has been terminated by the external system

However, direct inspection of the work directory after the pipeline terminates shows that the files nextflow couldn't find are in fact there:

(base) [as773@compute-e-16-233 fc4c2ef099116103e6183d117acdff]$ ls -la
total 6102
drwxrwxr-x 2 mn215 mn215    4096 Jul 21 13:18 .
drwxrwxr-x 2 mn215 mn215    4096 Jul 21 12:39 ..
-rw-rw-r-- 1 mn215 mn215 6241079 Jul 21 13:18 66.csv
lrwxrwxrwx 1 mn215 mn215      59 Jul 21 13:18 66.tif -> /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11/dearray/66.tif
lrwxrwxrwx 1 mn215 mn215     101 Jul 21 13:18 cellMask.tif -> /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11_OUTPUT/d2/a6febc6e2974a4169a1b28a86bbd49/66/cellMask.tif
-rw-rw-r-- 1 mn215 mn215       0 Jul 21 13:18 .command.begin
-rw-rw-r-- 1 mn215 mn215       0 Jul 21 13:18 .command.err
-rw-rw-r-- 1 mn215 mn215     906 Jul 21 13:18 .command.log
-rw-rw-r-- 1 mn215 mn215     906 Jul 21 13:18 .command.out
-rw-rw-r-- 1 mn215 mn215    3402 Jul 21 12:42 .command.run
-rw-rw-r-- 1 mn215 mn215     190 Jul 21 12:42 .command.sh
-rw-rw-r-- 1 mn215 mn215       1 Jul 21 13:18 .exitcode
lrwxrwxrwx 1 mn215 mn215      56 Jul 21 13:18 markers.csv -> /n/scratch3/users/m/mn215/mcmicro/TMAs/TMA11/markers.csv

In the above, 66.csv is the output file produced by the corresponding process. We manually verified that it has the expected content. Additionally, .exitcode is 0, implying that the process finished successfully.

One potential explanation is given in this comment: https://github.com/nextflow-io/nextflow/issues/931#issuecomment-470156921 which suggests that:

Once the exit code file is detected, the main Nextflow program will immediately (<1ms?) check for the presence of the output files. If the output files have not appeared yet, Nextflow does not wait for them, and the process will fail.

It seems that there's a related issue at https://github.com/nextflow-io/nextflow/issues/1078, but it was closed due to inactivity.
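
If the race is indeed between the .exitcode file appearing and the output files becoming visible on the shared filesystem, one thing to try (a workaround sketch, not a confirmed fix for this issue) is giving the task monitor more time before it gives up. The executor.exitReadTimeout option below is an existing Nextflow setting and appears to be what the exitStatusReadTimeoutMillis: 270000 value in the first log excerpt above reflects (270 sec is its default); the values shown here are only examples.

    // nextflow.config -- sketch: more generous polling/timeouts for slow shared filesystems
    executor {
        exitReadTimeout   = '600 sec'   // how long to wait for .exitcode after the job
                                        // is no longer reported by the queue (default: 270 sec)
        pollInterval      = '30 sec'    // how often task status is checked
        queueStatInterval = '2 min'     // how often the global squeue status is refreshed
    }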

ArtemSokolov · Jul 21 '20 19:07

I'm running into the same issue, also with v20.04.1 + SLURM: one of the jobs continues to run even though it is listed as failed. The error occurs exactly after 7 minutes, as in #1235.

Last few lines of trace from -with-trace

...
1234    55/549cbd       9657621 quandenser_parallel_1_dinosaur (1233)   COMPLETED       0       2020-07-22 10:10:48.067 8.1s    6.2s    7.4%    21.7 MB 108.5 MB        9 MB    2.6 MB
1229    0b/92f769       9657622 quandenser_parallel_1_dinosaur (1228)   COMPLETED       0       2020-07-22 10:10:48.297 7.9s    6.2s    7.4%    21.6 MB 108.5 MB        9 MB    2.6 MB
1246    49/bf6c07       9657623 quandenser_parallel_1_dinosaur (1245)   COMPLETED       0       2020-07-22 10:10:48.333 8s      6.2s    7.2%    21.6 MB 108.5 MB        8.7 MB  2.5 MB
1245    2d/c8f5d9       9657624 quandenser_parallel_1_dinosaur (1244)   COMPLETED       0       2020-07-22 10:10:48.364 8.1s    6.2s    8.3%    21.8 MB 108.5 MB        8.7 MB  2.5 MB
1248    fc/461434       9657625 quandenser_parallel_2_maracluster_parallel_1_index (1)  COMPLETED       0       2020-07-22 10:10:58.142 48.6s   35.2s   19.2%   1.2 GB  1.3 GB  564.9 KB        803.8 KB
1252    82/ec6d00       9657628 quandenser_parallel_2_maracluster_parallel_2_pvalue (4) COMPLETED       0       2020-07-22 10:11:47.815 18.9s   12.2s   15.3%   9.1 MB  55.1 MB 280.9 KB        292.9 KB
1251    3e/3753ee       9657629 quandenser_parallel_2_maracluster_parallel_2_pvalue (3) FAILED  -       2020-07-22 10:11:47.874 7m 4s   7m      -       -       -       -       -
1249    d0/09c159       9657630 quandenser_parallel_2_maracluster_parallel_2_pvalue (1) ABORTED -       2020-07-22 10:11:47.913 -       -       -       -       -       -       -
1253    df/9e4796       9657631 quandenser_parallel_2_maracluster_parallel_2_pvalue (5) ABORTED -       2020-07-22 10:11:47.955 -       -       -       -       -       -       -
1254    16/6246a6       9657632 quandenser_parallel_2_maracluster_parallel_2_pvalue (6) ABORTED -       2020-07-22 10:11:47.989 -       -       -       -       -       -       -
1250    e7/609cec       9657633 quandenser_parallel_2_maracluster_parallel_2_pvalue (2) ABORTED -       2020-07-22 10:11:48.032 -       -       -       -       -       -       -
1256    3b/f412aa       9657634 quandenser_parallel_2_maracluster_parallel_2_pvalue (8) ABORTED -       2020-07-22 10:11:48.113 -       -       -       -       -       -       -
1255    24/ee7685       9657635 quandenser_parallel_2_maracluster_parallel_2_pvalue (7) ABORTED -       2020-07-22 10:11:48.155 -       -       -       -       -       -       -
1257    51/ca18e5       9657636 quandenser_parallel_2_maracluster_parallel_2_pvalue (9) ABORTED -       2020-07-22 10:11:48.193 -       -       -       -       -       -       -
1258    e4/765c35       9657637 quandenser_parallel_2_maracluster_parallel_2_pvalue (10)        ABORTED -       2020-07-22 10:11:48.235 -       -       -       -       -       -       -
1259    20/e221df       9657638 quandenser_parallel_2_maracluster_parallel_2_pvalue (11)        ABORTED -       2020-07-22 10:11:48.275 -       -       -       -       -       -       -
1260    ce/cd8ba4       9657639 quandenser_parallel_2_maracluster_parallel_2_pvalue (12)        ABORTED -       2020-07-22 10:11:48.336 -       -       -       -       -       -       -
1249    d0/09c159       9657630 quandenser_parallel_2_maracluster_parallel_2_pvalue (1) FAILED  -       2020-07-22 10:11:47.913 7m 4s   7m      -       -       -       -       -

MatthewThe · Jul 22 '20 08:07

In my case, the problem turned out to be that SLURM, for some unknown reason, submitted the jobs to a different queue than the one specified by the process.queue configuration variable. This caused the squeue command to not list these jobs, and Nextflow failed because it could not get their status.

Removing the process.queue configuration variable from the configuration file (forcing SLURM to use the default queue when submitting, and squeue to list all jobs) fixed things for me.
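
For reference, a minimal nextflow.config sketch of the setting in question (the queue name is a placeholder, not from my setup); the point is that whatever is passed to process.queue has to be a partition under which squeue actually reports the jobs:

    // nextflow.config -- hypothetical sketch; 'short' is a placeholder queue name
    process {
        executor = 'slurm'
        // queue = 'short'   // removing this line, or matching it to the partition
                             // the jobs actually end up in, made squeue list them again
    }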

MatthewThe · Jul 27 '20 15:07

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Dec 24 '20 18:12

Maybe not stale. I am having the same problem and have not figured out a solution.

kemin711 · Jun 15 '21 16:06

-- exitStatusReadTimeoutMillis: 270000; delta: 422931
Current queue status:
>  job: 871400: ERROR
>  job: 871398: ERROR
>  job: 871399: ERROR
>  job: 871397: ERROR
>  job: 871396: ERROR
>  job: 871394: ERROR
>  job: 871395: ERROR
>  job: 871392: ERROR
>  job: 871393: ERROR
>  job: 871391: ERROR

Here is my typical error message. I didn't have such issues before; it just suddenly appeared.

kemin711 · Jun 15 '21 16:06

This is my Nextflow version:

  N E X T F L O W
  version 20.10.0 build 5430
  created 01-11-2020 15:14 UTC (07:14 PDT)
  cite doi:10.1038/nbt.3820
  http://nextflow.io

kemin711 · Jun 15 '21 16:06

And my pipeline did eventually die, after so many errors. Usually my work directory contains only two files, .command.run and .command.sh (besides . and ..); the symlink for the input file has not been set up yet. The error message is typically:

Error executing process > 'one_process (v112895)'

Caused by: Process one_process (v112895) terminated for an unknown reason -- Likely it has been terminated by the external system

kemin711 · Jun 15 '21 17:06

It looks like the error status is coming from SLURM. If anyone has found a solution, please let me know.

kemin711 · Jun 15 '21 18:06

This issue may be difficult for the software team to reproduce on their machines, so the solution may have to come from the users. Got it, sorry; I will add to the same post.
It seems that my problem went away when I started from a new directory (a different NFS subdirectory). This does not solve the problem, but at least it is one thing to try. The only thing I can think of is that at the first launch something bad happened to my NFS and caused Nextflow to get the wrong information from SLURM, or, more likely, SLURM is sending the wrong signal. I spent a whole day trying to relaunch from the first launch directory, reading almost every line of the log and screen output, and the problem repeats. The SLURM log file jobcomp.log is also helpful. Just to be sure: my problem is due to a bad NFS, so the Nextflow developers don't need to do anything. I would say this issue is closed.

kemin711 · Jun 15 '21 18:06

Hi @kemin711 ,

If you press the three dots "..." in the top-right corner of the post, there is an option to edit your post. My general recommendation is to consolidate all information into a single post to make it easier to follow. (It will also prevent a string of e-mails to everybody who is watching this issue, every time you want to add more information.)

Best, -Artem

ArtemSokolov · Jun 15 '21 18:06

I agree GitHub issues is not a chat

pditommaso · Jun 17 '21 16:06

180216

Manuelon35 · Jul 17 '21 23:07

I have the same issue and I want to ask others whether a retry works.

I am considering:

	process {
		withName: '.*' {
			errorStrategy = 'retry'
			maxErrors     = 2
		}
	}

Update: I can confirm the above does not fix it for me.

knservis · Dec 06 '21 10:12

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Jun 11 '22 03:06

Hi, it looks like I have a similar issue: submitting SLURM jobs to a given queue results in an invalid status line: squeue: error: Invalid user: ?

And the job is reported by Nextflow as Process PrepareGenome:BWAIndexGenome (1) terminated for an unknown reason -- Likely it has been terminated by the external system, even though it's still running.

On this cluster, not setting the queue is not an option. Any ideas on how to proceed? Full log attached. Thanks! nextflow.log

apraga · Sep 07 '22 20:09

More than the queue, it looks like the problem is the lack of the user option when Nextflow runs the squeue command.

pditommaso · Sep 08 '22 07:09

I’ve tried user.name = myuser in nextflow.config, to no avail. How does System.getProperty work? The relevant code looks to be https://github.com/nextflow-io/nextflow/blob/58748225f9a2b296779f459016537bcf466dc645/src/main/groovy/nextflow/executor/SlurmExecutor.groovy#L121

apraga · Sep 08 '22 07:09

More debugging info:

  • the issue does not seem to occur on Nextflow 22.04.0.5697
  • With Groovy, println(System.getProperty('user.name')) returns my username as expected (see the sketch below)
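
A minimal Groovy sketch of what I understand to be going on (a paraphrase, not the actual SlurmExecutor source): the user appended after squeue's -u flag comes from the JVM system property user.name, which is filled in at JVM startup from the account running Nextflow; a user.name entry in nextflow.config does not change it.

    // Plain Groovy sketch (paraphrase, not the real SlurmExecutor code).
    // 'user.name' is a standard JVM system property set at startup from the OS account;
    // it could in principle be overridden with a -Duser.name=... JVM option, though
    // I have not tested that.
    def user = System.getProperty('user.name')

    // Roughly how a user filter would be appended to a queue-status command;
    // the exact flags and output format Nextflow uses may differ.
    def cmd = ['squeue', '--noheader', '-t', 'all']
    if( user )
        cmd << '-u' << user
    println cmd.join(' ')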

apraga · Sep 08 '22 21:09

After more testing:

  • building from source results in version 22.04.5.5708: works
  • installing it from nix results in version 22.04.5.5709: has the issue
  • running the executable manually for 22.04.5.5709 (nix only fetches the executable): works

So I’m a bit stumped as to why the nix package does not work... It’s basically the same file.

apraga · Sep 10 '22 18:09

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Mar 18 '23 10:03

This is still an issue with the latest Nextflow.

apraga · Jun 10 '23 08:06