
Job arrays

bentsherman opened this issue 1 year ago • 85 comments

Closes #1477 (and possibly #1427)

Summary of changes:

  • Adds array directive to submit tasks as array jobs of a given size

  • TaskArrayCollector collects tasks into arrays and submits each array job to the underlying executor when it is ready. The executor must implement the TaskArrayAware interface. Each process has its own array job collector.

    When all input channels to a process have received the "poison pill", the process is "closed" and the array job collector is notified so that it can submit any remaining tasks. All subsequent tasks (e.g. retries) will be submitted as individual tasks.

  • TaskArray is a special type of TaskRun for an array job that holds the list of child task handlers. For an executor that supports array jobs, the task handler can check whether its task is a TaskArray in order to apply array-job-specific behavior.

  • TaskHandler has a few more methods, which the array job collector uses to create the array job script. This script simply defines the list of child work directories, selects a work dir based on the index, and launches the child task using an executor-specific launch command (see the sketch after this list).

  • TaskPollingMonitor has been modified to handle both array jobs and regular tasks. The array job is handled like any other task, but then discarded once it has been submitted. The task stats reported by Nextflow are the same with or without array jobs -- array jobs are not included in the task stats.
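
For reference, here is a minimal sketch of what such an array job script looks like (SLURM shown; the work directories are placeholders, and the index variable and launch command are executor-specific):

declare -a array=( /work/aa/task1 /work/bb/task2 /work/cc/task3 )   # child task work dirs (placeholders)
task_dir=${array[$SLURM_ARRAY_TASK_ID]}                             # pick this child's work dir by array index
bash $task_dir/.command.run                                         # launch the child task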

Here's the pipeline I'm using as the e2e test:

params.n_tasks = 50

process foo {
    array 10

    input: val index
    output: path 'output.txt'

    """
    echo "Hello from task ${index}!" > output.txt
    """
}

process bar {
    debug true
    array 10

    input: path 'input.txt'

    """
    cat input.txt
    """
}

workflow {
    Channel.of(1 .. params.n_tasks) | foo | bar
}

TODO:

  • [x] documentation
  • [x] unit tests
  • extra
    • [x] AWS Batch: kill array job instead of child jobs
    • [x] Google Batch: kill array job instead of child jobs
    • [x] Grid executors: add array index environment var to container
    • [x] handle retried tasks with dynamic resources
  • manual e2e tests
    • [x] SLURM
    • [x] SLURM + Fusion
    • [x] AWS Batch
    • [x] AWS Batch + Fusion
    • [x] Google Batch
    • [x] Google Batch + Fusion

bentsherman avatar Apr 21 '23 15:04 bentsherman

  • The array task handler sets the task field to the first task in the array. In other words, it assumes that all tasks in the array have the same basic configuration (process, queue, resource requirements, etc). This is a requirement for array jobs anyway, for example with grid executors the cluster options in the job script must be the same for all tasks in the array job. I will add a caveat to the docs, but users who are familiar with array jobs will understand this already.

bentsherman avatar Apr 21 '23 15:04 bentsherman

~Currently, if the array executor submits a partial batch at the end, and some of those tasks fail and need to be retried, the array executor will hang because it has already received the onProcessClose event. I think we can address this issue by making the array executor submit any such "straggler" tasks individually.~

UPDATE: fixed

bentsherman avatar Apr 21 '23 17:04 bentsherman

If a task fails and is retried with increased resources, it will be batched with other tasks that may still be on their first attempt. In that case, the array job resources will depend on whichever task happens to be first in the batch.

One solution is to take the max value of cpus, memory, time, etc for all tasks in an array job. That would be "safe" but likely much more expensive -- if a single task requests twice the resources, suddenly the entire array job does as well.

Another solution is to further separate batches by configuration, to ensure that they are uniform. We could go crazy and separate batches by the tuple of (cpus, memory, time, ...), but I think that would be overkill. I think it's better to just split based on attempt and tell users to "handle with care".

bentsherman avatar Apr 21 '23 18:04 bentsherman

We could also just provide config options for these things:

  • executor.$array.groupKeys (default: ['process', 'attempt']) controls how batches are separated

  • executor.$array.requestMaxResources controls whether the array executor "plays it safe" by taking the max resources across all tasks in an array

bentsherman avatar Apr 21 '23 18:04 bentsherman

The point of these config options is that there is a trade-off between bandwidth and latency when batching tasks like this, so users should ideally have the ability to manage that trade-off in a way that best fits their use case. If someone doesn't use retry with dynamic resources, then they don't need to group by attempt, and vice versa.

bentsherman avatar Apr 21 '23 18:04 bentsherman

  • ~We may need to fetch the child job ids in order to check their status, in cases where the job id doesn't have its own status.~

UPDATE: looks like the child job id can be derived from the array job id and index
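
For example, with SLURM the child job id is simply the array job id plus the task index, so a child's status can be queried directly (job ids below are illustrative):

squeue -j 12345_3    # status of child task 3 of array job 12345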

bentsherman avatar Apr 21 '23 19:04 bentsherman

With the latest commit I was able to run a test pipeline with SLURM array jobs 🎉

Now I will extend support to other executors, although I won't be able to test most of them myself.

bentsherman avatar Apr 24 '23 19:04 bentsherman

Awesome 🎉. Give me a chance to review it before entering the next phase

pditommaso avatar Apr 24 '23 19:04 pditommaso

Awesome - I'd be happy to help test this one with PBS (the uni cluster here)

Let me know if there's a pipeline you're using for testing this at the user level.

abhi18av avatar Apr 25 '23 06:04 abhi18av

Thanks @abhi18av , that would be great. We're going to do some refactoring and then I'll ping you when the PR is closer to a final draft.

bentsherman avatar Apr 25 '23 13:04 bentsherman

@pditommaso While array jobs are highly requested, there is further demand for "true" task grouping in which a group of tasks are executed on the same resources, especially in the cloud where VM startup time is expensive.

I'm fine with supporting array jobs through an array process directive, but in the future we will need to have another directive for task batching, e.g. group or batch. In any case, I think array should be clear enough for array jobs, regardless of what we call the task grouping feature.

bentsherman avatar Apr 25 '23 16:04 bentsherman

The task grouping would probably work similarly to this PR -- instead of calling each .command.run in an array, you would create a single .command.run and call each .command.sh in an array. That way the job could run on a single node.
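
For illustration, a hypothetical sketch of such a grouped launcher (the work directories are placeholders; this is not an actual implementation):

declare -a array=( /work/aa/task1 /work/bb/task2 /work/cc/task3 )   # child task work dirs (placeholders)
for task_dir in "${array[@]}"; do
    bash "$task_dir/.command.sh"    # run each child script in turn on the same node
done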

bentsherman avatar Apr 25 '23 16:04 bentsherman

Indeed, the main difference with grouping will be:

  1. the group launcher will take care of running all child tasks sequentially, instead of delegating to the grid
  2. the status check will only be done via the group job, whereas with arrays each individual child task can be checked

pditommaso avatar Apr 25 '23 16:04 pditommaso

Okay, I have refactored the array executor into a process directive. So, compared to the original iteration:

  • The ArrayExecutor logic now resides in ArrayTaskCollector. The TaskProcessor will create a collector and submit jobs to it if the array directive is defined.

  • The ArrayTaskHandler logic now resides in ArrayBatchSubmitter. It simply holds the array tasks and implements the array job submission. The array collector submits an array job by creating an array submitter with a list of task handlers. The submitter will pass itself to each task handler.

  • Tasks are managed by the task monitor exactly as before. When a task is submitted, it will defer to the array submitter if it has one. The wrapper script creation is now separate from the submission for this reason. The array submitter simply waits for all of its tasks to be "submitted" by the monitor, then it submits the array job.

  • The grid executor support is basically unchanged. The GridArrayTaskHandler logic now resides in GridArrayTaskSubmitter.

Still need to test SLURM more thoroughly, and extend to other executors. Since I'm not subclassing the task monitor anymore, supporting AWS Batch should be straightforward.

bentsherman avatar Apr 25 '23 20:04 bentsherman

@abhi18av I think this PR is ready for you to test on your cluster. Check the changes to PbsExecutor and make sure they look correct. If you run Nextflow with trace logging, you should be able to see the array job scripts that are piped to qsub.

bentsherman avatar Apr 27 '23 20:04 bentsherman

I've tried this and my run fails

ERROR ~ Error executing process > 'sayHello (1)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  cat << 'LAUNCH_COMMAND_EOF' | sbatch
  #!/bin/bash
  #SBATCH --array 0-3
  #SBATCH -J nf-sayHello_(3)
  #SBATCH -o /Users/pditommaso/Projects/nextflow/work/47/9c8ef025c1284c76eb4474dcd0df92/.command.log
  #SBATCH --no-requeue
  #SBATCH --signal B:USR2@30
  declare -a array=( /Users/pditommaso/Projects/nextflow/work/47/9c8ef025c1284c76eb4474dcd0df92 /Users/pditommaso/Projects/nextflow/work/13/52518a7204908c6e010476f9b88ceb /Users/pditommaso/Projects/nextflow/work/0c/6086f8a39fbe13976e6afa80f02d5e /Users/pditommaso/Projects/nextflow/work/60/8a1642462766ff39efed7136262cda )
  task_dir=${array[$SLURM_ARRAY_TASK_ID]}
  bash $task_dir/.command.run
  LAUNCH_COMMAND_EOF

Command exit status:
  1

Command output:
  sbatch: error: Batch script is empty!

Apart from this, I have some concerns about going ahead with this approach; see the discussion on https://github.com/nextflow-io/nextflow/pull/3905

pditommaso avatar May 03 '23 08:05 pditommaso

I will debug the SLURM error. I never really got the local SLURM cluster to work properly, so I need to use the cloud-based instance. In the meantime, please take a look at the latest changes, as I have significantly reduced the amount of code duplication by abstracting the job submission logic into a separate trait.

bentsherman avatar May 03 '23 15:05 bentsherman

Taking this forward on the PBS cluster now.

abhi18av avatar May 05 '23 08:05 abhi18av

Summary of recent changes:

  • now there is only one TaskArraySubmitter that can submit an array job with any executor (that supports array jobs)
  • the array submitter creates a TaskArray which is just a subclass of TaskRun that also contains the list of task handlers for the array
  • the array submitter only calls submit() on the TaskArray, it doesn't submit it to the executor like other tasks
  • Executors and task handlers can check if the task is actually a TaskArray in order to add array-specific behavior

bentsherman avatar May 05 '23 13:05 bentsherman

Sorry mate, it took me a while to get the infra set up for testing this PR. Unfortunately I wasn't able to test this on the PBS cluster at CHPC due to the electricity disruptions AKA "load-shedding" and weird login behaviour there 😞

But, I do have access to a commercial SLURM cluster which I used to test the PR as of https://github.com/nextflow-io/nextflow/pull/3892/commits/aaaa35d5821cc648fa85908e16d79e78e56c0c22

Test results

At this point, it seems the recent refactor leads to an error about a missing method implementation on the TaskArray class.

  • I relied on the following command for testing the build produced via make pack
nextflow-23.04.0-all-dev run https://github.com/pditommaso/nf-sleep  -c custom.config --times 10 --timeout 5 --forks 10
  • Contents of custom.config
process {
	queue = "$QUEUE_NAME"
	array = 4
}

  • Here's the main error (full stack trace also added)
executor >  slurm (3)
[0f/7ce3ba] process > foo (4) [ 14%] 1 of 7, failed: 1
ERROR ~ Unknown method invocation `getCondaEnv0` on TaskArray type -- Did you mean?
  getCondaEnv

 -- Check '.nextflow.log' file for details


Full Java stack-trace:
May-07 12:38:29.509 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'foo (3)'

Caused by:
  No signature of method: nextflow.processor.TaskArray.getCondaEnv0() is applicable for argument types: () values: []
Possible solutions: getCondaEnv(), getContainer()

groovy.lang.MissingMethodException: No signature of method: nextflow.processor.TaskArray.getCondaEnv0() is applicable for argument types: () values: []
Possible solutions: getCondaEnv(), getContainer()
	at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:70)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:80)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:176)
	at nextflow.processor.TaskRun$_getCondaEnv_closure5.doCall(TaskRun.groovy:589)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
	at groovy.lang.Closure.call(Closure.java:412)
	at org.codehaus.groovy.runtime.ConvertedClosure.invokeCustom(ConvertedClosure.java:50)
	at org.codehaus.groovy.runtime.ConversionHandler.invoke(ConversionHandler.java:112)
	at jdk.proxy1/jdk.proxy1.$Proxy32.apply(Unknown Source)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1708)
	at java.base/jdk.internal.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:48)
	at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrap.invoke(PojoMetaMethodSite.java:198)
	at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:51)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:148)
	at nextflow.processor.TaskRun.getCondaEnv(TaskRun.groovy:589)
	at nextflow.processor.TaskBean.<init>(TaskBean.groovy:127)
	at nextflow.executor.BashWrapperBuilder.<init>(BashWrapperBuilder.groovy:121)
	at nextflow.executor.AbstractGridExecutor.createBashWrapperBuilder(AbstractGridExecutor.groovy:78)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:48)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:189)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.call(PogoMetaMethodSite.java:69)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.call(PogoMetaMethodSite.java:74)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
	at nextflow.executor.GridTaskHandler.createTaskWrapper(GridTaskHandler.groovy:211)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:48)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:189)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:57)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:51)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:62)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:185)
	at nextflow.executor.GridTaskHandler.prepareLauncher(GridTaskHandler.groovy:250)
	at nextflow.executor.TaskArraySubmitter.submit(TaskArraySubmitter.groovy:98)
	at nextflow.executor.TaskArraySubmitter.collect(TaskArraySubmitter.groovy:63)
	at nextflow.executor.TaskArraySubmitter$collect.call(Unknown Source)
	at nextflow.executor.GridTaskHandler.submit(GridTaskHandler.groovy:259)
	at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:197)
	at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:563)
	at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:388)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1258)
	at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1047)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1132)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
	at groovy.lang.Closure.call(Closure.java:412)
	at groovy.lang.Closure.call(Closure.java:406)
	at groovy.lang.Closure.run(Closure.java:493)
	at java.base/java.lang.Thread.run(Thread.java:833)

Questions

To me it seems that the current implementation relies on 1-step increments, whereas SLURM and PBS both support the following options for --array

--array=1-7:2 

--array=1,3,5,7

Do you think it'd be worth extending this in the future for these use cases, allowing users to simply provide these options as part of the clusterOptions directive?

abhi18av avatar May 07 '23 17:05 abhi18av

Whoops, forgot to add some changes from another PR. Thanks for testing Abhinav. You're welcome to keep trying, but if you can only access a SLURM cluster then it's not crucial because we also have a SLURM environment.

I saw that you could specify a step in the array index but I don't see how it would be useful here. The array job script just creates a list of tasks and launches a task at a given index.

bentsherman avatar May 08 '23 09:05 bentsherman

Ah, okay - the PBS cluster seems to be up and running again, will take it for another spin later today 👍

abhi18av avatar May 08 '23 12:05 abhi18av

Works on AWS Batch and SLURM ✅

Fusion still needs to be tested more thoroughly.

bentsherman avatar May 08 '23 14:05 bentsherman

Hey Ben,

Could you please share the exact commands you're using for these tests?

I feel I'm missing something crucial here, but when I follow the doc entry and just add process.array = 2 to the custom.conf file and run the following command

nextflow-23.04.0-all-dev -c custom.conf config https://github.com/pditommaso/nf-sleep  -profile chpc,smpq

The pipeline runs fine but I don't see the .command.run script containing the relevant -J 0-1 entry 🤔

  • Here's what the .command.run header looks like

#!/bin/bash
#PBS -P XYZ
#PBS -N nf-foo_2
#PBS -o /home/asharma1/_scratch/work/22/1ca28f67fba4863cc182160f8326a9/.command.log
#PBS -j oe
#PBS -q smp
#PBS -l select=1:ncpus=24
#PBS -l walltime=96:00:00
NXF_CHDIR=/home/asharma1/_scratch/work/22/1ca28f67fba4863cc182160f8326a9
# NEXTFLOW TASK: foo (2)

  • Here's the resolved config
nextflow-23.04.0-all-dev -c custom.conf config https://github.com/pditommaso/nf-sleep  -profile chpc,smpq
process {
   executor = 'pbspro'
   clusterOptions = '-P XYZ'
   queue = 'smp'
   cpus = 24
   time = '4d'
   container = 'quay.io/nextflow/bash'
   array = 2
}

executor {
   queueSize = 10
}

abhi18av avatar May 08 '23 16:05 abhi18av

In the latest commits I added a log trace message for each array job that is submitted, including the array job work directory. Try again with the latest commit, look for the array job lines (the class is TaskArraySubmitter), go to any array job work directory, and you should see a .command.run and .command.sh with the array job script.

bentsherman avatar May 08 '23 16:05 bentsherman

I added array job support to Google Batch. It works with the hello pipeline, but not a pipeline with input/output files. Haven't figured out the root cause yet.

bentsherman avatar May 26 '23 17:05 bentsherman

Okay, I finally fixed the issues with Google Batch. And as a bonus, Google Batch + Fusion also works, no more hanging 😄

So now I just need to go back and test SLURM...

bentsherman avatar May 31 '23 16:05 bentsherman

Okay, array jobs are working across SLURM, AWS Batch, and Google Batch, with and without Fusion. I updated the first comment with notes, todos, and my e2e pipeline.

For grid executors, we need to add the array index environment var to the container. Not sure the best way to do it. For now I made it work by adding it to docker.envWhitelist in my config.
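
For reference, the effect of whitelisting the variable is roughly to forward the host value into the task container, along these lines (illustrative only, not the exact command Nextflow generates; the image and mounts are placeholders):

docker run --rm -e SLURM_ARRAY_TASK_ID -v "$PWD:$PWD" -w "$PWD" quay.io/nextflow/bash bash .command.sh   # -e VAR with no value passes the host's value through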

I would also like to find a way to deal with tasks with dynamic resources based on the attempt. Because it is a common practice, we should try to accommodate it in the array job design. I think we should either:

  1. separate arrays by attempt
  2. submit all tasks with attempt > 1 directly instead of through array jobs

I'm leaning towards (2).

bentsherman avatar May 31 '23 20:05 bentsherman

Hi @bentsherman. I'm trying to test this on LSF but I can't make it work.

I'm using e2e.nf as per https://github.com/nextflow-io/nextflow/pull/3892#issue-1678734100 and a lsf.config that is suitable for my cluster:

process {
    executor = 'lsf'
    queue = 'normal'
    time = 1.h
    memory = 4.GB
    cpus = 1
}

executor {
    name = 'lsf'
    perJobMemLimit = true
    poolSize = 4
    submitRateLimit = '5 sec'
    killBatchSize = 50
}

I compiled the code as per https://github.com/nextflow-io/nextflow/issues/1477#issuecomment-1549731500 and the command is ./launch.sh run e2e.nf -c lsf.config

First of all, if I comment out the array directives, the pipeline runs without job arrays. I can see the LSF jobs being spawned and running.

1. With the array directive in the bar process only.

[cd/0676d9] process > foo (50) [100%] 50 of 50 ✔
[d0/d625b7] process > bar (50) [100%] 5 of 5

All the foo jobs run normally (non-array), but Nextflow seems confused about the bar processes. It only counts 5 jobs instead of 50. 5 jobs are submitted to LSF:

3141364 mm49    RUN   normal     tol-1-8-4   tol-1-3-4   nf-bar_(1) Jun 10 16:34
3141365 mm49    RUN   normal     tol-1-8-4   tol-1-3-4   nf-bar_(11) Jun 10 16:34
3141366 mm49    RUN   normal     tol-1-8-4   tol-1-3-4   nf-bar_(21) Jun 10 16:34
3141367 mm49    RUN   normal     tol-1-8-4   tol-1-3-4   nf-bar_(31) Jun 10 16:34
3141368 mm49    RUN   normal     tol-1-8-4   tol-1-3-4   nf-bar_(41) Jun 10 16:34

I can see in the .command.* files that each bar job is meant to be an array that encompasses 10 actual jobs, but none is an actual array. Nextflow hangs, and when I ctrl-c it, I can see this in .nextflow.log:

- cmd executed: bkill 3141364[1] 3141364[2] 3141364[3] 3141364[4] 3141364[5] 3141364[6] 3141364[7] 3141364[8] 3141364[9] 3141365[1] 3141365[2] 3141365[3] 314136
- exit status : 255
- output      :
  Job <3141364[1-9:1]>: No matching job found
  Job <3141365[1-9:1]>: No matching job found
  Job <3141366[1-9:1]>: No matching job found
  Job <3141367[1-9:1]>: No matching job found
  Job <3141368[1-9:1]>: No matching job found

Perhaps [1-9:1] is missing from the submission command?

2. With the array directive in the foo process only.

[df/b619be] process > foo (41) [ 10%] 5 of 50
[-        ] process > bar      -

5 jobs are submitted, but again they're not arrays; they also complete almost immediately, and then Nextflow hangs until I ctrl-c it. The log shows the same array-aware bkill command and the same "No matching job found" error messages.

muffato avatar Jun 10 '23 15:06 muffato

Taking a closer look at the LSF documentation, it looks like the index range cannot start at 0 🤦

Valid values are unique positive integers.

But the main issue is actually that LSF uses the same CLI option for the job name and the index range, so I need to handle that logic properly.
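
For reference, a minimal sketch of what the LSF header would need to look like (job name, index range, and log path are illustrative):

#BSUB -J "nf-bar[1-10]"                    # job name and index range share the -J option; indices start at 1
#BSUB -o /path/to/workdir/.command.log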

I will make some adjustments...

bentsherman avatar Jun 13 '23 20:06 bentsherman