Error in nxf_kill
Bug report
I have created a minimal example of a persistent error that crashes pipelines on SGE. The error is associated with the generated nxf_kill function in .command.run. I have attached two files that reproduce it consistently on my cluster: a test.nf file and a nextflow.config file. The error always points to line 43 of the script:
children[$PP]+=" $P"
in
nxf_kill() {
    declare -a children
    while read P PP;do
        children[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)
    kill_all() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${children[$1]:=}; do kill_all $i; done
    }
    kill_all $1
}
Expected behavior and actual behavior
The workflow consists of a single process that takes 15 seconds to complete (basically a sleep 15 and creation of a dummy file). I schedule 500 of these processes using Nextflow with a time limit of '10s' * task.attempt. With this limit, the first attempt should hit the time limit and be retried (exit 140), and the task should complete on the second or third attempt. Occasionally, however, a task exits with status 1 instead, which crashes the workflow.
Steps to reproduce the problem
- Use the latest Nextflow version, 24.04.2.
- Copy the two provided files into the same directory.
- Run: nextflow run test.nf
Program output (.command.log content)
Signal 12 (USR2) caught by ps (procps-ng version 3.3.10)
/var/spool/gridscheduler/execd/node2d21/job_scripts/44417159: line 43: 1 0: syntax error in expression (error token is "0")
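One plausible reading of this log is that ps was interrupted by the USR2 signal and at least one line reaching the read loop carried more than two fields. With read P PP, the last variable receives the rest of the line, so PP becomes "1 0", and bash evaluates the subscript of an indexed array as an arithmetic expression. The sketch below only mimics that situation: the input line and the try_indexed helper are hypothetical, since the actual ps output was not captured.

```shell
#!/usr/bin/env bash
# Hypothetical malformed line: three fields instead of the expected "pid ppid".
# It only mimics the "1 0" seen in the log, not real ps output.
bad_line="4242 1 0"

try_indexed() {
    # Run in a subshell so the arithmetic error stays contained,
    # and fold stderr into stdout so the message can be inspected.
    (
        declare -a children
        read P PP <<< "$1"       # PP receives the rest of the line
        children[$PP]+=" $P"     # indexed subscript is evaluated as arithmetic
        echo "stored:${children[$PP]}"
    ) 2>&1
}

result=$(try_indexed "$bad_line")
echo "$result"
```

A well-formed line such as "4242 1" passes through the same helper without error, which is consistent with the failure appearing only occasionally.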
Environment
- Nextflow version: 24.04.2
- Java version: openjdk version "17.0.6" 2023-01-17 LTS
- Operating system: Linux
- Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
Additional context
- Files to reproduce:
- Suggestions from ChatGPT:
I asked ChatGPT about the error. Apologies if this is off the mark, but I am including its suggestion in case it helps:
nxf_kill() {
    declare -A children
    while read -r P PP; do
        # Check if P and PP are integers
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done < <(ps -e -o pid= -o ppid=)
    kill_all() {
        local pid=$1
        if [[ $pid != $$ ]]; then
            kill "$pid" 2>/dev/null || true
        fi
        for child in ${children[$pid]:=}; do
            kill_all "$child"
        done
    }
    kill_all "$1"
}
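- Use declare -A to make children an associative array: subscripts are then treated as strings rather than evaluated as arithmetic expressions, so a malformed value no longer triggers a syntax error.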
- Check for integer values before assigning to the array to avoid unexpected values.
- Use local for the pid variable in the kill_all function to ensure proper scope handling.
- Add -r option to read to prevent backslash escapes from being interpreted.
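To see how the integer check behaves on malformed input, here is a minimal sketch that runs the same loop over canned text instead of live ps output. The build_children helper and the sample lines are hypothetical and serve only as illustration. It needs bash 4 or later for declare -A.

```shell
#!/usr/bin/env bash
# Build the parent -> children map from "pid ppid" lines on stdin,
# skipping any line that is not exactly two integers, then print
# the children recorded for the parent pid given as $1.
build_children() {
    declare -A children
    local P PP
    while read -r P PP; do
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done
    echo "${children[$1]:-}"
}

# Hypothetical sample input: two valid lines, one line with an extra
# field, and one non-numeric line.
sample=$'  101   1\n  102   1\n  103   1 0\nSignal 12 caught'
kids=$(build_children 1 <<< "$sample")
echo "children of 1:$kids"
```

Only pids 101 and 102 are recorded. The trade-off is that a process whose line is malformed is silently left out of the kill tree, which may or may not be acceptable here.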