
Error in nxf_kill

Open olivierlabayle opened this issue 8 months ago • 0 comments

Bug report

I have created a minimal example of a persistent error that crashes pipelines on SGE. The error is associated with the nxf_kill function that Nextflow generates in .command.run. I have attached two files, test.nf and nextflow.config, that reproduce it consistently on my cluster. The error always points to line 43 of the script:

children[$PP]+=" $P"

in

nxf_kill() {
    declare -a children
    while read P PP;do
        children[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    kill_all() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${children[$1]:=}; do kill_all $i; done
    }

    kill_all $1
}

Expected behavior and actual behavior

The workflow consists of a single process that takes 15 seconds to complete (a sleep 15 followed by the creation of a dummy file). I schedule 500 of these processes with Nextflow, using a time limit of '10s' * task.attempt. This limit should cause a retry (exit 140) on the first execution of each process, which should then complete on the second or third attempt. Instead, exit status 1 is occasionally returned, which crashes the workflow.

Steps to reproduce the problem

  • Use the latest Nextflow version, 24.04.2.
  • Copy the two provided files into the same directory.
  • Run: nextflow run test.nf

Program output (.command.log content)

Signal 12 (USR2) caught by ps (procps-ng version 3.3.10)
/var/spool/gridscheduler/execd/node2d21/job_scripts/44417159: line 43: 1 0: syntax error in expression (error token is "0")
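
The message above is consistent with bash evaluating an indexed-array subscript as an arithmetic expression. As a minimal sketch, assuming ps emitted a line with three fields (the sample line below is hypothetical, not captured output), read places the remainder of the line in PP, and the subscript "1 0" then fails to evaluate:

```bash
#!/usr/bin/env bash
# Hypothetical malformed line: three fields instead of "pid ppid".
line='4242 1 0'

# read assigns the rest of the line to the last variable, so PP becomes "1 0".
read P PP <<< "$line"
printf 'P=[%s] PP=[%s]\n' "$P" "$PP"

# Subscripts of indexed arrays (declare -a) are evaluated as arithmetic,
# and "1 0" is not a valid expression. The statement runs in a child bash
# so that the failure does not abort this demonstration script.
msg=$(bash -c 'declare -a children; PP="1 0"; P=4242; children[$PP]+=" $P"' 2>&1)
status=$?
printf 'exit status: %s\nmessage: %s\n' "$status" "$msg"
```

The child shell should report a syntax error in expression with error token "0", matching the wording in .command.log. Whether ps really produced such a line after catching USR2 is an assumption that this sketch does not verify.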

Environment

  • Nextflow version: 24.04.2
  • Java version: openjdk version "17.0.6" 2023-01-17 LTS
  • Operating system: Linux
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

Additional context

  • Files to reproduce:

nextflow_issue.zip

  • Suggestions from ChatGPT:

I asked ChatGPT about the error. Sorry if this is completely off, but I am including its suggestion in case it helps:

nxf_kill() {
    declare -A children

    while read -r P PP; do
        # Check if P and PP are integers
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done < <(ps -e -o pid= -o ppid=)

    kill_all() {
        local pid=$1
        if [[ $pid != $$ ]]; then
            kill "$pid" 2>/dev/null || true
        fi
        for child in ${children[$pid]:=}; do
            kill_all "$child"
        done
    }

    kill_all "$1"
}

  • Use declare -A to make children an associative array: subscripts are then treated as strings instead of being evaluated as arithmetic expressions.
  • Check for integer values before assigning to the array to avoid unexpected values.
  • Use local for the pid variable in the kill_all function to ensure proper scope handling.
  • Add -r option to read to prevent backslash escapes from being interpreted.

olivierlabayle, Jun 19 '24 15:06