
Launcher won't run in parallel on different cluster

Open rsbrennan opened this issue 3 years ago • 1 comments

I'm having issues using launcher on a different Slurm cluster. Everything runs on TACC with no problem.

After installing, I can successfully run simple jobs sequentially (that is, not in parallel).

I run into problems when specifying LAUNCHER_RMI=SLURM. Specifically, when I try to run jobs in parallel, the job hangs and repeatedly prints the error attached here: launcher_error.txt. The attachment shows only one instance of the error; it repeats until the job times out.

The error stems from line 308 in the paramrun file, when trying to autoretry the ssh submission of each job. The jobs are never submitted. This problem may be specific to the design of the cluster I'm using (at Michigan State Univ). I'm curious whether others have successfully used launcher elsewhere and whether there are any tips for getting things running.
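Since the failure happens at the ssh step, one way to narrow it down is to check whether non-interactive ssh to the allocated nodes works at all. The sketch below is only a diagnostic under assumptions: `check_hosts` and the `PROBE` variable are hypothetical names, not part of launcher, and nothing in this thread confirms that blocked ssh is the cause on this cluster. It uses `scontrol show hostnames` and the `SLURM_JOB_NODELIST` variable, which are available inside a Slurm allocation.

```shell
#!/bin/bash
# Hypothetical helper (not part of launcher): report whether each host
# accepts non-interactive ssh. PROBE can be overridden for a dry run.
PROBE=${PROBE:-"ssh -o BatchMode=yes -o ConnectTimeout=5"}

check_hosts() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if $PROBE "$host" true </dev/null >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Inside a Slurm allocation, expand the node list and probe each node.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    check_hosts $(scontrol show hostnames "$SLURM_JOB_NODELIST")
fi
```

Run from within the batch script (or an interactive allocation), a `FAIL` line for any node would suggest that the cluster does not allow passwordless ssh between nodes, which would be consistent with the ssh retries never succeeding.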

This isn't an issue with my job scripts as they run fine on TACC.

The job file echoes hello world, and my launcher file is below:

#!/bin/bash

#SBATCH -J ustacks_launcher
#SBATCH --mem 250M
#SBATCH -n 10
#SBATCH -N 1
#SBATCH -o test_%j.out
#SBATCH -e test_%j.err
#SBATCH -t 00:10:00

#------------------------------------------------------

export LAUNCHER_DIR=~/launcher
export LAUNCHER_WORKDIR=`pwd`
export LAUNCHER_JOB_FILE=default_work_file
export LAUNCHER_RMI=SLURM

$LAUNCHER_DIR/paramrun

rsbrennan • Jan 20 '21 21:01

Hi @rsbrennan, what happens if you just comment out line 308 and uncomment line 310? Does that fix the issue for that cluster?

AJVincelli • Mar 07 '21 17:03