
Issues running in a Slurm cluster

Open • pdinklag opened this issue 6 years ago • 1 comment

I am running a Thrill application on my university's cluster using run/slurm/invoke.sh. It works, but to get it working I had to resolve the following issues:

1. The script slurm_hostlist.sh is supposed to expand the $SLURM_JOB_NODELIST variable into a list of host:port strings for Thrill. It relies on an undocumented script called expandnodes, which does not exist on our cluster. I fixed this with the following sed-based replacement:

```
# rewrite a node list such as "node[01-04]" into the bash brace expression "node{01..04}"
# (handles a single range only, not lists like "node[01-03,07]")
NODES=$(echo "$SLURM_JOB_NODELIST" | sed 's/\[\(.*\)-\(.*\)\]$/{\1..\2}/g')
# let bash perform the brace expansion and print the hosts to stdout
eval echo "$NODES"
```

2. The second issue is that port numbers need to be appended. I assume this is what the map_ib0.awk script is meant to do (with hardcoded IP addresses?), but it yields an empty list for every input I feed it. Since I am not familiar with awk at all, I did not try to debug it. Instead, I modified invoke.sh as follows, which hardcodes port 51000 for every host:

```
# append a fixed port (51000) to every host printed by slurm_hostlist.sh
THRILL_HOSTLIST=""
for HOST in $("${slurm}/slurm_hostlist.sh"); do
    THRILL_HOSTLIST="$THRILL_HOSTLIST $HOST:51000"
done
```
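For comparison, here is a sketch of a somewhat more general expander than the sed one-liner in item 1. It handles comma-separated ranges inside brackets (e.g. `node[01-03,07]`) and several comma-separated entries, and it defers to `scontrol show hostnames`, Slurm's own hostlist expander, when that command is available. The function name `expand_nodelist` is hypothetical, and the pure-bash fallback assumes at most one bracket group per hostname, so it does not cover every hostlist form Slurm accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a Slurm compressed hostlist such as
# "node[01-03,07],gpu5" into one hostname per line.
# Assumption: at most one bracket group per hostname, no nesting.
expand_nodelist() {
    local list=$1
    # Prefer Slurm's own expander if it is installed.
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$list"
        return
    fi

    # Split the list on commas that are outside brackets.
    local entries=() cur="" depth=0 ch i
    for (( i = 0; i < ${#list}; i++ )); do
        ch=${list:i:1}
        case $ch in
            '[') depth=$((depth + 1)); cur+=$ch ;;
            ']') depth=$((depth - 1)); cur+=$ch ;;
            ',') if (( depth == 0 )); then entries+=("$cur"); cur=""; else cur+=$ch; fi ;;
            *)   cur+=$ch ;;
        esac
    done
    [[ -n $cur ]] && entries+=("$cur")

    # prefix[ranges]suffix, e.g. "node[01-03,07]"
    local re='^([^[]*)\[([^]]*)\](.*)$'
    local entry prefix suffix ranges range lo hi n width
    local -a parts
    for entry in "${entries[@]}"; do
        if [[ $entry =~ $re ]]; then
            prefix=${BASH_REMATCH[1]}
            ranges=${BASH_REMATCH[2]}
            suffix=${BASH_REMATCH[3]}
            IFS=',' read -ra parts <<< "$ranges"
            for range in "${parts[@]}"; do
                lo=${range%-*}     # "01-03" -> "01"; "07" stays "07"
                hi=${range#*-}     # "01-03" -> "03"; "07" stays "07"
                width=${#lo}       # keep zero padding such as "01"
                # 10# forces base 10, so "08" and "09" are not read as octal
                for (( n = 10#$lo; n <= 10#$hi; n++ )); do
                    printf '%s%0*d%s\n' "$prefix" "$width" "$n" "$suffix"
                done
            done
        else
            # plain hostname without a bracket group
            printf '%s\n' "$entry"
        fi
    done
}
```

In slurm_hostlist.sh this would be called as `expand_nodelist "$SLURM_JOB_NODELIST"`. Delegating to `scontrol` first means the hand-written parser is only a fallback for clusters where Slurm's client tools are not on the `PATH`.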

I am opening this issue to ask whether I should create a pull request with my changes, or which modifications I should make to them before doing so. This also boils down to the question of what exactly map_ib0.awk is supposed to do and whether it is really needed, because the solution above works fine for me (without the awk script).
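One possible shape for a pull request would be to turn the hardcoded 51000 into a parameter. The helper below is a hypothetical sketch (the name `make_thrill_hostlist` is not part of Thrill): it appends a given port to every host and joins the results with single spaces, avoiding the leading space the loop above produces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "host1:PORT host2:PORT ..." for all hosts given.
# Usage: make_thrill_hostlist PORT HOST...
make_thrill_hostlist() {
    local port=$1
    shift
    local out=() host
    for host in "$@"; do
        out+=("$host:$port")
    done
    # "${out[*]}" joins the array with the first character of IFS (a space here).
    local IFS=' '
    printf '%s\n' "${out[*]}"
}
```

In invoke.sh this could replace the loop as `THRILL_HOSTLIST=$(make_thrill_hostlist 51000 $("${slurm}/slurm_hostlist.sh"))`, where the inner substitution is deliberately left unquoted so that it splits into one argument per host. A single fixed port only works with one Thrill process per node, since two processes on the same host cannot listen on the same port.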

pdinklag • Sep 30 '18 10:09

The problem is that each Slurm cluster seems to be set up slightly differently. Yes, expandnodes is undocumented, but it is necessary on our cluster to expand strings like "ic1h{124-130}" and even weirder ones. The map_ib0.awk script is used to map Ethernet IPs to InfiniBand IPs. Usually it is just easier to run Thrill programs using MPI (without any scripts).
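To illustrate what such an Ethernet-to-InfiniBand mapping does, here is a hypothetical sketch. It is not a port of map_ib0.awk, whose rules are not shown here; it simply assumes a cluster where both networks share the host part of each IPv4 address and differ only in a fixed leading prefix (e.g. 10.0.x.y on Ethernet and 10.1.x.y on ib0). The function name and prefixes are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: rewrite an Ethernet IPv4 address into its InfiniBand
# counterpart, ASSUMING both networks differ only in a fixed leading prefix.
# Usage: map_to_ib ADDRESS ETH_PREFIX IB_PREFIX
map_to_ib() {
    local addr=$1 eth_prefix=$2 ib_prefix=$3
    if [[ $addr == "$eth_prefix".* ]]; then
        # strip "ETH_PREFIX." and put "IB_PREFIX." in front instead
        printf '%s\n' "$ib_prefix.${addr#"$eth_prefix".}"
    else
        # leave addresses outside the Ethernet range untouched
        printf '%s\n' "$addr"
    fi
}
```

On a real cluster the prefixes (or an entirely different rule, such as a hostname suffix) would have to come from the site's network layout, which is why such a script tends to be cluster-specific.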

bingmann • Oct 02 '18 08:10