Issues running in a Slurm cluster
I am running a Thrill application on my university's cluster using run/slurm/invoke.sh. It works, but getting it to work required resolving the following issues:
- The script slurm_hostlist.sh is supposed to expand the $SLURM_JOB_NODELIST variable to a list of host:port strings for Thrill. It uses an undocumented script called expandnodes. However, that script doesn't exist on our cluster. I fixed this by using the following sed-based solution instead:
# apply regex to convert node list to a bash range expression
NODES=`echo $SLURM_JOB_NODELIST | sed 's/\[\(.*\)-\(.*\)\]$/{\1..\2}/g'`
# expand to stdout
eval echo $NODES
- The second issue is that port numbers need to be appended. I guess that's what the map_ib0.awk script is meant to do (with hardcoded IP addresses?). However, it yields an empty list for any input I feed it. Since I'm not familiar with awk scripts at all, I didn't put any effort into debugging it. Instead, I modified invoke.sh as follows, essentially hardcoding the port number to 51000 for every host:
THRILL_HOSTLIST=""
for HOST in $(${slurm}/slurm_hostlist.sh); do
    THRILL_HOSTLIST="$THRILL_HOSTLIST $HOST:51000"
done
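The hardcoded port could also be made configurable. The following is only a minimal sketch and not part of Thrill: the function name append_port and the variable THRILL_PORT are hypothetical, and 51000 is simply the value used above.

```shell
# Hypothetical helper (not part of Thrill): turn a list of host names
# into space-separated host:port strings.
# THRILL_PORT is an assumed variable name; it defaults to 51000.
append_port() {
    port="${THRILL_PORT:-51000}"
    out=""
    for host in "$@"; do
        # ${out:+$out } adds a separating space only after the first entry
        out="${out:+$out }${host}:${port}"
    done
    printf '%s\n' "$out"
}

# Possible use inside invoke.sh (assumes slurm_hostlist.sh prints host names):
# THRILL_HOSTLIST=$(append_port $(${slurm}/slurm_hostlist.sh))
```

Unlike the loop above, this does not leave a leading space in the result, and the port can be overridden per job by exporting THRILL_PORT.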
I am opening this issue to ask whether I should create a pull request with my changes, or what modifications I should make to them first. This also boils down to the question of what exactly map_ib0.awk is supposed to do and whether it is really needed, because for me the solution above works fine without the awk script.
The problem is that each Slurm cluster seems to be set up slightly differently.
Yes, expandnodes is undocumented, but it is necessary on our cluster to expand strings like "ic1h{124-130}" and even weirder ones.
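Where the node list uses Slurm's standard bracket notation, the expansion can often be delegated to Slurm itself: scontrol show hostnames prints one host name per line. A minimal sketch, assuming scontrol is on the PATH inside the job; the wrapper name expand_nodelist is hypothetical, and nonstandard formats may still need a site-specific tool such as expandnodes:

```shell
# Hypothetical wrapper (not part of Thrill): expand a Slurm node list
# such as "node[01-03,07]" into one host name per line.
# Assumes the scontrol binary shipped with Slurm is available.
expand_nodelist() {
    # Use the first argument, or fall back to the job's node list.
    nodelist="${1:-${SLURM_JOB_NODELIST:-}}"
    if [ -z "$nodelist" ]; then
        echo "expand_nodelist: no node list given" >&2
        return 1
    fi
    scontrol show hostnames "$nodelist"
}
```

Because the output is whitespace-separated, it can be consumed by the same kind of for loop that invoke.sh uses over the result of slurm_hostlist.sh.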
The map_ib0.awk script is used to map Ethernet IPs to InfiniBand IPs.
Usually it is just easier to run Thrill programs using MPI (without any scripts).