flux-core
support MPMD execution
From Rich Drake, regarding SNL's SIERRA code:
Searching Sierra's launch scripts, I only see the use of these:
-n
--multi-prog
And the ":" delimiter for MPMD execution.
Of course, if binding needs to be specified, then we will use that.
Seems MPMD support is a gap... any idea how easy or difficult it would be to match this srun option?
Originally posted by @dongahn in https://github.com/flux-framework/flux-core/issues/2150#issuecomment-492781807
Other comments from that thread on SIERRA's specific use of MPMD:
We use --multi-prog on ATS-1 for all our MPMD-type coupled executions (I think all examples are two codes in a partitioned core set). If there were a way to run these without using --multi-prog, I'm sure we could make that work.
@SteVwonder:
@garlick and I just chatted briefly about that. We could probably mimic the exact behavior with a wreck plugin, but we could also leverage nested instances to achieve the co-scheduling. In the nested-instance case, we (or the user) would just need to make sure that the total amount of work submitted to the sub-instance does not exceed what the resources allocated to the instance can handle (i.e., ntasks cannot be greater than ncores). We could probably wrap that logic up in a flux multiprog command, or even add it as a flag to flux capacitor.
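As a rough sketch of that nested-instance idea (assuming the flux Python bindings' JobspecV1.from_command() and flux.job.submit() interfaces, and hypothetical ./fluid and ./structure executables), the coupled programs could be submitted for co-scheduling inside the sub-instance like so; note that, unlike --multi-prog, each submission is a separate program with its own MPI_COMM_WORLD:

# Sketch only: submit two programs into the current (nested) Flux instance.
# Run from within the sub-instance; the caller is responsible for ensuring
# the total task count does not exceed the cores allocated to the instance.
import flux
import flux.job
from flux.job import JobspecV1

h = flux.Flux()
specs = [
    JobspecV1.from_command(["./fluid"], num_tasks=8, cores_per_task=1),
    JobspecV1.from_command(["./structure"], num_tasks=24, cores_per_task=1),
]
jobids = [flux.job.submit(h, spec) for spec in specs]
print("submitted:", jobids)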
@dongahn:
Yeah, it seems like Rich is suggesting that he can make use of a general co-scheduling capability. For a one-off option like this, it would be wise to invite users like Rich to firm up our solution as a co-design. It looks like we will have to divide and conquer across the different SNL teams a bit for effective communication going forward.
@garlick:
Great info we are accumulating here. It is sort of difficult to decide which options to support. The goal is to provide a stable porting target, not an srun clone.
My suggestion is to start with the options that are supported in flux jobspec and make a super simple wrapper for the plumbing commands in master, and backport those to a 0.11 script that translates to wreckrun options.
Possibly this will help us identify some missing plumbing in master for synchronization, I/O, etc. that will be good short term work items.
In a recent concall with the Radical Pilot folks, they expressed interest in being able to specify whether the various programs are part of a single global communicator or each program has its own communicator. I'm not sure how feasible that latter bit is, but I at least wanted to document it before forgetting.
We've had another user request this support. It seems a first step could be a wrapper script that reads a "config file" and is run as the job's command, e.g. flux run OPTIONS... flux multi-prog file.conf.
Later, if users request a more integrated approach, we could add a new submission command flux mprun OPTIONS... file.conf, which could stash the conf in jobspec and set the executable to our wrapper script.
Long term, the jobspec tasks section does allow specification of multiple command lines, but it is yet to be determined how that section would be used for MPMD launches, especially if we were trying to support mapping ranks to command lines as is done in the Slurm --multi-prog configuration.
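Purely as an illustration of what that might look like (this is not a schema that jobspec V1 or flux-core defines today, and the "ranks" key is invented here), a multi-command tasks section mapping rank sets to command lines could take a shape like:

# Hypothetical, unsupported sketch of a multi-command jobspec "tasks" section;
# each entry names a command and the task ranks that would run it.
tasks = [
    {"command": ["hostname"], "slot": "task", "count": {"total": 3}, "ranks": "4-6"},
    {"command": ["echo", "task:%t"], "slot": "task", "count": {"total": 2}, "ranks": "1,7"},
]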
As an example of the wrapper script, here's a quick draft of an mprun.py script which can read the Slurm conf file syntax (including substitution of %t and %o):
#!/usr/bin/env python3
###############################################################
# Copyright 2023 Lawrence Livermore National Security, LLC
# (c.f. AUTHORS, NOTICE.LLNS, COPYING)
#
# This file is part of the Flux resource manager framework.
# For details, see https://github.com/flux-framework.
#
# SPDX-License-Identifier: LGPL-3.0
###############################################################
import os
import re
import shlex
import sys

import flux
from flux.idset import IDset


class MultiProgLine:
    """Class representing a single "multi-prog" config line"""

    def __init__(self, value, lineno=-1):
        self.ranks = IDset()
        self.all = False
        self.args = []
        self.lineno = lineno
        # shlex in posix mode handles quoting and skips "#" comment lines
        lexer = shlex.shlex(value, posix=True, punctuation_chars=True)
        lexer.whitespace_split = True
        lexer.escapedquotes = "\"'"
        try:
            args = list(lexer)
        except ValueError as exc:
            raise ValueError(f"line {lineno}: {value}: {exc}") from None
        if not args:
            return
        # The first token is the target task ranks: "*" or an idset like "0,2-3"
        targets = args.pop(0)
        if targets == "*":
            self.all = True
        else:
            self.ranks = IDset(targets)
        self.args = args

    def get_args(self, rank):
        """Return the arguments list with %t and %o substituted for `rank`"""
        result = []
        index = 0
        if not self.all:
            index = self.ranks.expand().index(rank)
        sub = {"%t": str(rank), "%o": str(index)}
        for arg in self.args:
            result.append(re.sub(r"(%t)|(%o)", lambda x: sub[x.group(0)], arg))
        return result

    def __bool__(self):
        return bool(self.args)

    def __str__(self):
        return f"{self.ranks}: {self.args}"


class MultiProg:
    """Class representing an entire "multi-prog" config file"""

    def __init__(self, inputfile):
        self.lines = []
        lineno = 0
        for line in inputfile:
            lineno += 1
            mpline = MultiProgLine(line, lineno)
            if mpline:
                self.lines.append(mpline)

    def find(self, rank):
        """Return line matching 'rank' in the current config"""
        for line in self.lines:
            if line.all or rank in line.ranks:
                return line
        raise ValueError(f"No matching line for rank {rank}")

    def exec(self, rank, dry_run=False):
        """Exec configured command line arguments for a task rank"""
        args = self.find(rank).get_args(rank)
        if dry_run:
            print(" ".join(args))
        else:
            os.execvp(args[0], args)


if __name__ == "__main__":
    with open(sys.argv[1]) as infile:
        mp = MultiProg(infile)
    try:
        rank = int(os.getenv("FLUX_TASK_RANK"))
    except (TypeError, ValueError):
        raise ValueError(
            "FLUX_TASK_RANK environment variable not found or invalid"
        ) from None
    mp.exec(rank)
This could be used as flux python mprun.py test.conf, e.g.:
$ cat test.conf
###################################################################
# srun multiple program configuration file
#
# srun -n8 -l --multi-prog silly.conf
###################################################################
4-6 hostname
1,7 echo task:%t
0,2-3 echo offset:%o
* echo all task=%t
$ flux run -n20 --label-io flux python mprun.py test.conf
16: all task=16
17: all task=17
5: pi0
4: pi0
18: all task=18
1: task:1
7: task:7
0: offset:0
19: all task=19
6: pi0
13: all task=13
12: all task=12
9: all task=9
8: all task=8
10: all task=10
14: all task=14
15: all task=15
11: all task=11
2: offset:1
3: offset:2
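To make the %t/%o substitution concrete, the standalone snippet below reproduces what get_args() does for the "0,2-3 echo offset:%o" config line when run as task rank 3: rank 3 is the third id in the idset 0,2-3, so its offset is 2, matching the "3: offset:2" output above. Only the stdlib re module is used; the rank and offset values are taken from the example run.

import re

# rank 3 matches the idset "0,2-3"; its position within that idset is 2
rank, offset = 3, 2
sub = {"%t": str(rank), "%o": str(offset)}
args = ["echo", "offset:%o"]
print([re.sub(r"(%t)|(%o)", lambda m: sub[m.group(0)], a) for a in args])
# prints: ['echo', 'offset:2']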
More context for this use case from a user:
We use the multi-prog feature to execute two different codes in the same MPI allocation; these codes communicate through MPI to couple the two codes together (for fluid-structure interaction). We have not used this capability for GPUs, but would like to be able to do that as well.
We have successfully done this coupling using srun --multi-prog, as well as the mpirun equivalent (FAQ: Running MPI jobs (open-mpi.org)).
We just had someone asking about MPMD functionality (running multiple executables in the same MPI_COMM_WORLD) on rzvernal (https://llnl.servicenowservices.com/nav_to.do?uri=incident.do?sys_id=d9e2946097f1465484bdb546f053afea). Has anything come of this issue?
No, there is no native support in jobspec V1 for MPMD, but the script above should work using a Slurm --multi-prog input file. The user can copy this script to mprun.py, then run flux submit flux python mprun.py mprun.conf.
Actually, I haven't tested that script since it was posted; let me double-check that it still works.
Edit: Just verified this workaround still seems to work.
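For anyone who wants to sanity-check a config file before submitting a job, here is a small hypothetical helper (not part of flux-core) that reuses the MultiProg class via its dry_run path. It assumes mprun.py keeps its launch code under the if __name__ == "__main__": guard as written above, and that mprun.py is on the Python path (e.g., the current directory).

# check_conf.py (hypothetical): print what each task rank would execute.
# Usage: python3 check_conf.py test.conf 8
import sys

from mprun import MultiProg

conf, ntasks = sys.argv[1], int(sys.argv[2])
with open(conf) as infile:
    mp = MultiProg(infile)
for rank in range(ntasks):
    print(f"{rank}: ", end="")
    mp.exec(rank, dry_run=True)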