flux-core
support MPMD execution
From Rich Drake, regarding SNL's SIERRA code:
Searching Sierra's launch scripts, I only see the use of these:
-n
--multi-prog
And the ":" delimiter for MPMD execution.
Of course, if binding needs to be specified, then we will use that.
Seems MPMD support is a gap... any idea how easy or difficult it would be to match this srun option?
Originally posted by @dongahn in https://github.com/flux-framework/flux-core/issues/2150#issuecomment-492781807
Other comments from that thread on SIERRA's specific use of MPMD:
We use --multi-prog on ATS-1 for all our MPMD-type coupled executions (I think all examples are two codes in a partitioned core set). If there were a way to run these without using --multi-prog, I'm sure we could make that work.
@SteVwonder:
@garlick and I just chatted briefly about that. We could probably mimic the exact behavior with a wreck plugin, but we could also leverage nested instances to achieve the co-scheduling. In the nested-instance case, we (or the user) would just need to make sure that the total amount of work submitted to the sub-instance does not exceed what the resources allocated to the instance can handle (i.e., ntasks cannot be greater than ncores). We could probably wrap that logic up in a flux multiprog command, or even add it as a flag to flux capacitor.
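As a rough sketch of that nested-instance idea (assuming the flux Python bindings' JobspecV1.from_command() and flux.job.submit() interfaces, and hypothetical ./fluid and ./structure executables), the coupled programs could be submitted for co-scheduling inside the sub-instance like so; note that, unlike --multi-prog, each submission is a separate program with its own MPI_COMM_WORLD:

# Sketch only: submit two programs into the current (nested) Flux instance.
# Run from within the sub-instance; the caller is responsible for ensuring
# the total task count does not exceed the cores allocated to the instance.
import flux
import flux.job
from flux.job import JobspecV1

h = flux.Flux()
specs = [
    JobspecV1.from_command(["./fluid"], num_tasks=8, cores_per_task=1),
    JobspecV1.from_command(["./structure"], num_tasks=24, cores_per_task=1),
]
jobids = [flux.job.submit(h, spec) for spec in specs]
print("submitted:", jobids)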
@dongahn:
Yeah, it seems like Rich is suggesting that he can make use of a general co-scheduling capability. For a one-off option like this, it would be wise to invite users like Rich to firm up our solution as a co-design. It looks like we will have to divide and conquer across the different SNL teams a bit for effective communication going forward.
@garlick:
Great info we are accumulating here. It is sort of difficult to decide which options to support. The goal is to provide a stable porting target, not an srun clone.
My suggestion is to start with the options that are supported in flux jobspec and make a super simple wrapper for the plumbing commands in master, and backport those to a 0.11 script that translates to wreckrun options.
Possibly this will help us identify some missing plumbing in master for synchronization, I/O, etc. that will be good short term work items.
In a recent concall with the Radical Pilot folks, they expressed interest in being able to specify whether the various programs are part of a single global communicator or each program has its own communicator. I'm not sure how feasible that latter bit is, but I at least wanted to document it before forgetting.
We've had another user request this support. It seems a first step could be a wrapper script that reads a "config file" and is run as the job's command, e.g. flux run OPTIONS... flux multi-prog file.conf.
Later, if users request a more integrated approach, we could add a new submission command flux mprun OPTIONS... file.conf, which could stash the conf in jobspec and set the executable to our wrapper script.
Long term, the jobspec tasks section does allow specification of multiple command lines, but it is yet to be determined how that section would be used for MPMD launches, especially if we were trying to support mapping ranks to command lines as is done in the Slurm --multi-prog configuration.
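Purely as an illustration of what that might look like (this is not a schema that jobspec V1 or flux-core defines today, and the "ranks" key is invented here), a multi-command tasks section mapping rank sets to command lines could take a shape like:

# Hypothetical, unsupported sketch of a multi-command jobspec "tasks" section;
# each entry names a command and the task ranks that would run it.
tasks = [
    {"command": ["hostname"], "slot": "task", "count": {"total": 3}, "ranks": "4-6"},
    {"command": ["echo", "task:%t"], "slot": "task", "count": {"total": 2}, "ranks": "1,7"},
]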
As an example of the wrapper script, here's a quick draft of an mprun.py script which can read the Slurm conf file syntax (including substitution of %t and %o):
#!/usr/bin/env python3
###############################################################
# Copyright 2023 Lawrence Livermore National Security, LLC
# (c.f. AUTHORS, NOTICE.LLNS, COPYING)
#
# This file is part of the Flux resource manager framework.
# For details, see https://github.com/flux-framework.
#
# SPDX-License-Identifier: LGPL-3.0
###############################################################
import os
import re
import shlex
import sys

import flux
from flux.idset import IDset


class MultiProgLine:
    """Class representing a single "multi-prog" config line"""

    def __init__(self, value, lineno=-1):
        self.ranks = IDset()
        self.all = False
        self.args = []
        self.lineno = lineno
        # shlex in posix mode handles quoting and skips "#" comment lines
        lexer = shlex.shlex(value, posix=True, punctuation_chars=True)
        lexer.whitespace_split = True
        lexer.escapedquotes = "\"'"
        try:
            args = list(lexer)
        except ValueError as exc:
            raise ValueError(f"line {lineno}: {value}: {exc}") from None
        if not args:
            return
        # The first token is the target task ranks: "*" or an idset like "0,2-3"
        targets = args.pop(0)
        if targets == "*":
            self.all = True
        else:
            self.ranks = IDset(targets)
        self.args = args

    def get_args(self, rank):
        """Return the arguments list with %t and %o substituted for `rank`"""
        result = []
        index = 0
        if not self.all:
            index = self.ranks.expand().index(rank)
        sub = {"%t": str(rank), "%o": str(index)}
        for arg in self.args:
            result.append(re.sub(r"(%t)|(%o)", lambda x: sub[x.group(0)], arg))
        return result

    def __bool__(self):
        return bool(self.args)

    def __str__(self):
        return f"{self.ranks}: {self.args}"


class MultiProg:
    """Class representing an entire "multi-prog" config file"""

    def __init__(self, inputfile):
        self.lines = []
        lineno = 0
        for line in inputfile:
            lineno += 1
            mpline = MultiProgLine(line, lineno)
            if mpline:
                self.lines.append(mpline)

    def find(self, rank):
        """Return line matching 'rank' in the current config"""
        for line in self.lines:
            if line.all or rank in line.ranks:
                return line
        raise ValueError(f"No matching line for rank {rank}")

    def exec(self, rank, dry_run=False):
        """Exec configured command line arguments for a task rank"""
        args = self.find(rank).get_args(rank)
        if dry_run:
            print(" ".join(args))
        else:
            os.execvp(args[0], args)


if __name__ == "__main__":
    with open(sys.argv[1]) as infile:
        mp = MultiProg(infile)
    try:
        rank = int(os.getenv("FLUX_TASK_RANK"))
    except (TypeError, ValueError):
        raise ValueError(
            "FLUX_TASK_RANK environment variable not found or invalid"
        ) from None
    mp.exec(rank)
This could be used as flux python mprun.py test.conf, e.g.:
$ cat test.conf
###################################################################
# srun multiple program configuration file
#
# srun -n8 -l --multi-prog silly.conf
###################################################################
4-6 hostname
1,7 echo task:%t
0,2-3 echo offset:%o
* echo all task=%t
$ flux run -n20 --label-io flux python mprun.py test.conf
16: all task=16
17: all task=17
5: pi0
4: pi0
18: all task=18
1: task:1
7: task:7
0: offset:0
19: all task=19
6: pi0
13: all task=13
12: all task=12
9: all task=9
8: all task=8
10: all task=10
14: all task=14
15: all task=15
11: all task=11
2: offset:1
3: offset:2
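To make the %t/%o substitution concrete, the standalone snippet below reproduces what get_args() does for the "0,2-3 echo offset:%o" config line when run as task rank 3: rank 3 is the third id in the idset 0,2-3, so its offset is 2, matching the "3: offset:2" output above. Only the stdlib re module is used; the rank and offset values are taken from the example run.

import re

# rank 3 matches the idset "0,2-3"; its position within that idset is 2
rank, offset = 3, 2
sub = {"%t": str(rank), "%o": str(offset)}
args = ["echo", "offset:%o"]
print([re.sub(r"(%t)|(%o)", lambda m: sub[m.group(0)], a) for a in args])
# prints: ['echo', 'offset:2']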
More context for this use case from a user:
We use the multi-prog feature to execute two different codes in the same MPI allocation; these codes communicate through MPI to couple the two codes together (for fluid-structure interaction). We have not used this capability for GPUs, but would like to be able to do that as well.
We have successfully done this coupling using srun --multi-prog, as well as the mpirun equivalent (FAQ: Running MPI jobs (open-mpi.org)).
We just had someone asking about MPMD functionality (running multiple executables in the same MPI_COMM_WORLD) on rzvernal (https://llnl.servicenowservices.com/nav_to.do?uri=incident.do?sys_id=d9e2946097f1465484bdb546f053afea). Has anything come of this issue?
No, there is no native support in jobspec V1 for MPMD, but the script above should work using a Slurm --multi-prog input file. The user can copy this script to mprun.py, then run flux submit flux python mprun.py mprun.conf.
Actually, I haven't tested that script since it was posted; let me double-check that it still works.
Edit: Just verified this workaround still seems to work.
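For anyone who wants to sanity-check a config file before submitting a job, here is a small hypothetical helper (not part of flux-core) that reuses the MultiProg class via its dry_run path. It assumes mprun.py keeps its launch code under the if __name__ == "__main__": guard as written above, and that mprun.py is on the Python path (e.g., the current directory).

# check_conf.py (hypothetical): print what each task rank would execute.
# Usage: python3 check_conf.py test.conf 8
import sys

from mprun import MultiProg

conf, ntasks = sys.argv[1], int(sys.argv[2])
with open(conf) as infile:
    mp = MultiProg(infile)
for rank in range(ntasks):
    print(f"{rank}: ", end="")
    mp.exec(rank, dry_run=True)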