flux-core
design shell plugin for file broadcast
Following up on a coffee time discussion with @jameshcorbett
There is a need for a C API for copying file(s) to a job. (What other requirements are there?)
Some notes from the discussion
- a shell plugin could implement a distributed file copy service
- plugins can register job-specific service "methods"
- plugins can also get plugin-specific options from the jobspec attributes (which can be set on the `flux mini` command line)
- plugins can address RPCs to shell ranks within the job
- see the PMI plugin for an example of a virtual TBON (in this case to gather PMI keys)
- the job specific service name can be found in the job eventlog (see MPIR plugin for an example)
- files should be sent in chunks (say 4K) to avoid head of line blocking in the broker
- it may be handy to use the per-job tmpdir established by the tmpdir job plugin (FLUX_JOB_TMPDIR environment variable should be set in the job)
Additional thoughts:
- `src/common/libutil/kary.h` provides some helper functions for determining virtual TBON peers, etc.
- Use a streaming RPC to send the chunks to the destination.
- Note that the shells are only running while the job is running, so this wouldn't work for stage-in while the job is still pending
- The service could actually be the same on each shell rank, with the semantics of "copy to local file and TBON subtree".
A couple more thoughts:
- Need to handle open errors on remote files (e.g. file exists)
- Need to handle write errors to remote files (e.g. file system full).
- Errors should be propagated back to the API
- A `flux_future_t`-based interface is desirable to enable reactive programming
- May want a way to pass open flags into the API (e.g. O_TRUNC)
- API should have a way to select a subset of shell ranks as targets
- API should have a way to select any destination path (e.g. escape FLUX_JOB_TMPDIR if desired)
- Idea: option to let service select the destination path and set a configurable environment variable to point to it
I hate to even mention this because I really like the idea of having this capability in Flux, but I think this problem could also be solved with mpifileutils or similar with "alloc bypass" from #3740, copying R from the target job. Two advantages of that approach are that 1) it uses RDMA, and 2) it is portable.
Following up on coffee call. It doesn't look like mpifileutils dbcast works like I thought it did. It appears to read stripes of a file from all ranks of a parallel job, not one rank. This is what @jameshcorbett was saying and I just wasn't getting it. Sorry about that!
Just something I was thinking about: Slurm's sbcast works on both job IDs and step IDs. The proposed implementation, as a job shell plugin, would work only on jobs and wouldn't have a way to broadcast a file to every node in a Flux instance. But you could go up a level in the Flux hierarchy and broadcast the file at that level, to the job that is the sub-instance. There wouldn't be a way to broadcast a file across a top-level Flux instance but I can't think of any use-cases for that.
To replicate the sbcast example in Flux:
```
$ cat my.job
#!/bin/bash
sbcast my.prog /tmp/my.prog
srun /tmp/my.prog
$ sbatch --nodes=8 my.job
srun: jobid 12345 submitted
```
You would need to be able to get the job ID of the encapsulating instance and the URI of the system instance. A little awkward, maybe...
> The proposed implementation, as a job shell plugin, would work only on jobs and wouldn't have a way to broadcast a file to every node in a Flux instance
Most Flux instances are also jobs, but I think I understand what you are saying here: You can't broadcast a file to all nodes of your single-user enclosing instance (i.e. in most cases, your batch job) from within the instance.
> You would need to be able to get the job ID of the encapsulating instance and the URI of the system instance. A little awkward, maybe...
Actually, this may not be too bad. Within an instance started under Flux, the environment variable FLUX_JOB_ID will be set to the jobid of the current instance. The flux(1) command driver also has a --parent option which uses the URI of the parent instead of the current instance. So, if flux bcast is the command to broadcast a file to all nodes of a job, your batch script could use:
```
flux --parent bcast $FLUX_JOB_ID /tmp/my.prog
```
Better yet, if a JOBID isn't provided with flux bcast, maybe the utility could assume it is meant to run against the current job: automatically use the current FLUX_JOB_ID and grab the parent-uri from the enclosing instance, so it works similarly to sbcast:
```
flux bcast /tmp/my.prog
```
> Most Flux instances are also jobs, but I think I understand what you are saying here.
The problem with all this infinitely hierarchical stuff is that it makes everything hard to talk about :(
> Actually, this may not be too bad.
Great, I figured that there would be good ways of talking to the parent instance, but I didn't know what they were (or if they had already been implemented). I also really like your idea of the missing JOBID assumption.
> Better yet, if a JOBID isn't provided with flux bcast, maybe the utility could assume it is meant to run against the current job and will automatically use the current FLUX_JOB_ID and grab the parent-uri from the enclosing instance.
I like the idea of looking for FLUX_JOB_ID, but grabbing the parent-uri does have one drawback. If you executed something like flux mini run ... bash -c "flux bcast /tmp/my.prog; /tmp/my.prog", FLUX_JOB_ID would be set to the bash job (rather than the job ID of the current instance), so the combination of parent-uri and FLUX_JOB_ID would be all wrong.
Just a potential trade-off to be aware of. I can't think of any possible confusions from letting the job ID be implicit, though.
> If you executed something like flux mini run ... bash -c "flux bcast /tmp/my.prog; /tmp/my.prog", FLUX_JOB_ID would be set to the bash job (rather than the job ID of the current instance)
Good point!
Though running flux bcast in this way should perhaps be avoided because:
- If your `flux mini run` specifies multiple tasks, you'll be running `flux bcast` multiple times simultaneously
- If your `flux mini run` only specifies one task, then you are running `flux bcast` to copy a file to itself on the local node
It would be nice if we had a way to detect this situation and issue a meaningful error. :thinking:
If you wanted to run flux bcast as a job, e.g. to use it as part of a workflow, then you could use the FLUX_JOB_ID from the environment at the time of submission, and specifically use --parent, though that isn't so user-friendly:
```
flux mini submit flux --parent bcast $FLUX_JOB_ID
```
Sounds like it would be good if we had a way to determine whether the current process is in an initial program (i.e. a batch script) or part of a job. In the second case you could maybe issue an error if JOBID isn't provided.
@JaeseungYeom do you think you could leverage some of your DYAD work for this?
Closing this. We can open issues against flux-archive(1) if there are still things missing.