flux-core
use case: SCR needs allocations to tolerate single node failure
A job that needs many nodes has a higher likelihood that a node will fail during the run. A user may handle this by requesting an extra node (N+1 total nodes). The user launches the job on N nodes, leaving one idle, and the job stores checkpoints periodically. When one of the N nodes fails, the user re-launches the job on the remaining N good nodes (using the formerly spare node), which reads in the latest checkpoint file and restarts.
Even without node-local storage or SCR, this is still desirable to users, since requesting a new allocation after the failure may put them well back in the scheduler's queue and significantly delay completion of the job.
With node-local storage and SCR, the checkpoints are stored on node-local storage. To restart the job, the checkpoint data needed on the "formerly spare" node is reconstructed (from the N-1 nodes that are still up) and written to node-local storage there, and then the job is re-launched. This is much, much faster than copying all the checkpoint data to a global file system and then reading it back on a new set of nodes.
Note that users might actually allocate more than one extra node, this is just an example.
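For concreteness, the N+1 workflow might look roughly like this under Flux (a minimal sketch; the node counts and the checkpointing application ./my_app are hypothetical placeholders):
$ flux mini alloc -N9          # request N+1 nodes (here N=8 plus one spare)
$ flux mini run -N8 ./my_app   # run on N nodes; the app writes checkpoints periodically
# ... a node fails and is drained, leaving 8 good nodes including the former spare ...
$ flux mini run -N8 ./my_app   # relaunch on the 8 remaining good nodes; the app reads
                               # its latest checkpoint and resumes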
Currently a node failure is fatal to the allocation, although there is a few-second period after the downed node is drained and before the allocation is lost:
[faaland1@fluke6 ~] $rpm -qa | grep ^flux | sort
flux-accounting-0.17.0-1.t4.x86_64
flux-core-0.41.0-4.t4.x86_64
flux-core-python38-0.41.0-4.t4.x86_64
flux-pam-0.1.0-1.t4.x86_64
flux-sched-0.23.0-1.t4.x86_64
flux-security-0.7.0-1.t4.x86_64
[faaland1@fluke6 ~] $flux resource status
STATUS NNODES NODELIST
avail 8 fluke[6-13]
### fluke13 is powered off here
2022-07-19T16:58:49.000465Z broker.err[3]: broker on fluke13 (rank 7) has been unresponsive for 31.9998s
[faaland1@fluke6 ~] $flux resource status
STATUS NNODES NODELIST
avail 7 fluke[6-12]
drained 1 fluke13
[faaland1@fluke6 ~] $flux mini run --ntasks 28 hostname | sort | uniq -c
4 fluke10
4 fluke11
4 fluke12
4 fluke6
4 fluke7
4 fluke8
4 fluke9
[faaland1@fluke6 ~] $2022-07-19T17:00:20.237149Z broker.err[6]: runat_abort cleanup (signal 15): No such file or directory
2022-07-19T17:00:20.238896Z broker.err[4]: rc3.0: /etc/flux/rc3 Hangup (rc=129) 0.0s
2022-07-19T17:00:20.239783Z broker.err[0]: rc2.0: /bin/bash Hangup (rc=129) 188.1s
199.254s: job.exception type=exec severity=0 lost contact with job shell on broker fluke13 (rank 10)
flux-job: task(s) exited with exit code 129
rsmi_init() failed
rsmi_init() failed
rsmi_init() failed
rsmi_init() failed
rsmi_init() failed
rsmi_init() failed
rsmi_init() failed
E: (flux-broker) 22-07-19 10:00:20 [286841]dangling 'PAIR' socket created at shmem.c:148
E: (flux-broker) 22-07-19 10:00:20 [286841]dangling 'PAIR' socket created at shmem.c:148
E: (flux-broker) 22-07-19 10:00:20 [286841]dangling 'PAIR' socket created at shmem.c:148
E: (flux-broker) 22-07-19 10:00:20 [286841]dangling 'PAIR' socket created at shmem.c:148
E: (flux-broker) 22-07-19 10:00:20 [286841]dangling 'PAIR' socket created at shmem.c:148
E: (flux-broker) 22-07-19 10:00:20 [286841]dangling 'PAIR' socket created at shmem.c:148
[detached: remote pty disappeared]
@adammoody and @kathrynmohror FYI
This doesn't help you right now, but to be clear, this is the current intended behavior. The exec system handles "a node crashed that was allocated to a job" by raising a fatal exception on the job. It doesn't do anything different if the job is an MPI job or, as in this case, a flux instance, even though the flux instance has all the same behaviors as the system instance, which does handle crashed nodes.
We'll want to explore changing that behavior when the job is a flux instance. One thought is to raise a non-fatal exception on the job, followed by a fatal one if the job doesn't respond to it with "I want to live" within a period of time. That's just one idea, there may be other better approaches.
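In the meantime, one way to experiment with how a job reacts to a non-fatal exception is to raise one by hand with flux job raise (a sketch; the job ID and message are placeholders):
$ flux job raise --severity=2 --type=node-failure ƒABC123 "broker on rank 3 is unresponsive"
Severity 0 is fatal by default; a higher severity leaves the job running unless something acts on the exception.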
Related considerations:
- we'll want a way to increase the TBON fanout from the default 1:2 in large, long-running jobs (we use 1:256 in the system instance). Losing a router node currently causes loss of its subtree
- possibly a good spares strategy given the above is to make the router nodes the spares and drain them by default, with the idea that they'll be less crashy without a user workload
Thanks @ofaaland and @garlick , in SLURM lingo, we're looking for the same type of functionality that we get from sbatch --no-kill.
Thanks @garlick.
we'll want a way to increase the TBON fanout ... Losing a router node currently causes loss of its subtree
Are there plans to make the tree tolerate loss of a router node at some point in the future?
No plans in the near term, although our priorities will no doubt shift around as we get feedback during rollout.
@trws @garlick I'm following up on @kathrynmohror 's conversation with Tom.
If I understand correctly, fully satisfying this use case requires multiple changes to flux. Is that correct? And is the list below a reasonable description of the changes required?
a. alter flux so node failure doesn't always result in a fatal exception, without changing anything about the TBON.
b. alter flux so the user can change the TBON fanout to reduce the odds that the failed node is a router node.
c. alter flux so the TBON can tolerate router node failure.
If that's correct, is designing and implementing only (a), or (a) and (b), an option worth considering?
That would allow us to test and confirm SCR handles node failure correctly under Flux, at least under the easy scenarios.
When we start getting El Capitan, nodes will be constantly failing until we have weeded out bad hardware and worked through software and configuration issues. So even partially satisfying this use case could significantly improve the user experience when the cluster is young and fragile.
Thanks
(b) is already possible by setting `--broker-opts="-Stbon.fanout=N"` on a batch job.
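For example (a sketch; the node count, fanout value, and batch script name are placeholders):
$ flux mini batch -N128 --broker-opts=-Stbon.fanout=16 ./batch_script.sh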
(c) is IMHO too challenging to work on in the El Capitan time frame, given everything else that is happening with Flux.
So I'd probably start breaking down (a) here.
Cool, thanks Jim.
So I'd probably start breaking down (a) here.
Currently, the job-exec module in the parent monitors for lost connections to job shells (i.e. when the exec protocol gets an EHOSTUNREACH error) and terminates the job when this occurs. @garlick earlier had the idea that jobs that are also Flux instances should be allowed to decide when they should terminate, since the child instance is also monitoring the liveness of its own brokers. In the normal case, the normal jobs running in the child instance would be terminated with a fatal exception when the connection to the job shell was lost; presumably the workload would then exit with failure and the job would terminate gracefully. If any deeper-level jobs are also instances, that same scenario would play out down through all the turtles.
So, we need a way to enable "free range parenting" mode for Flux instances in contrast to "helicopter parenting mode" for normal jobs.
It would be trivial to add an option to avoid the fatal job exception on EHOSTUNREACH (instead perhaps a non-fatal exception could be used). It would be interesting to see whether the job-exec module would then handle normal termination of the rest of the job shells, or whether some further changes would be needed there. If not, this could be a pretty trivial change.
Of course, in addition to free-range-parent mode, flux mini batch and flux mini alloc could also automatically set tbon.fanout=0 to get a flat tree (maybe due to that, this mode of operation should be opt-in only)
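Until something like that is automatic, a user could opt in by hand (a sketch; the node count is a placeholder):
$ flux mini alloc -N8 --broker-opts=-Stbon.fanout=0   # flat tree: all other brokers are leaves of rank 0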
@ofaaland, as an experiment I added support for a new jobspec attribute exec.ignore-lost-ranks which, when set, causes the job-exec module to raise a non-fatal exception when it loses contact with a job shell instead of a fatal exception. This is the very first baby step to allowing child instances to determine their own destiny.
The branch is at grondo/issue#4573 and is meant to be something we can experiment with to see how far this gets us.
Testing without actually powering off nodes is tricky, but here is what I did:
- Start a new instance that will serve as a representative system instance with this feature:
$ flux mini run -o pty.interactive -o exit-timeout=none -N4 src/cmd/flux start -o,-Stbon.fanout=0
- Start a child instance under that instance to represent the fault-tolerant batch job:
$ flux mini run -o pty.interactive -o exit-timeout=none --setattr exec.ignore-lost-ranks -N4 src/cmd/flux start -o,-Stbon.fanout=0
$ flux uptime
18:47:40 run 5.1s, owner grondo, depth 2, size 4
$ flux overlay status
0 corona171: full
├─ 1 corona173: full
├─ 2 corona174: full
└─ 3 corona175: full
- Start a job in the depth=2 instance to simulate actual work:
$ flux mini run -N4 sleep inf
- Now, to simulate a node failure in the parent instance of this job, we can kill a broker from the parent instance. However, this will cause a cascading failure in the child broker and job shells, so it is best to just kill everything on the node to avoid inadvertently getting a fatal job exception for some other reason (there may actually be a better way to do this, but this works for now):
$ pstree -Tplu grondo
flux-shell(190854)───flux-broker-3(190856)───lt-flux-shell(191171)───flux-broker-3(191185)
$ kill -9 190856 191171 191185
- In my terminal running the sleep job, I get no dearth of errors:
$ flux mini run -N4 sleep inf
2022-09-16T01:52:21.308822Z broker.err[0]: corona175 (rank 3) transitioning to LOST due to EHOSTUNREACH error on send
26.684s: job.exception type=exec severity=0 lost contact with job shell on broker corona175 (rank 3)
2022-09-16T01:52:21.593552Z broker.err[0]: corona175 (rank 3) transitioning to LOST due to EHOSTUNREACH error on send
286.375s: job.exception type=node-failure severity=2 lost contact with job shell on corona175
flux-job: task(s) exited with exit code 143
$
Note that we see a severity 0 error from our sleep job which terminates it, and a severity 2 node-failure exception for the instance simulating a batch job.
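One way to confirm which exception each job received is to dump the eventlogs, running each command in the instance that owns the job (a sketch; the job IDs are placeholders):
$ flux job eventlog ƒSLEEPJOB | grep exception   # depth=2 instance: severity=0 exec exception
$ flux job eventlog ƒBATCHJOB | grep exception   # depth=1 instance: severity=2 node-failure exception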
After this, the depth=2 instance survives:
$ flux uptime
18:56:27 run 8.9m, owner grondo, depth 2, size 4, 1 offline
$ flux overlay status
0 corona171: degraded
├─ 1 corona173: full
├─ 2 corona174: full
└─ 3 corona175: lost
and we can use the remaining 3 nodes successfully:
$ flux mini run -N3 hostname
corona171
corona173
corona174
Once the "initial program" (i.e. batch script) exits, the instance should exit normally:
$ exit
exit
[detached: session exiting]
But this last part doesn't seem to be working reliably. I'll have to look into why the instance is hanging around:
$ flux jobs
JOBID USER NAME ST NTASKS NNODES RUNTIME NODELIST
ƒECFSHkT grondo flux R 4 4 12.53m corona[171,173-175]
So we are not quite there, but close. It may be that the job-exec module needs to know that it is never going to hear from a lost job shell again (probably the instance is waiting for that job shell to finish). Manually canceling the job cleans it up for now:
$ flux job cancel ƒECFSHkT
Finally, note that corona175 is still down in our depth=1 instance:
$ flux resource list
STATE NNODES NCORES NGPUS NODELIST
free 3 144 24 corona[171,173-174]
allocated 0 0 0
down 1 48 8 corona175
@grondo cool! That seems very promising! Thank you
@ofaaland: I've gotten a little further in #4615, though it is pretty experimental at this point.
It looks like you've learned a lot about where (some of) the pointy bits are. Thanks! Very cool!
FYI - #4615 is now proposed for merging. When the PR is merged, I think we could close this issue.
The most recent version of the PR adopts @garlick's idea of having child instances report their "critical ranks" (i.e. ranks for which the job exec system should raise a fatal job exception when they are "lost") to the parent instance, which effectively makes all batch (and flux mini alloc) jobs in Flux resilient to the loss of brokers that are leaf nodes in the overlay.
To make a job more resilient, you can increase the tbon.fanout or use tbon.fanout=0 for a flat topology, in which case only rank 0 is critical. (We may make a convenience option in flux mini to set this attribute along with any other tunings that might help with job resiliency)
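For example, a batch job could be submitted with a flat topology like this (a sketch; the node count and script name are placeholders):
$ flux mini batch -N16 --broker-opts=-Stbon.fanout=0 ./batch_script.sh
Inside that instance, flux overlay status should show every other rank as a direct child of rank 0, so losing any rank other than 0 raises only a non-fatal node-failure exception on the batch job.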
Closing this issue since #4615 was merged awhile ago. Flux instances can now survive the loss of non-critical ranks. Feel free to open new issues for other missing features!
@grondo I'm resurrecting this old issue for some SCR work. In a nested instance, it looks like if I simulate a node failure by killing one of the non-critical brokers, the parent's doom plugin kicks in after a timeout. If I create the nested instance with -oexit-timeout=none (e.g. flux alloc -N3 -oexit-timeout=none) then the doom plugin doesn't kick in and the instance survives. So I was initially thinking that, in order to tolerate non-critical node failures, users would need to add the -oexit-timeout=none option. But I'm guessing that in the event of a real node failure, the parent instance wouldn't trigger the doom plugin at all, because it would see a lost rank of its own rather than a task that failed on an otherwise working node?
Just want to verify before I start working on the rather more arduous task of actually killing nodes.
Yes, -oexit-timeout=none does have to be specified. Combining this option with something like a flat tbon and whatever else is needed for a resilient child job was the last task in #4583, but it wasn't really clear what the best approach would be for all resilient jobs, so that last step stalled out.
But I'm guessing that in the event of a real node failure, the parent instance wouldn't trigger the doom plugin at all, because it would see a lost rank of its own rather than a task that failed on an otherwise working node?
Yes, that's correct (one small correction though, the doom plugin is part of the job shell which has started the brokers, not the parent instance).
Actually, now that you mention it, I wonder if the doom plugin even makes sense for flux alloc and flux batch jobs. Maybe it would be nicer if we set -oexit-timeout=none by default for these jobs.
As a reminder, how this capability works is that the execution system tracks "critical ranks" for jobs. If it loses contact with a job shell (because of a node failure for instance) and that shell was on a critical rank, a job exception is raised. By default all ranks are critical, but all jobs that are instances of Flux report back a subset of critical ranks so that they can tolerate failure of leaf shells/nodes. Therefore, to make an instance as resilient as possible, the tbon.topo attribute should be set such that there are as few internal nodes in the topology as possible.
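Putting the pieces from this thread together, a more resilient nested instance might be requested roughly like this (a sketch; the node count is a placeholder, and I'm assuming kary:0 is the tbon.topo value that selects a flat tree):
$ flux alloc -N8 -oexit-timeout=none --broker-opts=-Stbon.topo=kary:0   # kary:0 assumed to give a flat tree
With a flat tree only rank 0 is critical, and -oexit-timeout=none keeps the job shell's exit timeout from tearing down the instance after one of its brokers goes away.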
Hope that helps!
Perfect, thank you!