flux-core

restart: broker may load incorrect topology when restarting with running jobs

Open grondo opened this issue 1 month ago • 4 comments

On systems where user-specified options can modify the actual system topology at runtime (e.g., changing GPU partitioning modes), the broker may encounter a mismatch during restart.

Normally, the broker reads the system topology at startup when no jobs are running, so it sees the default topology. It verifies this matches its expected configuration. Jobs that modify the topology specify resource.rediscover so subinstances can reload and see the changes.

When restart with running jobs is supported, this becomes a problem: if a broker restarts while a job is running with a modified topology, it will read the modified topology from the system. The verification check will fail because the actual topology no longer matches the expected default configuration. The broker will incorrectly conclude the node is misconfigured and drain itself.

We need a solution that handles topology modifications during restart while preserving automatic detection of genuinely misconfigured nodes during verification.

Perhaps the previous topology could be cached in statedir when restarting on a live system (idea from @garlick).

grondo Dec 03 '25 00:12
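To make the failure mode concrete, here is a minimal sketch. Everything in it is hypothetical: the `verify` function, the dictionary layout, and the resource counts are illustrative and are not flux-core code. It only shows why a strict comparison against the expected configuration drains a node whose topology was changed by a running job.

```python
# Hypothetical illustration only; none of these names come from flux-core.
EXPECTED = {"cores": 96, "gpus": 4}  # assumed default configuration


def verify(discovered, expected=EXPECTED):
    """Return 'ok' when discovered resources match expected, else 'drain'."""
    return "ok" if discovered == expected else "drain"


# Clean start: no jobs are running, so the default topology is discovered.
default_topology = {"cores": 96, "gpus": 4}

# Restart while a job has repartitioned the GPUs (counts are made up).
modified_topology = {"cores": 96, "gpus": 24}

print(verify(default_topology))   # ok
print(verify(modified_topology))  # drain
```

The second call is the problem case: the node is healthy, but the check cannot distinguish a job-induced change from a genuine misconfiguration.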

Another thought is to treat "soft restart" specially. Set a flag in the kvs when restarting flux, and when brokers come back up, skip the resource check as long as the nodes haven't rebooted?

garlick Dec 03 '25 00:12
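One way the "nodes haven't rebooted" condition could be tested is by comparing the kernel boot ID, which Linux exposes at `/proc/sys/kernel/random/boot_id` and regenerates on every boot. The sketch below is an assumption about how such a check might look, not an existing flux-core mechanism. The function names are hypothetical, and `saved_boot_id` stands in for whatever the soft-restart flag would record; how that value is stored and retrieved is not shown.

```python
from pathlib import Path

# Linux-specific: this value changes on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def current_boot_id(path=BOOT_ID_PATH):
    """Read the boot ID of the running kernel."""
    return path.read_text().strip()


def may_skip_resource_check(saved_boot_id, boot_id):
    """Skip verification only when a soft-restart flag was recorded
    (saved_boot_id is not None) and the node has not rebooted since."""
    return saved_boot_id is not None and saved_boot_id == boot_id
```

With this shape, a broker that comes up without a recorded flag, or on a node that has rebooted, still runs the full resource check.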

Yeah, I was thinking along the same lines. However, we'd still need to cache the previous hwloc XML somewhere so that the topology loaded on restart doesn't include the extra/incorrect resources.

grondo Dec 03 '25 00:12
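A rough sketch of the caching idea, under stated assumptions: the cache file name, the `save_topology` and `load_topology` helpers, and the `discover` callback are all hypothetical, and the real broker would need to decide when the cached XML is safe to trust. The sketch writes the verified XML atomically into a state directory and prefers it over live discovery only on a soft restart.

```python
import os
import tempfile
from pathlib import Path

CACHE_NAME = "topology-cache.xml"  # hypothetical file name


def save_topology(statedir, xml):
    """Atomically write the verified topology XML into statedir."""
    statedir = Path(statedir)
    fd, tmp = tempfile.mkstemp(dir=statedir)
    with os.fdopen(fd, "w") as f:
        f.write(xml)
    # Rename within the same directory so readers never see a partial file.
    os.replace(tmp, statedir / CACHE_NAME)


def load_topology(statedir, discover, soft_restart):
    """Prefer the cached XML on a soft restart; otherwise rediscover."""
    cache = Path(statedir) / CACHE_NAME
    if soft_restart and cache.exists():
        return cache.read_text()
    return discover()
```

On a normal start (`soft_restart` false) the live topology is still discovered and can be verified as before, so genuinely misconfigured nodes would continue to be caught.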

It turns out this is a potential issue even now: recently, a node ran out of memory, and the resulting slowness caused the broker to hit its heartbeat timeout, so the broker exited and was restarted by systemd. The job running at the time had selected CPX GPU mode, so on restart the broker got the wrong topology.

grondo Dec 03 '25 15:12

I wonder if we are validating resources in the wrong place? Should we move it to a prolog snippet?

garlick Dec 03 '25 16:12