
nodes with misconfigured GPUs are not drained

Open grondo opened this issue 1 month ago • 5 comments

On a production system recently, Flux was started on a node that was still in CPX mode, but the node was not drained. Since the resource module caches the hwloc XML for reuse, this caused confusion for mpibind, which was handed the cached CPX-mode topology even after CPX mode was no longer enabled.

There is a comment in topo_verify() that explains the rationale for ignoring GPUs, but I wonder if that is still relevant? We do have the noverify config option for cases where GPUs are not properly detected.

https://github.com/flux-framework/flux-core/blob/5a70c401bda8a16439a86578f830a101b897f526/src/modules/resource/topo.c#L100-L112
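
For reference, the existing escape hatch is the boolean noverify option in the resource table, which disables topology verification entirely (a minimal sketch; see flux-config-resource(5)):

[resource]
noverify = true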

grondo avatar Dec 02 '25 22:12 grondo

Just to complicate this further, I've been working on a 'low noise mode' configuration where we exclude the cores that have OS tasks pinned to them from the resource config, and use resource.rediscover and resource.norestrict in a CLI plugin to let users who want all of the cores rediscover them. So I can imagine a scenario where we want to noverify the CPU configuration but verify the GPU configuration.
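
For context, resource.rediscover and resource.norestrict are resource module config options; as a TOML sketch of what a subinstance might be configured with (illustrative, based on the option names above, not a tested config; see flux-config-resource(5)):

[resource]
rediscover = true    # rediscover resources via hwloc instead of using the assigned R
norestrict = true    # do not restrict the topology to the broker's cpu affinity mask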

ryanday36 avatar Dec 03 '25 00:12 ryanday36

Also, for that case you may want the system instance broker pinned to the OS cores, yet using the topology of the configured cores. We may need to ponder a better way to handle this.
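
For the pinning half of that, one possibility (assuming the system broker runs under systemd as flux.service) is a drop-in unit; the core list here is a placeholder:

# /etc/systemd/system/flux.service.d/pin.conf -- hypothetical drop-in
[Service]
CPUAffinity=0-1

How the broker would then pick up the configured-core topology instead of its own affinity mask is the open question.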

grondo avatar Dec 03 '25 00:12 grondo

Until a fix for this issue is merged and propagated to affected systems, it may be useful to install an rc1 task that does the equivalent check, i.e. verifies during startup that the count of GPUs matches the expected value.

Here's a lightly tested modprobe rc1 task that does that. The task has the following features:

  • Only activates on system instances via needs_config=["bootstrap"] (I think that's the best way to check for a system instance at this time)
  • Skips excluded nodes (this isn't tested fully yet)
  • Gets the configured GPU count from the system R by fetching it from the KVS
  • Fetches the cached hwloc topology GPU count from the local broker
  • If the two do not match, or there is some other error, the node drains itself

To activate this script, simply drop it into /etc/flux/modprobe/rc1.d. Once it is no longer needed, it can be removed; it won't cause any problems or prevent updates of other modprobe tasks if it remains.

import os
import sys
import subprocess
from tempfile import NamedTemporaryFile

import flux
from flux.modprobe import task
from flux.resource import ResourceSet
from flux.idset import IDset


def count_gpus(xml_string):
    """
    Count GPUs in a hwloc topology XML file.

    Unfortunately, this requires calling out to hwloc-calc, which doesn't
    seem to support taking the XML string on stdin, so a temporary file
    must be used.
    """
    # Write XML to a temporary file since hwloc-calc needs a file input
    with NamedTemporaryFile(mode="w", suffix=".xml", delete=False) as tmp:
        tmp.write(xml_string)
        tmp_path = tmp.name

    try:
        # Use hwloc-calc to count OS devices of type GPU
        result = subprocess.run(
            [
                "hwloc-calc",
                "--if",
                "xml",
                "--input",
                tmp_path,
                "--number-of",
                "osdev[gpu]",
                "all",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
        )

        gpu_count = int(result.stdout.strip())
        return gpu_count

    except subprocess.CalledProcessError as e:
        raise ValueError(f"Error running hwloc-calc: {e.stderr}")
    finally:
        # Clean up temporary file
        try:
            os.unlink(tmp_path)
        except OSError:
            pass


def drain(context, reason):
    """
    Drain the current rank with reason ``reason``.
    """
    payload = {"targets": str(context.rank), "reason": reason}
    context.handle.rpc("resource.drain", payload, nodeid=0).get()


def expected_ngpus(context):
    """
    Get the expected number of GPUs on this node from the resource
    inventory. If the current rank does not appear in the inventory,
    then return -1 which indicates this node does not need to be checked.
    """
    rset = ResourceSet(flux.kvs.get(context.handle, "resource.R"))
    if context.rank not in rset.ranks:
        return -1
    this_node = rset.copy_ranks(str(context.rank))
    return this_node.ngpus


def topo_ngpus(context):
    """
    Get the count of GPUs in the locally discovered hwloc topology cached
    in the broker resource module
    """
    return count_gpus(
        context.handle.rpc("resource.topo-get", nodeid=context.rank).get_str()
    )


@task("check-gpus", after=["resource"], needs_config=["bootstrap"])
def check_gpus(context):
    """
    Ensure that the cached XML topology in the resource module contains
    the expected number of GPUs from the resource inventory. Drain the node
    if the counts do not match.

    This should only be necessary in a system instance. Therefore, this
    task only runs if there's a bootstrap entry in the config.

    This is a workaround for flux-core issues #7220 and #7223.
    """

    # Skip already excluded ranks:
    if context.rank in IDset(context.conf_get("resource.exclude", "")):
        return

    try:
        expected_gpus = expected_ngpus(context)
        if expected_gpus < 0:
            return
        ngpus = topo_ngpus(context)
        if ngpus != expected_gpus:
            print(
                f"draining rank {context.rank}: {ngpus} GPUs != {expected_gpus}",
                file=sys.stderr,
                flush=True,
            )
            drain(context, f"check-gpus: expected {expected_gpus} GPUs, got {ngpus}")
    except Exception as exc:
        print(f"check-gpus: {exc}", file=sys.stderr)
        drain(context, f"check-gpus failed: {exc}")

grondo avatar Dec 05 '25 18:12 grondo

While working on this, I discovered that the resource module only drains nodes with missing resources, not extra resources:

https://github.com/flux-framework/flux-core/blob/c31fb47f817e9892c5be0b36100f1e7fb46cdbd4/src/modules/resource/topo.c#L112-L120
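
Schematically, in Python rather than the actual C (a toy sketch of the sign convention only):

def rlist_verify(expected: int, discovered: int) -> int:
    # Toy stand-in for flux's rlist_verify(): negative when resources
    # are missing, positive when there are extras, zero on exact match.
    return discovered - expected

def topo_verify(expected: int, discovered: int) -> None:
    if rlist_verify(expected, discovered) < 0:
        print("drain: resources are missing")
    # A positive return (extra resources) is not treated as a drain
    # condition, which is the gap described above.

topo_verify(expected=4, discovered=8)  # extra GPUs: no drain today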

So, removing the exclusion of GPUs would not fix the actual issue of concern here.

My plan is to:

  • remove the special-case ignoring of GPUs when validating resources
  • extend the noverify syntax to allow an optional list of resource types to skip during verification, e.g. noverify = ["core"]
  • add a new option to also drain on extra resources (i.e. when rlist_verify() returns > 0); see the sketch after this list
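
A sketch of what the resulting configuration might look like (the noverify list form is the proposal above; the name of the drain-on-extra option is hypothetical):

[resource]
noverify = ["core"]     # skip verification for cores only
drain-on-extra = true   # hypothetical name for the proposed option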

grondo avatar Dec 10 '25 17:12 grondo

After implementing the noverify = ["core"] support, I realized this approach does not cleanly handle failing verification when more resources than configured are discovered.

Instead, maybe something like a new resource.verify table should be developed to configure verification where necessary. The verify table would support per-resource-type configuration of the verify mode, with supported modes:

  • strict: Strict verification, drain if resources do not match exactly
  • allow-extra: Allow extra resources of this type
  • allow-missing: Allow missing resources of this type
  • ignore: Do not verify this resource type

A default mode could also be supported. The default would then look something like this:

[resource.verify]
default = "allow-extra"
hostname = "strict"

To handle @ryanday36's case above, where we'd want to allow extra cores but not extra GPUs:

[resource.verify]
gpu = "strict"

or alternatively

[resource.verify]
default = "strict"
core = "allow-extra"

Figured I'd give people a minute to comment before attempting to implement this alternate approach.

grondo avatar Dec 11 '25 20:12 grondo