
librlist: rlist_from_hwloc() encodes logical instead of physical GPUs

Open · grondo opened this issue 5 years ago • 3 comments

The function used by flux-core to "discover" a local Rv1 via hwloc indexes GPU devices as follows:

char * rhwloc_gpu_idset_string (hwloc_topology_t topo)
{
    int index;
    char *result = NULL;
    hwloc_obj_t obj = NULL;
    struct idset *ids = idset_create (0, IDSET_FLAG_AUTOGROW);

    if (!ids)
        return NULL;

    /*  Manually index GPUs -- os_index does not seem to be valid for
     *  these devices in some cases, and logical index also seems
     *  incorrect (?)
     */
    index = 0;
    while ((obj = hwloc_get_next_osdev (topo, obj))) {
        /* Only count cudaX and openclX devices for now */
        const char *s = hwloc_obj_get_info_by_name (obj, "Backend");
        if (s && ((strcmp (s, "CUDA") == 0) || (strcmp (s, "OpenCL") == 0)))
            idset_set (ids, index++);
    }
    if (idset_count (ids) > 0)
        result = idset_encode (ids, IDSET_FLAG_RANGE);
    idset_destroy (ids);
    return result;
}

However, this numbers GPU devices logically when CUDA_VISIBLE_DEVICES is set. For example, with CUDA_VISIBLE_DEVICES=1, "gpu": "0" will be set in the resulting R object, since hwloc discovers only one GPU:

$ src/cmd/flux R encode --local | src/cmd/flux R decode --short
rank0/core[0-35],gpu[0-1]
$ CUDA_VISIBLE_DEVICES=1 src/cmd/flux R encode --local | src/cmd/flux R decode --short
rank0/core[0-35],gpu0
$ CUDA_VISIBLE_DEVICES=0 src/cmd/flux R encode --local | src/cmd/flux R decode --short
rank0/core[0-35],gpu0
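
For reference, the R object in the CUDA_VISIBLE_DEVICES=1 case would contain something roughly like the following (a sketch of the Rv1 R_lite form; field details are in RFC 20):

{
  "version": 1,
  "execution": {
    "R_lite": [
      { "rank": "0", "children": { "core": "0-35", "gpu": "0" } }
    ]
  }
}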

A fix would be to check for CUDA_VISIBLE_DEVICES in the environment and, if set, take the GPU device indexes from that variable (treated as an idset) rather than assigning an arbitrary logical index. This would instead produce:

$ CUDA_VISIBLE_DEVICES=1  src/cmd/flux R encode --local | src/cmd/flux R decode --short
rank0/core[0-35],gpu1
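
A minimal sketch of that approach, assuming libidset's idset_decode() is in scope as in the function above (the helper name is hypothetical, not an actual patch):

#include <stdlib.h>     /* getenv */

/*  Sketch only: decode CUDA_VISIBLE_DEVICES directly as an idset.
 *  Returns NULL if the variable is unset or empty, in which case the
 *  caller falls back to the manual numbering above.
 */
static struct idset *gpu_ids_from_env (void)
{
    const char *s = getenv ("CUDA_VISIBLE_DEVICES");
    if (!s || *s == '\0')
        return NULL;
    /*  CUDA_VISIBLE_DEVICES is a comma-separated list of device ids,
     *  e.g. "1" or "0,2", which idset_decode() can parse directly.
     *  (GPU UUID entries, if present, would need separate handling.)
     */
    return idset_decode (s);
}

rhwloc_gpu_idset_string() could then encode this idset with idset_encode (ids, IDSET_FLAG_RANGE) when it is non-NULL, and only walk the hwloc os devices when the variable is unset.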

This issue is perhaps academic, since flux-core doesn't currently have any way to schedule GPUs, and Fluxion will discover these resources separately on its own. However, it would be nice for testing if discovery worked correctly here.

grondo avatar Nov 19 '20 00:11 grondo

Is this still an issue? Since fluxion now uses Rv1 rather than hwloc, the "academic" comment at the end might no longer apply...

garlick avatar Apr 05 '22 14:04 garlick

Yeah, I think it might be even more of an issue now. If Flux is launched by a foreign RM (so that the instance is doing resource discovery via hwloc) with CUDA_VISIBLE_DEVICES set, now even Fluxion will see the logical GPU device ids instead of the physical ones, which could conceivably result in multiple jobs being scheduled to the same GPU.

I'm struggling to decide if this is a practical problem though. I think there would only be an issue if multiple Flux instances were scheduled to the same node, each with a subset of GPUs assigned (which seems an unlikely case). In any other case, I think Flux components will be using R from the parent so each will have different GPU ids assigned...

grondo avatar Apr 05 '22 14:04 grondo

At some point, https://github.com/flux-framework/flux-sched/pull/926#issuecomment-1090438275 needs to be ported to the rv1exec reader so that we get some coverage in this space.

For the foreign RM issue, I wonder if this should be solved by some documentation: "To ensure correctness, the foreign RM must use a more deterministic GPU numbering such as CUDA_DEVICE_ORDER=PCI_BUS_ID."
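
For example, such documentation might include something like the following (a hypothetical snippet; CUDA_DEVICE_ORDER is a standard CUDA environment variable):

# Make CUDA enumerate GPUs in stable PCI bus order before the foreign
# RM sets CUDA_VISIBLE_DEVICES and launches Flux:
$ export CUDA_DEVICE_ORDER=PCI_BUS_ID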

dongahn avatar Apr 06 '22 16:04 dongahn