
-o gpu-affinity=per-task choosing 'wrong' gpus on tioga

Open ryanday36 opened this issue 3 years ago • 15 comments

see also https://rzlc.llnl.gov/jira/browse/ELCAP-179

The short version of this, I think, is that cpu-affinity and gpu-affinity assign the lowest numbered CPUs and lowest numbered GPUs to the lowest numbered tasks, but on the El Cap hardware, the lowest numbered CPUs are not "closest" (by bandwidth) to the lowest numbered GPUs. The mapping actually looks like:

Processor 0 : GPUs 4,5
Processor 1 : GPUs 2,3
Processor 2 : GPUs 6,7
Processor 3 : GPUs 0,1

whereas '-o cpu-affinity=per-task -o gpu-affinity=per-task' currently gives:

Processor 0 : GPUs 0,1
Processor 1 : GPUs 2,3
Processor 2 : GPUs 4,5
Processor 3 : GPUs 6,7
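One way to double-check the physical locality is to ask hwloc for the cpuset nearest each GPU OS device. A rough sketch with hwloc-calc, assuming the rsmi0-rsmi7 device names that lstopo reports and a correctly detected topology:

# print the cpuset local to each AMD GPU OS device (names assumed to be rsmi0..rsmi7)
for g in 0 1 2 3 4 5 6 7; do
    printf "rsmi%d: " "$g"
    hwloc-calc os=rsmi$g
done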

ryanday36 avatar Sep 27 '22 15:09 ryanday36

We may need to augment the gpubind shell plugin to use hwloc to assign GPUs to each task with -o gpu-affinity=per-task. Alternately, I wonder if mpibind would "just work" here?

There also may be an issue with the Fluxion scheduler here. It will have to know which GPUs are closest to which CPUs when assigning resources to jobs that share nodes, e.g. within a batch job. For example, if a job asks for 1 task with 2 GPUs per task and it is assigned Processor 0, does it get GPUs 4,5? I kind of doubt it. To fix this, we may have to make Fluxion aware of the topology on this system somehow, e.g. by generating JGF to stick in the Rv1 .scheduling key. If you can validate whether this is a problem, then we should open a separate issue in flux-sched and strategize there.

grondo avatar Sep 27 '22 16:09 grondo

Mpibind is getting there. Historically (i.e. in the Slurm plugin version), it does well when one job has all of the resources on the node, but has trouble when multiple jobs are running on a node. I'm not sure yet how well the Flux plugin will do with the same cases. If I run multiple 'flux mini run -n1 ...' commands inside of an instance from a 'flux mini alloc -N1', are those using fluxion or are they scheduled by sched-simple? They appear to have the same GPU and CPU affinity sets as the 'flux mini run -n4' case.

ryanday36 avatar Sep 27 '22 16:09 ryanday36

The scheduler inside of a flux mini alloc or flux mini batch should still be Fluxion whenever flux-sched is installed. You can check with flux module list | grep sched. The only difference is that the configured scheduler "policy" will be the default instead of whatever the system policy is, e.g. one of the exclusive node policies "hinodex" or "lonodex".

Plus sched-simple doesn't support GPUs, so you'd get an error trying to request gpus with --gpus-per-task.

It would be interesting to see what R looks like for jobs within your flux mini alloc session when they request a single processor and 1 or 2 GPUs.
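For example, something like (using flux mini submit so the jobid is printed):

JOBID=$(flux mini submit -n1 -c1 -g2 sleep 60)
flux job info $JOBID R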

BTW, related to -o gpu-affinity=per-task: @trws is our hwloc expert, I think, and he may be able to suggest how to fix or replace the gpubind shell plugin so it selects the correct GPU from the set of available GPUs. (However, this assumes that the scheduler has chosen the correct GPUs to assign to the job.)

grondo avatar Sep 27 '22 17:09 grondo

It might take a sched change if we aren't encoding the locality of the GPUs yet; we'll have to think about that. Depending on what the situation is, we might be able to encode the GPUs shown to sched and selected by it in a way that doesn't require knowing their number on the final node, but indexes them by, say, socket and logical id off of the socket. Will have to look into this. The mpibind plugin should select the local GPUs when it can, but if we're only giving it access to the ones sched selected, that will not help.

trws avatar Sep 27 '22 18:09 trws

Injecting my name here so I get updates (I opened the jira)

jjellio avatar Sep 28 '22 00:09 jjellio

Transferred this issue to flux-sched since it is the thing assigning GPUs in this case.

grondo avatar Oct 26 '22 20:10 grondo

I investigated the behavior of hwloc on the Tioga system to see if and how it can generate an XML that can be loaded by the Fluxion resource-query utility. With resource-query I tested whether Fluxion can generate a mapping that takes into account CPU-GPU locality.

First, some background. On tioga, lstopo (based on hwloc 2.1.9) returns a warning indicating that it is ignoring invalid hardware topology, and the resulting XML file has the GPUs hanging off by themselves underneath the node. To detect the topology correctly, you need to set HWLOC_COMPONENTS=x86 in the environment:


[milroy1@tioga11:~]$ lstopo
Machine (503GB total)
  Package L#0
    Group0 L#0
      NUMANode L#0 (P#0 125GB)
      L3 L#0 (32MB)
        L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
        L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
        L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
        L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
        L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
        L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
        L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
        HostBridge
          PCIBridge
            PCI d1:00.0 (Display)
              GPU(RSMI) "rsmi4"
[...]
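To capture this corrected topology as XML for the tests below (a sketch; the file name is arbitrary):

export HWLOC_COMPONENTS=x86
lstopo --of xml test.xml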

Loading the corresponding XML file works as expected with resource-query:


$ ./resource-query -L test.xml -f hwloc
resource-query> find status=up
[...]
      ---------------L3cache6[32768:x]
      ---------------------pu56[1:x]
      ------------------core56[1:x]
      ---------------------pu57[1:x]
      ------------------core57[1:x]
      ---------------------pu58[1:x]
      ------------------core58[1:x]
      ---------------------pu59[1:x]
      ------------------core59[1:x]
      ---------------------pu60[1:x]
      ------------------core60[1:x]
      ---------------------pu61[1:x]
      ------------------core61[1:x]
      ---------------------pu62[1:x]
      ------------------core62[1:x]
      ---------------------pu63[1:x]
      ------------------core63[1:x]
      ------------------gpu1[1:x]
      ---------------L3cache7[32768:x]
      ---------------numanode3[1:x]
      ------------group3[1:x]
      ---------socket0[1:x]
      ---------storage13[52:x]
      ---------storage14[52:x]
      ------tioga11[1:x]
      ---cluster0[1:x]
INFO: =============================
INFO: EXPRESSION="status=up"
INFO: =============================

But the resource graph contains far more detail, and hence many more resource vertices, than can be represented by a jobspec generated from, e.g., flux run -N1 -c4 -n1 -g4 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 2:

{
  "resources": [
    {
      "type": "node",
      "count": 1,
      "with": [
        {
          "type": "slot",
          "count": 1,
          "with": [
            {
              "type": "core",
              "count": 4
            },
            {
              "type": "gpu",
              "count": 4
            }
          ],
          "label": "task"
        }
      ]
    }
  ],
  "tasks": [
    {
      "command": [
        "sleep",
        "2"
      ],
      "slot": "task",
      "count": {
        "per_slot": 1
      }
    }
  ],
  "attributes": {
    "system": {
      "duration": 0,
      "cwd": "",
      "shell": {
        "options": {
          "rlimit": {
            "cpu": -1,
            "fsize": -1,
            "data": -1,
            "stack": -1,
            "core": 16384,
            "nofile": 128000,
            "as": -1,
            "rss": -1,
            "nproc": 8192
          },
          "cpu-affinity": "per-task",
          "gpu-affinity": "per-task"
        }
      }
    }
  },
  "version": 1
}

This means that Fluxion won't return a match (the jobspec above, converted to YAML for resource-query):

resource-query> match allocate jobspec.yaml
INFO: =============================
INFO: No matching resources found
INFO: JOBID=1
INFO: =============================

However, a jobspec like this will work:

version: 1
resources:
  - type: node
    count: 1
    with:
     - type: socket
       count: 1
       with:
        - type: slot
          label: task
          count: 4
          with:
           - type: group
             count: 1
             with:
              - type: cache
                count: 65536
                with:
                 - type: gpu
                   count: 1
tasks:
- command:
  - sleep
  - '2'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0
    cwd: ""
    shell:
      options:
        rlimit:
          cpu: -1
          fsize: -1
          data: -1
          stack: -1
          core: 16384
          nofile: 128000
          as: -1
          rss: -1
          nproc: 8192
        cpu-affinity: per-task
        gpu-affinity: per-task

Which produces the following mapping:

resource-query> match allocate jobspec.yaml
      ------------------gpu4[1:x]
      ---------------L3cache0[32768:x]
      ------------------gpu5[1:x]
      ---------------L3cache1[32768:x]
      ------------group0[1:x]
      ------------------gpu2[1:x]
      ---------------L3cache2[32768:x]
      ------------------gpu3[1:x]
      ---------------L3cache3[32768:x]
      ------------group1[1:x]
      ------------------gpu6[1:x]
      ---------------L3cache4[32768:x]
      ------------------gpu7[1:x]
      ---------------L3cache5[32768:x]
      ------------group2[1:x]
      ------------------gpu0[1:x]
      ---------------L3cache6[32768:x]
      ------------------gpu1[1:x]
      ---------------L3cache7[32768:x]
      ------------group3[1:x]
      ---------socket0[1:s]
      ------tioga11[1:s]
      ---cluster0[1:s]
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================

Note the mapping group0 --> GPU4,5, group1 --> GPU2,3, group2 --> GPU6,7, group3 --> GPU0,1. (Edited to fix the jobspec.)

milroy avatar Aug 09 '23 02:08 milroy

To clarify above, the jobspec I listed requests four "groups" (should they be discovered as sockets?) each with two GPUs to illustrate the mapping @ryanday36 reported in the first comment. To get the desired resources corresponding to the flux run command above, I'd need this:

version: 1
resources:
  - type: node
    count: 1
    with:
     - type: socket
       count: 1
       with:
        - type: slot
          label: task
          count: 1
          with:
           - type: group
             count: 4
             with:
              - type: cache
                count: 32768
                with:
                 - type: gpu
                   count: 1
tasks:
- command:
  - sleep
  - '2'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 0
    cwd: ""
    shell:
      options:
        rlimit:
          cpu: -1
          fsize: -1
          data: -1
          stack: -1
          core: 16384
          nofile: 128000
          as: -1
          rss: -1
          nproc: 8192
        cpu-affinity: per-task
        gpu-affinity: per-task

Note the ugliness with handling the cache count.

milroy avatar Aug 09 '23 02:08 milroy

I'll add that my findings don't demonstrate the mapping for an actual job. They strongly suggest that Fluxion will make the correct rank mapping. I'll figure out a way to get the mapping for an actual job ASAP.

milroy avatar Aug 09 '23 05:08 milroy

@grondo, I think I figured out a way to coerce core and sched to output the mapping we want. With the environment variable HWLOC_COMPONENTS=x86 set, I generated an XML of tioga10 (lstopo --of xml tioga_node.xml). Then I can get Fluxion to load a resource graph based on the XML, passing in an allowlist so that the resulting resource graph has the locality embedded in the topology. I selected the simple match-format for legibility (which is what causes the parsing errors below):

[milroy1@tioga10]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-format=simple" flux start
[milroy1@tioga10]$ flux submit -c1 -n4 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f3AwVzTy
Sep 14 01:09:45.396952 job-manager.err[0]: cray_pals_port_distributor: Error fetching R from shell-counting future: Invalid argument
Sep 14 01:09:45.397108 job-list.err[0]: parse_R: job f3AwVzTy invalid R: '[' or '{' expected near '-'
[milroy1@tioga10]$ flux job info f3AwVzTy R
      ------------gpu4[1:x]
      ------------core15[1:x]
      ------------gpu5[1:x]
      ---------group0[1:s]
      ------------gpu2[1:x]
      ------------core31[1:x]
      ------------gpu3[1:x]
      ---------group1[1:s]
      ------------gpu6[1:x]
      ------------core47[1:x]
      ------------gpu7[1:x]
      ---------group2[1:s]
      ------------gpu0[1:x]
      ------------core63[1:x]
      ------------gpu1[1:x]
      ---------group3[1:s]
      ------tioga10[1:s]
      ---cluster0[1:s]

Note the mapping core15 --> gpu4,5 (in group 0, i.e. processor 0), and the rest of the output, which appears to respect the true physical locality.

milroy avatar Sep 14 '23 08:09 milroy

It is possible that the task-to-core mapping is not what's desired. A follow-up test very strongly suggests that the setup produces the right mapping (note the match-policy=high and match-policy=low):

[milroy1@tioga11:utilities]$ export HWLOC_COMPONENTS=x86
[milroy1@tioga11:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-policy=high" flux start
[milroy1@tioga11:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f87cF7wD
Sep 14 10:15:29.606505 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga11:utilities]$ flux job info f87cF7wD R
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "63", "gpu": "0-1"}}], "nodelist": ["tioga10"], "starttime": 1694711729, "expiration": 4848311729}}
[milroy1@tioga11:utilities]$ exit
[milroy1@tioga11:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml match-policy=low" flux start
[milroy1@tioga11:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f3K7v2eK
Sep 14 10:16:03.794322 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga11:utilities]$ flux job info f3K7v2eK R
{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}

The two tests individually produce the desired locality-aware mapping.

milroy avatar Sep 14 '23 17:09 milroy

Great! I wonder if we can write a shell plugin, activated by an -o option, to dump the topology and set the environment variable on behalf of users before launching the broker. I can try to do that a bit later.

grondo avatar Sep 14 '23 17:09 grondo

One more question: This works for a single node, but if a job has multiple nodes I assume we'll need to fetch the topology for each node and load them separately into Fluxion.

The topology of a rank can currently be fetched via the resource.topo-get RPC. The job shell uses this to fetch a copy of the hwloc topology from the enclosing instance without needing to call hwloc_topology_load(), which is very expensive. As a next step, we may want to add another option to Fluxion to fetch these XMLs from every rank directly via an RPC, instead of having to collect them into a filesystem location. Maybe this can be done via the config file instead of an environment variable, since we now have a --conf=CONFIG option in flux alloc and flux batch.

grondo avatar Sep 14 '23 17:09 grondo

One more question: This works for a single node, but if a job has multiple nodes I assume we'll need to fetch the topology for each node and load them separately into Fluxion.

I actually wouldn't go so far as to say it works for a single node. My demo above just illustrates that the mapping can be done, but the jobs themselves fail:

[milroy1@tioga10:utilities]$ FLUXION_RESOURCE_OPTIONS="load-allowlist=node,gpu,group,core load-format=hwloc load-file=tioga_node.xml" flux start
[milroy1@tioga10:utilities]$ flux submit -c1 -n1 -g2 -o cpu-affinity=per-task -o gpu-affinity=per-task sleep 1
f9rhi9V1
Sep 14 23:41:40.633830 job-list.err[0]: rlist_from_json: : Invalid argument
[milroy1@tioga10:utilities]$ flux job info f9rhi9V1 eventlog
{"timestamp":1694760100.607444,"name":"submit","context":{"userid":<>,"urgency":16,"flags":0,"version":1}}
{"timestamp":1694760100.6202316,"name":"validate"}
{"timestamp":1694760100.6313109,"name":"depend"}
{"timestamp":1694760100.6313372,"name":"priority","context":{"priority":16}}
{"timestamp":1694760100.6334589,"name":"alloc"}
{"timestamp":1694760100.6334941,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1694760100.6337798,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1694760100.6350386,"name":"exception","context":{"type":"exec","severity":0,"userid":<>,"note":"reading R: R_lite: failed to read target rank list: Invalid argument"}}
{"timestamp":1694760100.636466,"name":"release","context":{"ranks":"all","final":true}}
{"timestamp":1694760100.636601,"name":"free"}
{"timestamp":1694760100.6366169,"name":"clean"}
[milroy1@tioga10:utilities]$ flux resource R
flux-resource: ERROR: Rlist: invalid argument

In this test case, at least, there's a mismatch between the hwloc reader and rv1exec, which I think is causing the "failed to read target rank list" error.

As a next step, we may want to add another option to Fluxion to fetch these XMLs from every rank directly via an RPC, instead of having to collect them into a filesystem location.

Sorry, I'm a bit lost here. If we can use the resource.topo-get RPC, why do we need to fetch the XMLs from each rank? Or is it that the enclosing instance won't have sufficient topology information in this case, so we need Fluxion to generate the resource graph via an RPC that creates an XML with hwloc on each node?

milroy avatar Sep 15 '23 07:09 milroy

In this test case at least there's a mismatch between the hwloc reader and rv1exec which I think is causing the "failed to read target rank list" error.

Ah, ok, I see. Fluxion is creating an invalid Rv1 for the jobs:

{"version": 1, "execution": {"R_lite": [{"rank": "-1", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}

It appears Fluxion is perhaps just missing rank information in the graph? The rest of R looks fine anyway.
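For comparison, a valid Rv1 document carries a non-negative broker rank idset in each R_lite entry; hand-editing the R above just to show the expected shape:

{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0", "gpu": "4-5"}}], "nodelist": ["tioga10"], "starttime": 1694711763, "expiration": 4848311763}}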

Sorry, I'm a bit lost here. If we can use the resource.topo-get RPC why do we need to fetch the XMLs from each rank?

Each flux-core resource module only keeps the hwloc XML of its local resources. That is, there is not a way to fetch the XML for all ranks in the job with a single RPC. For now I was thinking Fluxion could send an RPC to each rank to collect the XML. Of course, as a stopgap we could perhaps have a shell plugin do this and write the XML to the job's TMPDIR, but it would be more efficient to have Fluxion do this directly. (This is just one idea of many though...)
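A rough sketch of that stopgap, using a flux exec fan-out rather than a shell plugin (the output path, and relying on flux exec to forward the environment, are assumptions):

# dump one topology XML per broker rank of the enclosing instance
export HWLOC_COMPONENTS=x86
flux exec -r all sh -c 'lstopo --of xml /tmp/topo-$(flux getattr rank).xml'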

grondo avatar Sep 15 '23 14:09 grondo

This issue came up again in the flux dev meeting yesterday, so I'm just commenting here to revive it.

Reading above, I think what we need is a way to optionally collect the hwloc topo XML for all ranks in a job and feed it to Fluxion, or an equivalent option in Fluxion to fetch that information.

@milroy or @trws: has anything changed on the Fluxion side here in the past couple years? I could spend some time experimenting to see if we could get this going again..

grondo avatar Jun 19 '25 16:06 grondo

Not substantively; we operate on the IDs as logical IDs (effectively assuming, in the absence of actual structure, that lower IDs are close to other lower IDs). We still have a direct hwloc reader, which I believe is still tested, though we're not currently using it, and we would still need a way to connect the two together. Really, what would likely be most useful in the short term would be to ensure that those logical IDs are translated back to physical IDs in exec or the shell. Having the actual locality in Fluxion would definitely be good, but with what @ryanday36 showed in the issue description, just doing that logical->physical translation, even with hwloc-calc right before doing the env var assignment, would fix it. Admittedly not fix it the long-term "right" way, but we'll need that logical-to-physical translation anyway, because we don't want to be tied to whatever crazy variable numbering cuda or hip use from 1pm to 2pm on alternate Thursdays.

(if this doesn't make much sense, I just realized I'm kinda punchy, just got out of 8 hours of C++ committee followed by 4 hours of SC25 program committee meeting, probably time to go to sleep...)

trws avatar Jun 19 '25 17:06 trws

Really what would likely be most useful in the short term would be to ensure that those logical IDs are translated back to physical IDs in exec or the shell

@trws: I was going to work on this, but I'm afraid I'm not quite understanding how exactly you mean to translate the logical id to the physical id. Looking at hwloc on Tuolumne, for example, using this program:

#include <stdio.h>
#include <hwloc.h>

int main (int ac, char **av)
{
    hwloc_obj_t obj = NULL;
    hwloc_topology_t topo;

    hwloc_topology_init (&topo);
    hwloc_topology_set_io_types_filter (topo,
                                        HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load (topo);

    while ((obj = hwloc_get_next_osdev(topo, obj)) != NULL) {
        if (obj->attr->osdev.type == HWLOC_OBJ_OSDEV_GPU)
            printf ("GPU: %s: subtype=%s ", obj->name, obj->subtype);
        else if (obj->attr->osdev.type == HWLOC_OBJ_OSDEV_COPROC)
            printf ("Coproc: %s: subtype=%s ", obj->name, obj->subtype);
        else
            continue;
        printf ("logical_index=%u os_index=%u sibling_rank=%u depth=%d\n",
                obj->logical_index,
                obj->os_index,
                obj->sibling_rank,
                obj->depth);
    }

    hwloc_topology_destroy (topo);
}

I'm getting:

GPU: rsmi0: subtype=RSMI logical_index=1 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi1: subtype=RSMI logical_index=4 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi2: subtype=RSMI logical_index=38 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi3: subtype=RSMI logical_index=40 os_index=4294967295 sibling_rank=0 depth=-6

whereas on tioga I see:

GPU: rsmi4: subtype=RSMI logical_index=0 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi5: subtype=RSMI logical_index=2 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi2: subtype=RSMI logical_index=20 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi3: subtype=RSMI logical_index=22 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi6: subtype=RSMI logical_index=23 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi7: subtype=RSMI logical_index=25 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi0: subtype=RSMI logical_index=26 os_index=4294967295 sibling_rank=0 depth=-6
GPU: rsmi1: subtype=RSMI logical_index=28 os_index=4294967295 sibling_rank=0 depth=-6

Obviously the logical_index isn't helpful here. Is the strategy to iterate the GPU osdev objects and then use the id in the suffix of the device name as the physical id? Is there a member in the osdev where I can get this?

grondo avatar Aug 16 '25 00:08 grondo

I see what you mean: it's replicating them all because there are several nodes labeled "GPU" for each physical part. The device name indexes are the indexes that AMD actually uses. We can go from index to osdev pretty easily with hwloc_rsmi_get_device_osdev, but it looks like there's not a good way to go from osdev to GPU id, or even to get a logical index for the device, because they're merged with other PCI devices; the closest thing is the number on the "card" entry that's a sibling under the PCI device. Let me pick at this a bit and see if I can come up with a better way to do the translation we want. If it comes down to it, we could iterate over the ROCm IDs, sort them by their cpusets or something, and add an attribute for what we want, but that's a lot more work than should be necessary for this. I made the poor assumption that there would be an easy translation like for cuda devices.

trws avatar Aug 19 '25 23:08 trws

Actually, I think it comes down to this: we have the index into the list of RSMI devices that we want to pull, and it looks like we have to iterate over them to find it (we can probably also use a filter to get there, or cousin walking, or something smarter that I can't think of right now). Once we have it, we can use the suffix of the name (which is utterly insane, but it seems there isn't a field for it for some crazy reason), or we can use the AMDUUID=a30d0d26ba377c92 field from the osdev attributes and put that directly into ROCM_VISIBLE_DEVICES. That value is a lot less pretty, but it's stable, so no matter how many times that variable gets applied into child processes and restricted environments it stays correct, whereas ROCM_VISIBLE_DEVICES=3 becomes an error if it's applied twice.
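Putting that together, a minimal sketch built on the program above (assuming the RSMI subtype and AMDUUID info key appear as shown; whether ROCM_VISIBLE_DEVICES actually accepts the UUID form is something to verify against the ROCm runtime):

#include <stdio.h>
#include <string.h>
#include <hwloc.h>

int main (void)
{
    hwloc_obj_t obj = NULL;
    hwloc_topology_t topo;

    hwloc_topology_init (&topo);
    hwloc_topology_set_io_types_filter (topo,
                                        HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load (topo);

    while ((obj = hwloc_get_next_osdev (topo, obj)) != NULL) {
        /* only the RSMI GPU os devices */
        if (obj->attr->osdev.type != HWLOC_OBJ_OSDEV_GPU
            || !obj->subtype
            || strcmp (obj->subtype, "RSMI") != 0)
            continue;
        /* physical id from the device name suffix, e.g. "rsmi4" -> 4 */
        int physid = -1;
        sscanf (obj->name, "rsmi%d", &physid);
        /* stable UUID recorded by the RSMI backend as an info attribute */
        const char *uuid = hwloc_obj_get_info_by_name (obj, "AMDUUID");
        printf ("%s: physical id=%d AMDUUID=%s\n",
                obj->name, physid, uuid ? uuid : "(none)");
    }
    hwloc_topology_destroy (topo);
    return 0;
}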

trws avatar Aug 19 '25 23:08 trws