
MIG device support for hpc_job metric labels

Open jbrobstw opened this issue 1 year ago • 6 comments

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

Please provide a clear description of the problem this feature solves

Currently the hpc_job metric label can only be applied to whole GPUs, even though metrics are emitted for individual MIG partitions on MIG-enabled GPUs. So when a job runs on only one 1g MIG partition, all of the metrics associated with that GPU get the label for that job, and those metrics may be duplicated if a separate job is running on a different partition of the same GPU. Ideally there would be an option to apply the hpc_job label more granularly on MIG-enabled GPUs, which would allow cleaner queries of those metrics when necessary (e.g. on a Slurm HPC cluster).

Feature Description

On machines with MIG-enabled GPUs where the executable is called with --hpc-job-mapping-dir=<HPC_DIR> or DCGM_HPC_JOB_MAPPING_DIR=<HPC_DIR> is set, and the <HPC_DIR> directory contains files whose names reflect both the GPU and the MIG partition of that GPU (e.g. 0.2, 1.3, etc.) and whose contents are one job ID per line, dcgm-exporter shall set the hpc_job label only on metrics with a matching gpu label and a matching MIG partition ID label, parsed from the name of the file (e.g. 0.2 for gpu="0" and GPU_I_ID="2"), for each job ID contained in the file. GPU_I_ID is currently set for metrics on MIG devices and would probably work just fine, but my instinct would be to implement a label associated with the EntityID as part of this change and use that instead.
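
For illustration, here is a minimal, hypothetical sketch (not code from this repository) of how a mapping file name in the GPU or GPU.GPU_I_ID convention could be validated and used as the lookup key; parseMappingFileName is an illustrative name only.

package main

import (
        "fmt"
        "strconv"
        "strings"
)

// parseMappingFileName accepts file names of the form "<GPU>" or
// "<GPU>.<GPU_I_ID>" and returns the name unchanged as the key used
// to match metrics; anything else is rejected.
func parseMappingFileName(name string) (string, error) {
        parts := strings.Split(name, ".")
        if len(parts) > 2 {
                return "", fmt.Errorf("%q does not match the GPU or GPU.GPU_I_ID convention", name)
        }
        for _, p := range parts {
                if _, err := strconv.Atoi(p); err != nil {
                        return "", fmt.Errorf("%q does not match the GPU or GPU.GPU_I_ID convention", name)
                }
        }
        return name, nil
}

func main() {
        for _, name := range []string{"0", "0.2", "1.3", "not-a-gpu"} {
                key, err := parseMappingFileName(name)
                if err != nil {
                        fmt.Println("skipping:", err)
                        continue
                }
                fmt.Println("mapping file", name, "-> job map key", key)
        }
}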

Describe your ideal solution

Either parse filenames in the form GPU.GPU_I_ID and apply the label accordingly (code below), or keep the filename parsing the same and add extra parsing of the file contents that allows specifying the MIG partition ID alongside the job ID (e.g. file 0 contains the line 9: jobid42, so hpc_job="jobid42" would get applied to metrics with gpu="0" and GPU_I_ID="9").
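
And a minimal sketch of the second option, assuming a hypothetical per-line format "<GPU_I_ID>: <jobid>" where a bare job ID means the whole GPU; parseJobLine is an illustrative helper, not part of dcgm-exporter.

package main

import (
        "bufio"
        "fmt"
        "strings"
)

// parseJobLine splits one line of a per-GPU mapping file. An optional
// "<GPU_I_ID>:" prefix scopes the job to a single MIG instance; a bare
// job ID applies to the whole GPU.
func parseJobLine(line string) (gpuInstanceID, jobID string) {
        if before, after, found := strings.Cut(line, ":"); found {
                return strings.TrimSpace(before), strings.TrimSpace(after)
        }
        return "", strings.TrimSpace(line)
}

func main() {
        // Example contents of mapping file "0" under this format.
        contents := "9: jobid42\njobid7\n"
        scanner := bufio.NewScanner(strings.NewReader(contents))
        for scanner.Scan() {
                instance, job := parseJobLine(scanner.Text())
                if instance != "" {
                        fmt.Printf("gpu=\"0\" GPU_I_ID=%q -> hpc_job=%q\n", instance, job)
                } else {
                        fmt.Printf("gpu=\"0\" (all instances) -> hpc_job=%q\n", job)
                }
        }
}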

Additional context

Here's the diff for a small change I made to implement this on our cluster; it is currently running as expected.

diff --git a/pkg/dcgmexporter/hpc.go b/pkg/dcgmexporter/hpc.go
index e360b09..61a95c3 100644
--- a/pkg/dcgmexporter/hpc.go
+++ b/pkg/dcgmexporter/hpc.go
@@ -18,6 +18,7 @@ package dcgmexporter

 import (
        "bufio"
+       "fmt"
        sysOS "os"
        "path"
        "strconv"
@@ -73,7 +74,7 @@ func (p *hpcMapper) Process(metrics MetricsByCounter, sysInfo SystemInfo) error
        for counter := range metrics {
                var modifiedMetrics []Metric
                for _, metric := range metrics[counter] {
-                       jobs, exists := gpuToJobMap[metric.GPU]
+                       jobs, exists := gpuToJobMap[getJobMapID(metric)]
                        if exists {
                                for _, job := range jobs {
                                        modifiedMetric, err := deepCopy(metric)
@@ -146,7 +147,7 @@ func getGPUFiles(dirPath string) ([]string, error) {
                        continue // Skip directories
                }

-               _, err = strconv.Atoi(file.Name())
+               _, err = strconv.ParseFloat(file.Name(), 64)
                if err != nil {
                        logrus.Debugf("HPC mapper: file %q name doesn't match with GPU ID convention", file.Name())
                        continue
@@ -156,3 +157,10 @@ func getGPUFiles(dirPath string) ([]string, error) {

        return mappingFiles, nil
 }
+
+func getJobMapID(m Metric) (string) {
+       if m.MigProfile != "" {
+               return fmt.Sprintf("%s.%s", m.GPU, m.GPUInstanceID)
+       }
+       return m.GPU
+}
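
For reference, a self-contained sketch of how the getJobMapID helper from the diff behaves; the Metric struct below is a stand-in that reproduces only the fields the helper reads (it assumes GPU, MigProfile, and GPUInstanceID are string fields, as on the exporter's metric type).

package main

import "fmt"

// Metric is a stand-in for the exporter's metric type; only the fields
// read by getJobMapID are reproduced here.
type Metric struct {
        GPU           string
        MigProfile    string
        GPUInstanceID string
}

// getJobMapID mirrors the helper added in the diff: metrics from MIG
// devices are keyed by "<GPU>.<GPU_I_ID>", whole-GPU metrics by "<GPU>".
func getJobMapID(m Metric) string {
        if m.MigProfile != "" {
                return fmt.Sprintf("%s.%s", m.GPU, m.GPUInstanceID)
        }
        return m.GPU
}

func main() {
        mig := Metric{GPU: "0", MigProfile: "1g.10gb", GPUInstanceID: "13"}
        whole := Metric{GPU: "2"}
        fmt.Println(getJobMapID(mig))   // "0.13" -> matched against mapping file 0.13
        fmt.Println(getJobMapID(whole)) // "2"    -> matched against mapping file 2
}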

And here's a sample of the metrics from one of our machines. You can see job 2115078 requested four 1g.10gb partitions, and the rest of the metrics for GPUs 0 and 1 are not marked with that job ID (and also that whoever submitted that job needs some training on resource utilization).

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.784913
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="14",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115116"} 0.290336
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-0ba88705-8527-ed9e-bc07-1b45788d4ef9",pci_bus_id="00000000:01:00.0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115107"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="14",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115078"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115116"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-de420485-0d50-8d33-81e9-e6fa6c1d0a00",pci_bus_id="00000000:41:00.0",device="nvidia1",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.709745
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="14",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115115"} 0.290853
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-05fc3ac0-23d4-0a3b-d3ea-01f7f303efbe",pci_bus_id="00000000:81:00.0",device="nvidia2",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115081"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08",hpc_job="2115115"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-8c50fdca-3c3e-29c1-9e95-ba571dd1b382",pci_bus_id="00000000:C1:00.0",device="nvidia3",modelName="NVIDIA A100-SXM4-80GB",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",Hostname="<HOSTNAME>",DCGM_FI_DRIVER_VERSION="545.23.08"} 0.000000

jbrobstw avatar Jul 30 '24 21:07 jbrobstw

@jbrobstw , Will this approach work for other clusters? It would be interesting to hear the opinions of other HPC users.

In general, I think the format with filenames GPU or GPU.GPU_I_ID is a good idea. We also accept pull requests :)

nvvfedorov avatar Jul 30 '24 21:07 nvvfedorov

@nvvfedorov, I don't see why it wouldn't work on other clusters (although my knowledge of Kubernetes is limited), and it should almost certainly work on other Slurm clusters. The way it's implemented above makes it optional even on MIG-enabled GPUs (it accepts both file-naming conventions simultaneously, for instance), so at the very least it should have no impact on anybody using it as-is.

I am not well-versed in Go and wasn't confident writing the tests for this change, hence the issue rather than a pull request. Plus there were still some open questions about the exact implementation.

jbrobstw avatar Jul 30 '24 21:07 jbrobstw

Hi @jbrobstw, I went through the issue and am looking to contribute here, though I need some time for more clarification and understanding. I wanted to know if I can take this up as my first issue here; if you have any suggestions, let me know :) Thanks!

Irene-123 avatar Aug 10 '24 18:08 Irene-123

Hi @jbrobstw ,

We encountered the same issue as you did. Here’s how we addressed it:

https://github.com/myelintek/dcgm-exporter/pull/1/files

We can now use the format <gpu_id>-<gpu_instance_id> as the HPC job file name, which allows us to map the HPC job to the GPU instance.

wewe0901 avatar Aug 29 '24 01:08 wewe0901

@nvvfedorov can we pull in the changes from the above PR? It would be really helpful to have this feature.

vishpat avatar Nov 19 '25 20:11 vishpat

@vishpat this repo can accept PRs. Please create a PR from the above.

glowkey avatar Nov 19 '25 20:11 glowkey