
Expose Container info for MIG enabled GPU

Open krishh85 opened this issue 11 months ago • 84 comments

Currently it doesn't seem like container/pod/namespace information is emitted by dcgm-exporter when MIG is enabled on a GPU. This is important when we need to aggregate GPU utilization across containers/cgroups. The container info seems to be emitted only for GPUs that have MIG disabled.

Version info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
+-----------------------------------------------------------------------------+

dcgm_fi_prof_gr_engine_active{gpu="0",uuid="GPU-ed1353d6-52ba-8793-7230-4d5d3eb68167",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="0",uuid="GPU-ed1353d6-52ba-8793-7230-4d5d3eb68167",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="1",uuid="GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb",device="nvidia1",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="1",uuid="GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb",device="nvidia1",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.001359 dcgm_fi_prof_gr_engine_active{gpu="2",uuid="GPU-2dddf823-c00c-0fd6-3e84-ad84d322de02",device="nvidia2",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="2",uuid="GPU-2dddf823-c00c-0fd6-3e84-ad84d322de02",device="nvidia2",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="4",uuid="GPU-931446e6-0413-2864-5b76-11227dcda4ae",device="nvidia4",modelName="NVIDIA A100 80GB PCIe",container_name="-grpc-pod",pod_name="d85np-t6mdg",pod_namespace="ai-1"} 0.008246 dcgm_fi_prof_gr_engine_active{gpu="5",uuid="GPU-b3c602ed-2a60-c7e6-a685-a4a4ef0fb831",device="nvidia5",modelName="NVIDIA A100 80GB PCIe",container_name="language-models",pod_name="64bl2-8rsmr-tnrmr",pod_namespace="ai-7"} 0.000000 dcgm_fi_prof_gr_engine_active{gpu="6",uuid="GPU-3f5d3fb2-5308-160b-ebe4-cf4c3c1e8b1d",device="nvidia6",modelName="NVIDIA A100 80GB PCIe",container_name=“llm",pod_name="-8lrkr-xqd9x",pod_namespace="ai-2"} 0.151145 dcgm_fi_prof_gr_engine_active{gpu="7",uuid="GPU-d41aa3aa-2bf3-27e9-1a2d-4f54cca57cbb",device="nvidia7",modelName="NVI

krishh85 avatar Feb 27 '24 20:02 krishh85

@krishh85 , pod names and namespaces will be available when a pod runs a workload and uses the GPU. By default, the dcgm-exporter returns empty strings when it reports metrics read from the K8S node and there are no GPUs assigned to pods.

When you see a metric like dcgm_fi_prof_gr_engine_active with pod_name="abc", pod_namespace="default", gpu="5", this can be read as: dcgm_fi_prof_gr_engine_active had a metric value X for gpu=5 while the pod named "abc" was assigned to gpu=5.

nvvfedorov avatar Feb 27 '24 21:02 nvvfedorov

@nvvfedorov right, the question is specific to MIG instances, like the metric below (dcgm_fi_prof_gr_engine_active) where there is a non-zero value for the GPU's MIG instance (which I assume indicates the GPU is being used and pods are assigned), but there is no associated container/pod info. As you mentioned, the container/pod info is present ONLY on GPUs that have MIG disabled.

dcgm_fi_prof_gr_engine_active{gpu="1",uuid="GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb",device="nvidia1",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="2",c**ontainer_name="",pod_name="",pod_namespace=""**} 0.001359

krishh85 avatar Feb 27 '24 22:02 krishh85

@nvvfedorov Any update on this? Thanks

krishh85 avatar Feb 29 '24 00:02 krishh85

@nvvfedorov we also ran a load test to simulate traffic for 30 minutes and observed that none of the MIG metrics had container_name, pod_name, or pod_namespace info.

Can you share an example where this has ever been populated? Thanks

dcgm_fi_prof_dram_active{gpu="0",uuid="GPU-bd8c4894-2c42-9838-f62a-659295d7c665",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.023462

krishh85 avatar Mar 01 '24 20:03 krishh85

@krishh85 , Can you provide details on how you ran the tests and what the output was?

nvvfedorov avatar Mar 01 '24 20:03 nvvfedorov

@nvvfedorov

  1. Ran a script that captures dcgm-exporter metrics from the localhost /metrics endpoint.
  2. Set up inference requests against a model served from a host. The host is an A100 GPU node with GPUs 0-3 MIG-enabled. Each GPU has 2 slices with 2 different memory profiles (seen in the dcgm-exporter output).
  3. The inference was served by a single GPU/slice (gpu == 2 and gpu_i_id == 2), for which the metric "dcgm_fi_prof_gr_engine_active" increased to 0.16 for the duration of the test. All other metrics (dcgm_fi_prof_dram_active, etc.) were also missing container info.

nvidia-smi Processes output:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                 GPU Memory  |
|        ID   ID                                                  Usage       |
|=============================================================================|
|    1    2    0    3023233      C   ..._classification_inference     1026MiB |
|    1    2    0    3023234      C   ..._classification_inference     1026MiB |
|    1    2    0    3023255      C   ..._classification_inference     1026MiB |
|    1    2    0    3023259      C   ..._classification_inference     1026MiB |
|    1    2    0    3023260      C   ..._classification_inference     1026MiB |
|    1    2    0    3023261      C   ..._classification_inference     1026MiB |
|    2    1    0    4174369      C   ...:resume_education_vm0m0p1     5486MiB |
|    2    1    0    4174370      C   ...:resume_education_vm0m0p1     5486MiB |
|    2    1    0    4174371      C   ...:resume_education_vm0m0p1     5486MiB |
|    2    2    0    3094139      C   ...oyment::create_embeddings      380MiB |
|    2    2    0    3094140      C   ....handle_request_streaming      380MiB |
|    3    1    0     507462      C   ...erveDeployment::inference     2796MiB |
|    3    1    0     507463      C   ...erveDeployment::inference     2796MiB |
|    3    1    0     507464      C   ...erveDeployment::inference     2796MiB |
|    4  N/A  N/A      61391      C   ...erveDeployment::inference    18624MiB |
|    4  N/A  N/A      61392      C   ....handle_request_streaming    21860MiB |
|    4  N/A  N/A      61393      C   ...erveDeployment::inference    22000MiB |
|    6  N/A  N/A    3976023      C   ...6/tensorflow_model_server    79910MiB |
+-----------------------------------------------------------------------------+

Output from dcgm-exporter

https://gist.github.com/krishh85/6440d4efb6d158e40b035fefe4f70438

krishh85 avatar Mar 01 '24 22:03 krishh85

@nvvfedorov Any update on this? It should be a simple test to see if it works as expected on your end, and if it does, we can check whether this is something specific to our config, which I doubt, since we haven't made any changes to dcgm-exporter.

krishh85 avatar Mar 05 '24 22:03 krishh85

@nvvfedorov Based on the code, it seems like this is disabled for MIG resource names. Can you please confirm, and if so, is there any reason why this is not supported?

krishh85 avatar Mar 08 '24 21:03 krishh85

@krishh85, Can you provide more details about your environment configuration? What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration? I am especially interested in the MIG_STRATEGY configuration: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configuration-option-details.

Also, please open a shell in the dcgm-exporter container and run the following command: dcgmi discovery -l

nvvfedorov avatar Mar 08 '24 23:03 nvvfedorov

@nvvfedorov , Added the details. We use the mixed strategy with 2 MIG slices (3g.40gb & 4g.40gb) on 4 GPUs:

env:
  - name: MIG_STRATEGY
    value: mixed
  - name: NVIDIA_MIG_MONITOR_DEVICES
    value: all
  - name: PASS_DEVICE_SPECS
    value: "true"

device_plugin yaml file:

dcgmi discovery and nvidia-smi output

krishh85 avatar Mar 09 '24 07:03 krishh85

@krishh85 , Thank you for the details. If you have access to the K8S node where you run the workload, can you try building https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main and running the client on the node? Unfortunately, kubectl doesn't provide a command to list the "k8s.io/kubelet/pkg/apis/podresources/v1alpha1" API :(

As output of the command, you should see a response something like this:

{
  "pod_resources": [
    {
      "name": "cuda-vector-add",
      "namespace": "default",
      "containers": [
        {
          "name": "cuda-vector-add",
          "devices": [
            {
              "resource_name": "nvidia.com/gpu",
              "device_ids": [
                "GPU-b9f9e81b-bee7-34bc-af17-132ef6592740"
              ]
            }
          ]
        }
      ]
    }
  ]
}

I am interested in seeing entries with "resource_name": "nvidia.com/gpu" or "resource_name": "nvidia.com/mig-".
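
For anyone who cannot build that tool, below is a minimal Go sketch of such a podresources client. It is only an illustration: it assumes the kubelet pod-resources socket is at its common default path (/var/lib/kubelet/pod-resources/kubelet.sock) and uses the generated types from k8s.io/kubelet/pkg/apis/podresources/v1alpha1.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

func main() {
	// Default kubelet pod-resources socket; adjust if your kubelet is configured differently.
	const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

	conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to connect to the kubelet pod-resources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List() failed: %v", err)
	}

	// Print every device allocated to a pod; the entries of interest here have
	// resource_name "nvidia.com/gpu" or "nvidia.com/mig-*".
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				fmt.Printf("%s/%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}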

nvvfedorov avatar Mar 11 '24 16:03 nvvfedorov

@nvvfedorov I doubt I will be able to do that, as I won't be able to download and run external packages on hosts without several reviews. Since you know the setup, would you be able to test it on your end? At this point it seems like we are still trying to figure out whether this is simply not supported or whether it is a bug.

krishh85 avatar Mar 11 '24 18:03 krishh85

@krishh85 , Thank you! It will take time, but we can reproduce it on our end.

nvvfedorov avatar Mar 11 '24 18:03 nvvfedorov

@krishh85, Unfortunately, I didn't manage to reproduce the issue:

Here are the steps that I tried:

  1. Enable MIG and create a MIG instance: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-nvidia-smi-output

  2. Run workload: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-workload-yaml

  3. Read the nvidia-device-plugin logs:

kubectl -n kube-system logs nvidia-device-plugin-daemonset-hrszl

Output: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-nvidia-device-plugin-log

  4. Query metrics: curl -v http://localhost:9400/metrics. Output: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-metrics-output-out

As I can see, pod, namespace, and container names are in the metrics output.

nvvfedorov avatar Mar 13 '24 23:03 nvvfedorov

@krishh85 , I think you need to check the nvidia-device-plugin logs. If the NVIDIA device plugin doesn't see the MIG instances and configuration, the dcgm-exporter cannot map metrics to pods.

nvvfedorov avatar Mar 13 '24 23:03 nvvfedorov

@nvvfedorov I have attached the nvidia-device-plugin logs and don't really see anything different, except that you are running the device plugin as a container in a pod whereas we run it as a daemonset. ref: nvidia-device-plugin logs

The resources also seem to be the same. Would that mean the k8s API response is the same as what is defined/registered in the device plugin? Do we still need to run the debugging tool? Example: { "pattern": "3g.40gb", "name": "nvidia.com/mig-3g.40gb" }

krishh85 avatar Mar 14 '24 00:03 krishh85

Please check whether the dcgm-exporter runs with the "--kubernetes" parameter. How was the dcgm-exporter deployed?

nvvfedorov avatar Mar 14 '24 00:03 nvvfedorov

@krishh85, There are two ways to enable kubernetes support in the dcgm-exporter:

  1. Set the environment variable DCGM_EXPORTER_KUBERNETES=true.
  2. Pass the command line parameter "--kubernetes".

Please check that the dcgm-exporter runs with one of those options.

nvvfedorov avatar Mar 14 '24 00:03 nvvfedorov

@krishh85 , Instead of running the debug tool, you can add logrus.Infof("deviceToPod: %+v", deviceToPod) here: https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L78, then rebuild and deploy the dcgm-exporter image.
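
To make the suggestion concrete, the sketch below shows the kind of map that log line would print. The PodInfo type and the example map contents are simplified, hypothetical stand-ins (the real structures live in pkg/dcgmexporter/kubernetes.go and differ upstream), with sample values borrowed from the metrics earlier in this thread.

package main

import (
	"github.com/sirupsen/logrus"
)

// PodInfo is a simplified, hypothetical stand-in for the pod/container info the
// dcgm-exporter associates with each device ID; the upstream struct differs.
type PodInfo struct {
	Name      string
	Namespace string
	Container string
}

func main() {
	// Hypothetical example of the map built from the podresources response:
	// device ID (GPU UUID or MIG instance ID) -> pod info.
	deviceToPod := map[string]PodInfo{
		"GPU-931446e6-0413-2864-5b76-11227dcda4ae": {
			Name:      "d85np-t6mdg",
			Namespace: "ai-1",
			Container: "-grpc-pod",
		},
	}

	// The suggested debug statement: dumping this map on every scrape shows
	// whether MIG device IDs ever appear as keys.
	logrus.Infof("deviceToPod: %+v", deviceToPod)
}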

nvvfedorov avatar Mar 14 '24 00:03 nvvfedorov

@nvvfedorov The current args don't seem to include the kubernetes flag. I can enable it, but I would then expect even non-MIG GPUs to have pod info missing, which isn't the case (I couldn't find this info in the dcgm-exporter code), but I can certainly enable it.

arguments: [ "--collectors", "/etc/dcgm-exporter/1.x-compatibility-metrics.csv", "--use-old-namespace", "--no-hostname" ]

krishh85 avatar Mar 14 '24 00:03 krishh85

@nvvfedorov It seems like the only reference to the kubernetes flag is here -> https://github.com/NVIDIA/dcgm-exporter/blob/603e44da91269302e51f309c06c05689c8a49fd6/pkg/dcgmexporter/pipeline.go#L122

krishh85 avatar Mar 14 '24 01:03 krishh85

The flag is defined here: https://github.com/NVIDIA/dcgm-exporter/blob/603e44da91269302e51f309c06c05689c8a49fd6/pkg/cmd/app.go#L112

You are right, the line you pointed to uses the flag value to enable mapping to pods.

Thank you,
Vadym Fedorov

nvvfedorov avatar Mar 14 '24 02:03 nvvfedorov

@nvvfedorov Even though the code sets the default to false, I checked the env variable and it is set to true. That also doesn't explain why non-MIG GPUs emit the pod resources; i.e., if it were false, I would expect it to not emit any pod resources regardless of MIG status.

root@ltx1-hcl40912:/# printenv | grep DCGM_EXPORTER_KUBERNETES
DCGM_EXPORTER_KUBERNETES=true

krishh85 avatar Mar 14 '24 18:03 krishh85

@krishh85 , Please explain what you mean by "non-MIG GPUs emit the pod resources".

nvvfedorov avatar Mar 14 '24 19:03 nvvfedorov

@nvvfedorov I meant GPUs that do not have MIG enabled. If you look at the dcgm-exporter output, you will notice that GPUs 4-7 do have pod resources being published. The '--kubernetes' flag doesn't seem to be specific to MIG.

Example: dcgm_fi_prof_gr_engine_active{gpu="6",uuid="GPU-3f5d3fb2-5308-160b-ebe4-cf4c3c1e8b1d",device="nvidia6",modelName="NVIDIA A100 80GB PCIe",container_name=“llm",pod_name="-8lrkr-xqd9x",pod_namespace="ai-2"} 0.151145

krishh85 avatar Mar 14 '24 20:03 krishh85

@krishh85 , Here are the metrics from your message:

dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000

dcgm_fi_prof_gr_engine_active{gpu="4",uuid="GPU-931446e6-0413-2864-5b76-11227dcda4ae",device="nvidia4",modelName="NVIDIA A100 80GB PCIe",container_name="-grpc-pod",pod_name="d85np-t6mdg",pod_namespace="ai-1"} 0.008246

As we see from the metrics output:

  • There is no pod associated with gpu=3 =>
dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
  • There is the pod "d85np-t6mdg" created in the namespace "ai-1" and associated with gpu=4 =>
dcgm_fi_prof_gr_engine_active{gpu="4",uuid="GPU-931446e6-0413-2864-5b76-11227dcda4ae",device="nvidia4",modelName="NVIDIA A100 80GB PCIe",container_name="-grpc-pod",pod_name="d85np-t6mdg",pod_namespace="ai-1"} 0.008246

For non-empty "pod_namespace", "pod_name", and "container_name" values, a pod with a gpu=3 resource must exist. If not, the dcgm-exporter returns empty values for these labels. These labels depend only on the existence of the pod. If the pod doesn't exist, the dcgm-exporter continues to read metrics from the GPU but will not be able to populate the "pod_namespace", "pod_name", and "container_name" values. This behavior doesn't depend on the MIG configuration.

nvvfedorov avatar Mar 14 '24 21:03 nvvfedorov

@nvvfedorov It feels like we are going around in circles. If you check the dcgm-exporter log and some of the initial comments, for example the one below:

dcgm_fi_prof_dram_active{gpu="0",uuid="GPU-bd8c4894-2c42-9838-f62a-659295d7c665",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.023462]

Here we have gpu=0 with GPU_I_ID=1, where the pod resources are "empty" but the utilization is 0.023462. I.e., inference requests were served from this MIG-enabled GPU, and that couldn't have been possible without a pod/container allocated to GPU 0. Is this understanding correct?

krishh85 avatar Mar 14 '24 21:03 krishh85

@krishh85, here is what could happen: the pod was deleted, but the metric value was available.

nvvfedorov avatar Mar 14 '24 22:03 nvvfedorov

The dcgm-exporter reads metric values from the driver and directly from the GPU. The driver and the GPU don't know anything about pods or the environment.

When we read a metric, we try to map metric attributes such as the GPU to a pod by reading podresources and matching by GPU ID. If podresources doesn't return a matching pod, the dcgm-exporter produces empty pod labels.
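
A rough sketch of that matching step is below. It is illustrative only (the real logic is in pkg/dcgmexporter/kubernetes.go and differs in detail), and the type and helper names are invented for this example. The point is that the lookup is keyed by the device ID the kubelet reports, which for a MIG slice under the mixed strategy comes from a nvidia.com/mig-* resource rather than nvidia.com/gpu; if the identifier attached to the metric doesn't match one of these keys, the pod labels stay empty.

package sketch

import (
	"strings"

	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

// PodInfo is a simplified stand-in for the pod labels dcgm-exporter attaches to a metric.
type PodInfo struct {
	Name      string
	Namespace string
	Container string
}

// buildDeviceToPod flattens the kubelet podresources response into a
// device-ID -> pod lookup, keeping both full-GPU and MIG resources.
// Illustrative only; this is not the upstream dcgm-exporter implementation.
func buildDeviceToPod(resp *podresourcesapi.ListPodResourcesResponse) map[string]PodInfo {
	deviceToPod := map[string]PodInfo{}
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				name := dev.GetResourceName()
				// With the mixed strategy, MIG slices are requested as
				// nvidia.com/mig-<profile>, e.g. nvidia.com/mig-3g.40gb.
				if name != "nvidia.com/gpu" && !strings.HasPrefix(name, "nvidia.com/mig-") {
					continue
				}
				for _, id := range dev.GetDeviceIds() {
					// For a MIG slice this ID identifies the MIG instance rather than
					// the parent GPU UUID; the metric side must use the same identifier,
					// otherwise the lookup misses and the pod labels come back empty.
					deviceToPod[id] = PodInfo{
						Name:      pod.GetName(),
						Namespace: pod.GetNamespace(),
						Container: container.GetName(),
					}
				}
			}
		}
	}
	return deviceToPod
}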

nvvfedorov avatar Mar 14 '24 22:03 nvvfedorov

@nvvfedorov , we ran load tests, observed them over a period of 10-15 minutes while increasing the requests 2x, and saw the metric/utilization values increase as well. Also, we have several apps running in production with this MIG setup, so the pods can't have been deleted, and we have never seen pod resources exposed on those MIG GPUs.

krishh85 avatar Mar 14 '24 22:03 krishh85