dcgm-exporter
Expose Container info for MIG enabled GPU
Currently it doesn't seem like container/pod/namespace information is emitted from dcgm-exporter when MIG is enabled on a GPU. This is important when we need to aggregate GPU utilization across containers/cgroups. The container info seems to be emitted only on GPUs that have MIG disabled.
Version Info:

```
NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0
```
```
dcgm_fi_prof_gr_engine_active{gpu="0",uuid="GPU-ed1353d6-52ba-8793-7230-4d5d3eb68167",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="0",uuid="GPU-ed1353d6-52ba-8793-7230-4d5d3eb68167",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="1",uuid="GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb",device="nvidia1",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="1",uuid="GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb",device="nvidia1",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.001359
dcgm_fi_prof_gr_engine_active{gpu="2",uuid="GPU-2dddf823-c00c-0fd6-3e84-ad84d322de02",device="nvidia2",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="2",uuid="GPU-2dddf823-c00c-0fd6-3e84-ad84d322de02",device="nvidia2",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="4",uuid="GPU-931446e6-0413-2864-5b76-11227dcda4ae",device="nvidia4",modelName="NVIDIA A100 80GB PCIe",container_name="-grpc-pod",pod_name="d85np-t6mdg",pod_namespace="ai-1"} 0.008246
dcgm_fi_prof_gr_engine_active{gpu="5",uuid="GPU-b3c602ed-2a60-c7e6-a685-a4a4ef0fb831",device="nvidia5",modelName="NVIDIA A100 80GB PCIe",container_name="language-models",pod_name="64bl2-8rsmr-tnrmr",pod_namespace="ai-7"} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="6",uuid="GPU-3f5d3fb2-5308-160b-ebe4-cf4c3c1e8b1d",device="nvidia6",modelName="NVIDIA A100 80GB PCIe",container_name="llm",pod_name="-8lrkr-xqd9x",pod_namespace="ai-2"} 0.151145
dcgm_fi_prof_gr_engine_active{gpu="7",uuid="GPU-d41aa3aa-2bf3-27e9-1a2d-4f54cca57cbb",device="nvidia7",modelName="NVI
```
@krishh85 , Pod names and namespaces will be available when a pod runs a workload and uses the GPU. By default, dcgm-exporter returns empty strings when it reports metrics read from the K8S node and there aren't GPUs assigned to pods.
When you see a metric like dcgm_fi_prof_gr_engine_active with pod_name="abc", pod_namespace="default", gpu="5", it can be read as: dcgm_fi_prof_gr_engine_active had metric value X for gpu=5 while the pod named "abc" was assigned to gpu=5.
@nvvfedorov Right, the question was specific to MIG instances, like the metric below (dcgm_fi_prof_gr_engine_active) where there is a non-zero value (which I assume indicates the GPU is being used and a pod is assigned) for the GPU MIG instance, but there isn't any associated container/pod info. As you mentioned, the container/pod info is present ONLY on GPUs which have MIG disabled.
dcgm_fi_prof_gr_engine_active{gpu="1",uuid="GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb",device="nvidia1",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="2",c**ontainer_name="",pod_name="",pod_namespace=""**} 0.001359
@nvvfedorov Any update on this? Thanks
@nvvfedorov We also ran a load test to simulate the traffic for a period of time (30 mins) and observed that none of the MIG metrics had container_name, pod_name, or pod_namespace info.
Can you share an example where this has ever been populated? Thanks
dcgm_fi_prof_dram_active{gpu="0",uuid="GPU-bd8c4894-2c42-9838-f62a-659295d7c665",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.023462
@krishh85 , Can you provide details on how you ran the tests and what the output was?
@nvvfedorov
- Ran a script which captures dcgm-exporter metrics from the localhost /metrics endpoint (a sketch of such a scrape is shown after the output link below).
- Set up inference requests against a model served from a host. The host is an A100 GPU node with GPUs 0-3 MIG-enabled. Each GPU has 2 slices with 2 different memory profiles (seen in the dcgm-exporter output).
- The inference was served by a single GPU/slice (gpu == 2 and gpu_i_id == 2), which had the metric "dcgm_fi_prof_gr_engine_active" increase to 0.16 for the duration of the test. All other metrics (dcgm_fi_prof_dram_active, etc.) also had missing container info.

```
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1    2    0    3023233      C   ..._classification_inference     1026MiB |
|    1    2    0    3023234      C   ..._classification_inference     1026MiB |
|    1    2    0    3023255      C   ..._classification_inference     1026MiB |
|    1    2    0    3023259      C   ..._classification_inference     1026MiB |
|    1    2    0    3023260      C   ..._classification_inference     1026MiB |
|    1    2    0    3023261      C   ..._classification_inference     1026MiB |
|    2    1    0    4174369      C   ...:resume_education_vm0m0p1     5486MiB |
|    2    1    0    4174370      C   ...:resume_education_vm0m0p1     5486MiB |
|    2    1    0    4174371      C   ...:resume_education_vm0m0p1     5486MiB |
|    2    2    0    3094139      C   ...oyment::create_embeddings      380MiB |
|    2    2    0    3094140      C   ....handle_request_streaming      380MiB |
|    3    1    0     507462      C   ...erveDeployment::inference     2796MiB |
|    3    1    0     507463      C   ...erveDeployment::inference     2796MiB |
|    3    1    0     507464      C   ...erveDeployment::inference     2796MiB |
|    4  N/A  N/A      61391      C   ...erveDeployment::inference    18624MiB |
|    4  N/A  N/A      61392      C   ....handle_request_streaming    21860MiB |
|    4  N/A  N/A      61393      C   ...erveDeployment::inference    22000MiB |
|    6  N/A  N/A    3976023      C   ...6/tensorflow_model_server    79910MiB |
+-----------------------------------------------------------------------------+
```
Output from dcgm-exporter
https://gist.github.com/krishh85/6440d4efb6d158e40b035fefe4f70438
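Below is a minimal sketch of the kind of scrape script mentioned above: it pulls the local dcgm-exporter /metrics endpoint and prints MIG-instance metric lines whose pod label is empty. The port, metric name, and label names are taken from the outputs pasted in this issue and may differ in other setups.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumed default dcgm-exporter port; adjust if the deployment differs.
	resp, err := http.Get("http://localhost:9400/metrics")
	if err != nil {
		log.Fatalf("failed to scrape dcgm-exporter: %v", err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // metric lines can be long
	for scanner.Scan() {
		line := scanner.Text()
		// Keep MIG-instance metrics (they carry GPU_I_ID) whose pod label is
		// empty, matching the metric/label spelling shown in this issue.
		if strings.HasPrefix(line, "dcgm_fi_prof_gr_engine_active") &&
			strings.Contains(line, "GPU_I_ID=") &&
			strings.Contains(line, `pod_name=""`) {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("error reading metrics: %v", err)
	}
}
```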
@nvvfedorov Any update on this? It should be a simple test to see if it works as expected in your tests, and if it does, we can check whether this is something specific to our config, which I doubt, as we haven't made any changes in dcgm-exporter.
@nvvfedorov Based on the code, it seems like this is disabled for MIG resource names. Can you please confirm, and if so, is there any reason why this is not supported?
@krishh85, Can you provide more details about your environment configuration? What is your k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin) configuration? I am especially interested in the MIG_STRATEGY configuration: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configuration-option-details.
Also, please run a shell in the dcgm-exporter container and run the following command: `dcgmi discovery -l`
@nvvfedorov , Added the details. We use the MIXED strategy with 2 MIG slices on 4 GPUs (3g.40gb & 4g.40gb).

```yaml
env:
  - name: MIG_STRATEGY
    value: mixed
  - name: NVIDIA_MIG_MONITOR_DEVICES
    value: all
  - name: PASS_DEVICE_SPECS
    value: "true"
```
@krishh85 , Thank you for the details. If you have access to the K8S node where you run the workload, can you try to build https://github.com/k8stopologyawareschedwg/podresourcesapi-tools/tree/main and run the client on the node? Unfortunately, kubectl doesn't provide commands to list the "k8s.io/kubelet/pkg/apis/podresources/v1alpha1" API :(
As the output of the command you should see a response something like this:
```json
{
  "pod_resources": [
    {
      "name": "cuda-vector-add",
      "namespace": "default",
      "containers": [
        {
          "name": "cuda-vector-add",
          "devices": [
            {
              "resource_name": "nvidia.com/gpu",
              "device_ids": [
                "GPU-b9f9e81b-bee7-34bc-af17-132ef6592740"
              ]
            }
          ]
        }
      ]
    }
  ]
}
```
I am interested in seeing entries with "resource_name": "nvidia.com/gpu" or "resource_name": "nvidia.com/mig-".
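For reference, here is a minimal sketch of such a PodResources client in Go, similar in spirit to the podresourcesapi-tools client linked above. The kubelet socket path and the timeout are assumptions; the v1alpha1 API is the one named in the previous comment.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

// Assumed default kubelet pod-resources socket location.
const socketPath = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("failed to connect to kubelet pod-resources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("ListPodResources failed: %v", err)
	}

	// Print every device assignment; with MIG you would expect resource names
	// such as "nvidia.com/mig-3g.40gb" with MIG device IDs.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, d := range c.GetDevices() {
				fmt.Printf("%s/%s container=%s resource=%s devices=%v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(),
					d.GetResourceName(), d.GetDeviceIds())
			}
		}
	}
}
```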
@nvvfedorov I doubt I will be able to do that, as I won't be able to download and run external packages on hosts without several reviews. Since you know the setup, wouldn't you be able to test it on your end? At this point it seems like we are still trying to figure out whether this is supported vs. a bug.
@krishh85 , Thank you! It will take time, but we can reproduce it on our end.
@krishh85, Unfortunately, I didn't manage to reproduce the issue:
Here are the steps that I tried:

- Enable MIG and create a MIG instance: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-nvidia-smi-output
- Run the workload: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-workload-yaml
- Read the nvidia-device-plugin logs: `kubectl -n kube-system logs nvidia-device-plugin-daemonset-hrszl`. Output: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-nvidia-device-plugin-log
- Query the metrics: `curl -v http://localhost:9400/metrics`. Output: https://gist.github.com/nvvfedorov/fb94e33827743557171d3adb54c13a8b#file-metrics-output-out
As I can see, pod, namespace, and container names are in the metrics output.
@krishh85 , I think you need to check the nvidia-device-plugin logs. If the NVIDIA device plugin doesn't see the MIG instances and configuration, dcgm-exporter cannot map metrics onto pods.
@nvvfedorov I have attached the nvidia-device-plugin logs and don't really see anything that is different, except for the fact that you are running the device plugin as a container in a pod whereas we run it as a daemonset. ref: nvidia-device-plugin logs
The resources also seem to be the same; would that mean the k8s API response would be the same as defined/registered in the device plugin? Do we still need to run the debugging tool? Example:

```json
{ "pattern": "3g.40gb", "name": "nvidia.com/mig-3g.40gb" }
```
Please check whether the dcgm-exporter runs with the "--kubernetes" parameter. How was the dcgm-exporter deployed?
@krishh85, There are two ways to enable kubernetes support in the dcgm-exporter:
- Set the environment variable: DCGM_EXPORTER_KUBERNETES=true
- Pass command line parameter: "--kubernetes".
Please check that the dcgm-exporter runs with one of those options.
@krishh85 , Instead of running the debug tool, you can add `logrus.Infof("deviceToPod: %+v", deviceToPod)` here: https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L78, then rebuild and deploy the dcgm-exporter image.
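For illustration, a minimal, self-contained sketch of what the suggested log line would print; the `deviceToPod` map and `PodInfo` type below are stand-ins, not the actual dcgm-exporter types, and the key/value contents are taken from the metrics pasted earlier in this issue.

```go
package main

import "github.com/sirupsen/logrus"

// PodInfo is an illustrative stand-in for the value type dcgm-exporter keeps
// per device; the real struct may differ.
type PodInfo struct {
	Name      string
	Namespace string
	Container string
}

func main() {
	// Hypothetical mapping keyed by GPU UUID (or MIG device ID).
	deviceToPod := map[string]PodInfo{
		"GPU-931446e6-0413-2864-5b76-11227dcda4ae": {Name: "d85np-t6mdg", Namespace: "ai-1", Container: "-grpc-pod"},
	}

	// The suggested debug line: with MIG, entries keyed by MIG device IDs
	// should also show up here if the mapping works.
	logrus.Infof("deviceToPod: %+v", deviceToPod)
}
```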
@nvvfedorov The current arguments don't seem to include the kubernetes flag. I can enable it, but then I would expect even non-MIG GPUs to have the pod info missing, which isn't the case (I couldn't find this info in the dcgm-exporter code), but I can certainly enable it.
arguments: [ "--collectors", "/etc/dcgm-exporter/1.x-compatibility-metrics.csv", "--use-old-namespace", "--no-hostname" ]
@nvvfedorov Seems like the only reference to the kubernetes flag is -> https://github.com/NVIDIA/dcgm-exporter/blob/603e44da91269302e51f309c06c05689c8a49fd6/pkg/dcgmexporter/pipeline.go#L122
The flag is defined here: https://github.com/NVIDIA/dcgm-exporter/blob/603e44da91269302e51f309c06c05689c8a49fd6/pkg/cmd/app.go#L112
You are right, the line you pointed to uses the flag value to enable mapping onto pods.
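As an illustration of the pattern those two links point at (not the exact app.go code), here is a sketch of a urfave/cli boolean flag, also settable through the DCGM_EXPORTER_KUBERNETES environment variable, gating the pod-mapping step:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/urfave/cli/v2"
)

func main() {
	app := &cli.App{
		Name: "dcgm-exporter-flag-demo", // illustrative name only
		Flags: []cli.Flag{
			&cli.BoolFlag{
				Name:    "kubernetes",
				Usage:   "Enable mapping of metrics to Kubernetes pods",
				Value:   false,
				EnvVars: []string{"DCGM_EXPORTER_KUBERNETES"},
			},
		},
		Action: func(c *cli.Context) error {
			// Either "--kubernetes" or DCGM_EXPORTER_KUBERNETES=true turns this on.
			if c.Bool("kubernetes") {
				fmt.Println("pod mapping enabled: metrics carry pod/namespace/container labels")
			} else {
				fmt.Println("pod mapping disabled: pod labels stay empty")
			}
			return nil
		},
	}
	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}
```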
— Thank you, Vadym Fedorov
@nvvfedorov Even though the code sets the default to false, I checked the env variable and it is set to true. It also doesn't explain why non-MIG GPUs emit the pod resources; i.e., if it were false, then I would expect it not to emit any pod resources irrespective of MIG status.
```
root@ltx1-hcl40912:/# printenv | grep DCGM_EXPORTER_KUBERNETES
DCGM_EXPORTER_KUBERNETES=true
```
@krishh85 , Please explain what you mean: "non-MIG GPUs emit the pod resources".
@nvvfedorov I meant GPUs that do not have MIG enabled. If you look at the dcgm-exporter output, you will notice that GPUs 4-7 do have pod resources being published. The '--kubernetes' flag doesn't seem to be specific to MIG.
Example:

```
dcgm_fi_prof_gr_engine_active{gpu="6",uuid="GPU-3f5d3fb2-5308-160b-ebe4-cf4c3c1e8b1d",device="nvidia6",modelName="NVIDIA A100 80GB PCIe",container_name="llm",pod_name="-8lrkr-xqd9x",pod_namespace="ai-2"} 0.151145
```
@krishh85 , Here are the metrics from your message:
dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
dcgm_fi_prof_gr_engine_active{gpu="4",uuid="GPU-931446e6-0413-2864-5b76-11227dcda4ae",device="nvidia4",modelName="NVIDIA A100 80GB PCIe",container_name="-grpc-pod",pod_name="d85np-t6mdg",pod_namespace="ai-1"} 0.008246
As we see from the metrics output:
- There is no pod associated with GPU = 3:

  ```
  dcgm_fi_prof_gr_engine_active{gpu="3",uuid="GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.000000
  ```

- There is the "d85np-t6mdg" pod created in the namespace "ai-1", associated with GPU = 4:

  ```
  dcgm_fi_prof_gr_engine_active{gpu="4",uuid="GPU-931446e6-0413-2864-5b76-11227dcda4ae",device="nvidia4",modelName="NVIDIA A100 80GB PCIe",container_name="-grpc-pod",pod_name="d85np-t6mdg",pod_namespace="ai-1"} 0.008246
  ```
For non-empty "pod_namespace," "pod_name," and "container_name" values, a pod with a gpu=3 resource must exist. If not, dcgm-exporter returns empty values for these metrics. These labels depend on the existence of the pod only. If the pod doesn't exist, the dcgm-exporter continues to read metrics from GPU but will not be able to populate "pod_namespace, "pod_name," and "container_name" values. This behavior doesn't depend on MIG configuration.
@nvvfedorov It feels like we are going around in circles. If you check the dcgm-exporter log and some of the initial comments, for example below:
dcgm_fi_prof_dram_active{gpu="0",uuid="GPU-bd8c4894-2c42-9838-f62a-659295d7c665",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1",container_name="",pod_name="",pod_namespace=""} 0.023462]
Here we have gpu = 0 with GPU_I_ID = 1, with "empty" pod resources but a utilization of 0.023462, i.e. inference requests were served from this MIG-enabled GPU, which couldn't have been possible without a pod/container allocated to GPU 0. Is this understanding correct?
@krishh85, here is what could happen: the pod was deleted, but the metric value was available.
The dcgm-exporter reads metric values from the driver and directly from the GPU. The driver and GPU don't know about pods and the environment.
When we read a metric, we try to map its attributes, like the GPU, onto a pod by reading the pod resources and matching by GPU ID. If the pod resources don't return a matching pod, dcgm-exporter produces empty pod labels.
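To make that mapping step concrete, here is a minimal sketch, not the actual dcgm-exporter implementation: PodResources assignments are turned into a device-to-pod lookup, and each metric's device ID is matched against it; anything without a match keeps empty pod labels. Type names and the MIG device-ID format are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

type PodInfo struct {
	Name, Namespace, Container string
}

// assignment mirrors what the kubelet PodResources API reports per container.
type assignment struct {
	pod          PodInfo
	resourceName string
	deviceIDs    []string
}

// buildDeviceToPod indexes every device ID (full GPU UUIDs and MIG device IDs
// alike) that the kubelet reports as assigned to a pod.
func buildDeviceToPod(assignments []assignment) map[string]PodInfo {
	deviceToPod := map[string]PodInfo{}
	for _, a := range assignments {
		// Both nvidia.com/gpu and nvidia.com/mig-* resources carry device IDs.
		if a.resourceName != "nvidia.com/gpu" && !strings.HasPrefix(a.resourceName, "nvidia.com/mig-") {
			continue
		}
		for _, id := range a.deviceIDs {
			deviceToPod[id] = a.pod
		}
	}
	return deviceToPod
}

func main() {
	deviceToPod := buildDeviceToPod([]assignment{
		{
			pod:          PodInfo{Name: "d85np-t6mdg", Namespace: "ai-1", Container: "-grpc-pod"},
			resourceName: "nvidia.com/mig-4g.40gb",
			// Hypothetical MIG device ID; the real format depends on the device plugin.
			deviceIDs: []string{"MIG-GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb/2/0"},
		},
	})

	// A metric whose device ID has no entry keeps empty pod labels.
	for _, metricDevice := range []string{
		"MIG-GPU-6043e4b2-feaa-8010-34c7-61d1a01576bb/2/0", // matched -> labeled
		"GPU-af2c63a8-8c8b-21cc-581a-bab6bba89d08",         // unmatched -> empty labels
	} {
		if pod, ok := deviceToPod[metricDevice]; ok {
			fmt.Printf("%s -> pod=%s namespace=%s container=%s\n",
				metricDevice, pod.Name, pod.Namespace, pod.Container)
		} else {
			fmt.Printf("%s -> pod=\"\" namespace=\"\" container=\"\"\n", metricDevice)
		}
	}
}
```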
@nvvfedorov , We ran load tests, observed over a period of time (10-15), and increased the requests 2x, and we observed the metric/utilization values also increase. Also, we have several apps running in production in a MIG setup, so the pods can't have been deleted, and we have never seen pod resources exposed on those MIG GPUs.