Pod and Namespace Labels Missing in dcgm-exporter Metrics
Issue Description
I'm using the following Helm values to deploy the dcgm-exporter pod for the Datadog integration:
image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 3.1.8-3.1.5-ubuntu20.04

arguments: ["-m", "monitoring:datadog-dcgm-exporter-configmap"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name:

rollingUpdate:
  maxUnavailable: 1
  maxSurge: 0

podAnnotations:
  ad.datadoghq.com/exporter.checks: |-
    {
      "dcgm": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9400/metrics"
          }
        ]
      }
    }

podSecurityContext: {}

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  annotations: {}

serviceMonitor:
  enabled: false

nodeSelector:
  node-role.kubernetes.io/worker: "true"

tolerations: []
affinity: {}

extraHostVolumes: []
extraConfigMapVolumes: []
extraVolumeMounts: []

extraEnv:
  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    value: '{"pod":"pod","namespace":"namespace"}'
  - name: NVIDIA_MIG_MONITORING
    value: "1"
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"

kubeletPath: "/var/lib/kubelet/pod-resources"
Additionally, I'm using the following ConfigMap and RBAC configuration:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-datadog-cm
  namespace: monitoring
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["datadog-dcgm-exporter-configmap"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-datadog
  namespace: monitoring
subjects:
  - kind: ServiceAccount
    name: dcgm-datadog-dcgm-exporter
    namespace: monitoring
roleRef:
  kind: Role
  name: dcgm-exporter-read-datadog-cm
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-dcgm-exporter-configmap
  namespace: monitoring
data:
  metrics: |
    # Metrics configuration
    DCGM_FI_DEV_SM_CLOCK ,gauge ,SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK ,gauge ,Memory clock frequency (in MHz).
    ...
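The RoleBinding subject is meant to match the ServiceAccount the chart generates with serviceAccount.create: true and no name override, which I believe corresponds to my Helm release name (assumed here to be dcgm-datadog), i.e. roughly:

# Sketch of the chart-generated ServiceAccount; the "dcgm-datadog" prefix is an
# assumption based on the Helm release name and the chart's fullname convention.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-datadog-dcgm-exporter
  namespace: monitoring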
After deploying, I noticed that the container, pod, and namespace labels are empty in the exported metrics. Here is an example of the output:
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-0f039abb-366b-4158-f72f-04a0a30cc631",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",Hostname="lambda-hyperplane01",DCGM_FI_CUDA_DRIVER_VERSION="12010",DCGM_FI_DEV_BRAND="NVIDIA",DCGM_FI_DEV_MINOR_NUMBER="2",DCGM_FI_DEV_NAME="NVIDIA A100-SXM4-80GB",DCGM_FI_DEV_SERIAL="1324521023176",DCGM_FI_DRIVER_VERSION="520.61.05",DCGM_FI_PROCESS_NAME="/usr/bin/dcgm-exporter",container="",namespace="",pod=""} 210
Could you please shed some light on where I might have missed a configuration setting to ensure that the pod and namespace labels are populated in the exporter?
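For comparison, my expectation is that a working setup would attribute each GPU to the consuming workload, so the same metric would look something like this (pod, namespace, and container values are made up for illustration):

# Hypothetical output once GPU-to-pod attribution works; label values invented.
DCGM_FI_DEV_SM_CLOCK{gpu="0",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",container="training",namespace="ml-workloads",pod="train-job-0"} 210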
@qimike, did you install the NVIDIA Kubernetes device plugin? https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes
Also, to see non-empty pod and container labels, you need a workload (pods) actively running on the GPUs in question.
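For example, a minimal test pod like the following (name, namespace, and image tag are illustrative and may need adjusting for your cluster) requests a GPU; once it is running, the exporter should attribute that GPU's metrics to it:

# Minimal GPU workload for testing attribution; names and image tag are examples.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1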
This feature is missing from the exporter, cf https://github.com/NVIDIA/dcgm-exporter/issues/423
I am also experiencing this issue. I installed the NVIDIA Kubernetes device plugin using the latest Helm chart from https://nvidia.github.io/k8s-device-plugin. Is there anything I should verify?
Additionally, I don't see the device-to-pod mapping logs in my dcgm-exporter pod.
Any assistance would be greatly appreciated.
I think the issue https://github.com/NVIDIA/dcgm-exporter/issues/423 is different from the issue described here.
I am also experiencing this issue. Is there any solution?