
Pod and Namespace Labels Missing in dcgm-exporter Metrics

Open qimike opened this issue 1 year ago • 3 comments

Issue Description

I'm using the following Datadog Helm values to deploy the dcgm-exporter pod:

image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 3.1.8-3.1.5-ubuntu20.04

arguments: ["-m", "monitoring:datadog-dcgm-exporter-configmap"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name:

rollingUpdate:
  maxUnavailable: 1
  maxSurge: 0

podAnnotations:
  ad.datadoghq.com/exporter.checks: |-
    {
      "dcgm": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9400/metrics"
          }
        ]
      }
    }

podSecurityContext: {}

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  annotations: {}

serviceMonitor:
  enabled: false

nodeSelector:
  node-role.kubernetes.io/worker: "true"

tolerations: []

affinity: {}

extraHostVolumes: []

extraConfigMapVolumes: []

extraVolumeMounts: []

extraEnv:
  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    value: '{"pod":"pod","namespace":"namespace"}'
  - name: NVIDIA_MIG_MONITORING
    value: "1"
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"

kubeletPath: "/var/lib/kubelet/pod-resources"

Additionally, I'm using the following ConfigMap and RBAC configuration:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dcgm-exporter-read-datadog-cm
  namespace: monitoring
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["datadog-dcgm-exporter-configmap"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dcgm-exporter-datadog
  namespace: monitoring
subjects:
- kind: ServiceAccount
  name: dcgm-datadog-dcgm-exporter
  namespace: monitoring
roleRef:
  kind: Role
  name: dcgm-exporter-read-datadog-cm
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-dcgm-exporter-configmap
  namespace: monitoring
data:
  metrics: |
    # Metrics configuration
    DCGM_FI_DEV_SM_CLOCK ,gauge ,SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK ,gauge ,Memory clock frequency (in MHz).
    ...

After deploying, I noticed that the pod and namespace labels appear to be empty in the exported metrics. Here is an example metric output:

DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-0f039abb-366b-4158-f72f-04a0a30cc631",device="nvidia0",modelName="NVIDIA A100-SXM4-80GB",Hostname="lambda-hyperplane01",DCGM_FI_CUDA_DRIVER_VERSION="12010",DCGM_FI_DEV_BRAND="NVIDIA",DCGM_FI_DEV_MINOR_NUMBER="2",DCGM_FI_DEV_NAME="NVIDIA A100-SXM4-80GB",DCGM_FI_DEV_SERIAL="1324521023176",DCGM_FI_DRIVER_VERSION="520.61.05",DCGM_FI_PROCESS_NAME="/usr/bin/dcgm-exporter",container="",namespace="",pod=""} 210
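For anyone wanting to check a whole scrape rather than eyeball one line, here is a minimal Python sketch that reports which Kubernetes labels are missing or empty on a sample line. The function names `parse_labels` and `empty_k8s_labels` are made up for illustration, the regex only handles the simple `key="value"` form shown above, and the sample is an abbreviated copy of the metric line from this report.

```python
import re

# Matches one key="value" pair inside the {...} label set of a sample line.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(sample_line):
    """Return the label dict of a single Prometheus text-format sample line."""
    start = sample_line.find("{")
    end = sample_line.rfind("}")
    if start == -1 or end == -1:
        return {}  # the sample carries no label set
    return dict(LABEL_RE.findall(sample_line[start + 1:end]))


def empty_k8s_labels(sample_line, wanted=("pod", "namespace", "container")):
    """List the Kubernetes labels that are missing or empty on a sample line."""
    labels = parse_labels(sample_line)
    return [name for name in wanted if not labels.get(name)]


# Abbreviated copy of the metric line reported above.
sample = (
    'DCGM_FI_DEV_SM_CLOCK{gpu="0",device="nvidia0",'
    'Hostname="lambda-hyperplane01",container="",namespace="",pod=""} 210'
)
print(empty_k8s_labels(sample))
```

Running the same check over every line of the `/metrics` output on port 9400 would show whether the labels are empty for all GPUs or only for those without a workload.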

Could you please shed some light on where I might have missed a configuration setting to ensure that the pod and namespace labels are populated in the exporter?

qimike, Oct 30 '24 20:10

@qimike, did you install the Kubernetes device plugin (https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes)?

nvvfedorov, Oct 30 '24 21:10

Also, to see non-empty pod and container labels in the metrics, you need a workload (pods) running on the relevant GPUs.
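One way to confirm that such a workload exists is to look for running pods that request the `nvidia.com/gpu` resource. The sketch below assumes the pod list has the shape produced by `kubectl get pods -A -o json`. The helper name `gpu_pods` and the two example pods are hypothetical. It only checks container limits for the plain `nvidia.com/gpu` resource, so clusters that expose MIG slices under other resource names would need a broader match.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def gpu_pods(pod_list):
    """Return (namespace, name) pairs for running pods with a GPU limit."""
    found = []
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Running":
            continue
        for container in pod.get("spec", {}).get("containers", []):
            limits = container.get("resources", {}).get("limits") or {}
            if GPU_RESOURCE in limits:
                meta = pod["metadata"]
                found.append((meta["namespace"], meta["name"]))
                break  # one GPU container is enough to count the pod
    return found


# Hypothetical data in the shape of `kubectl get pods -A -o json`.
example = {
    "items": [
        {
            "metadata": {"namespace": "ml", "name": "trainer-0"},
            "spec": {"containers": [
                {"name": "main",
                 "resources": {"limits": {"nvidia.com/gpu": "1"}}},
            ]},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"namespace": "monitoring", "name": "exporter-abc"},
            "spec": {"containers": [{"name": "exporter", "resources": {}}]},
            "status": {"phase": "Running"},
        },
    ]
}
print(gpu_pods(example))
```

If this returns an empty list for the node in question, empty `pod` and `namespace` labels are expected, because no pod is attached to the GPU.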

nvvfedorov, Oct 30 '24 22:10

This feature is missing from the exporter, cf https://github.com/NVIDIA/dcgm-exporter/issues/423

mtparet, Nov 20 '24 12:11

I am also experiencing this issue. I installed the NVIDIA Kubernetes device plugin using the latest Helm chart from https://nvidia.github.io/k8s-device-plugin. Is there anything I should verify?

Additionally, I don't see the device-to-pod mapping logs in my dcgm-exporter pod.

Any assistance would be greatly appreciated.


I think the issue https://github.com/NVIDIA/dcgm-exporter/issues/423 is different from the issue described here.

michaelact, May 02 '25 11:05

I am also experiencing this issue. Is there any solution?

conquerorAlex, Aug 29 '25 08:08