gpu-operator bug: The configuration of relabeling still does not take effect!

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.4 LTS
Kernel Version: 5.4.0-147-generic
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.7.0-rc.1
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s, v1.26.2
GPU Operator Version: gpu-operator-v23.9.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

The relabeling function is supported in the values.yaml file in the official repository:

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.2.6-3.1.9-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace
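
The commented example above uses Prometheus's native snake_case keys (source_labels, target_label). The validation error reported later in this thread shows that the ClusterPolicy CRD rejects those keys, so a values override presumably needs the camelCase field names used by the prometheus-operator ServiceMonitor API. A minimal sketch of the same rule in that form, assuming the remaining keys (regex, replacement, action) are accepted unchanged; whether the operator then propagates the rule to the generated ServiceMonitor is exactly what this issue questions:

```yaml
# Sketch only: camelCase translation of the commented sample rule.
# Assumes the ClusterPolicy schema mirrors prometheus-operator's RelabelConfig fields.
dcgmExporter:
  serviceMonitor:
    enabled: true
    relabelings:
      - sourceLabels:
          - __meta_kubernetes_pod_node_name
        regex: (.*)
        targetLabel: instance
        replacement: $1
        action: replace
```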

I installed the latest version of the NVIDIA GPU Operator using Helm with a customized values.yaml file:

cdi:
  enabled: true
  default: true
driver:
  enabled: false
  rdma:
    enabled: true
    useHostMofed: true
toolkit:
  enabled: false
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
      - action: replace
        sourceLabels:
          - __meta_kubernetes_pod_node_name
        targetLabel: instance

My Helm releases:

$ helm ls --all-namespaces
NAME                 	NAMESPACE    	REVISION	UPDATED                              	STATUS  	CHART                       	APP VERSION
gpu-operator         	gpu-operator 	10      	2023-11-06 16:58:33.967677 +0800 CST 	deployed	gpu-operator-v23.9.0        	v23.9.0

But the configuration of relabeling still does not take effect!

Others:

https://github.com/NVIDIA/gpu-operator/issues/537

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

None.

4. Information to attach (optional if deemed irrelevant)

[ ] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
[ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
[ ] If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
[ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
[ ] containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

Nov 06 '23 09:11 halohsu

Help!!!

Nov 13 '23 06:11 halohsu

@bluemiaomiao Can you share the yaml manifest of the rendered dcgm-exporter daemonset?

Nov 15 '23 19:11 tariq1890

➜  ~ kubectl get daemonsets -n gpu-operator nvidia-dcgm-exporter -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "3"
    nvidia.com/last-applied-hash: "80895247"
    openshift.io/scc: nvidia-dcgm-exporter
  creationTimestamp: "2023-09-09T08:40:45Z"
  generation: 3
  labels:
    app: nvidia-dcgm-exporter
    app.kubernetes.io/managed-by: gpu-operator
    helm.sh/chart: gpu-operator-v23.9.0
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: f1b90f5e-45ba-4270-a048-ab210729fa91
  resourceVersion: "237868434"
  uid: 475fafee-3427-4b7b-8488-042ea3ef82df
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.0
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.6-3.1.9-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
          nvidia container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 26
  desiredNumberScheduled: 26
  numberAvailable: 26
  numberMisscheduled: 0
  numberReady: 26
  observedGeneration: 3
  updatedNumberScheduled: 26

@tariq1890

Nov 20 '23 08:11 halohsu

When I add the configuration manually with kubectl edit -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter, my relabeling configuration is deleted again after a while.

Nov 20 '23 09:11 halohsu

This is a huge bug that has affected our production monitoring and alerting; getting only the Pod's IP has no practical value.

Nov 20 '23 09:11 halohsu

At present, after editing with Lens or kubectl edit, the configuration is still lost after a few minutes. A workable approach is to remove the built-in ServiceMonitor:

kubectl delete -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter
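
If the built-in ServiceMonitor is removed, a separately managed one can carry the relabeling instead. The manifest below is only a sketch of that idea: the resource name dcgm-exporter-custom is hypothetical, and the Service label (app: nvidia-dcgm-exporter) and port name (gpu-metrics) are assumptions that should be checked against the actual Service in the gpu-operator namespace. It may also be necessary to set dcgmExporter.serviceMonitor.enabled to false in the Helm values so that the operator does not recreate its own ServiceMonitor.

```yaml
# Hypothetical standalone ServiceMonitor, managed outside the GPU Operator.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-custom        # hypothetical name
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter     # assumed Service label; verify on the Service
  namespaceSelector:
    matchNames:
      - gpu-operator
  endpoints:
    - port: gpu-metrics             # assumed Service port name; verify on the Service
      path: /metrics
      interval: 15s
      relabelings:
        # Same rule as in the customized values.yaml above
        - action: replace
          sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: instance
```

Depending on how Prometheus is configured, the ServiceMonitor may also need extra metadata labels to be matched by the Prometheus serviceMonitorSelector.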

Nov 20 '23 11:11 halohsu

I just ran into the same problem. I tried using a similar relabeling for the ServiceMonitor. But in my case, the helm chart failed to install when using the example config:

error validating data: [ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "source_labels" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings, ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "target_label" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings

I had to change it to targetLabel and sourceLabels to make the installation work. Yet I don't see any relabeling taking effect.

        relabelings:
          - sourceLabels: [ __meta_kubernetes_pod_node_name ]
            action: replace
            targetLabel: kubernetes_node
          - sourceLabels: [ __meta_kubernetes_pod_container_name ]
            action: replace
            targetLabel: container
          - sourceLabels: [ __meta_kubernetes_namespace ]
            action: replace
            targetLabel: namespace
          - sourceLabels: [ __meta_kubernetes_pod_name ]
            action: replace
            targetLabel: pod

Nov 30 '23 11:11 derselbst
