bug: The configuration of relabeling still does not take effect!
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04.4 LTS
- Kernel Version: 5.4.0-147-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.7.0-rc.1
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s, v1.26.2
- GPU Operator Version: gpu-operator-v23.9.0
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Relabeling is supported in the values.yaml file of the official repository:
dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.2.6-3.1.9-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"
    - name: DCGM_EXPORTER_COLLECTORS
      value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
  resources: {}
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace
AND: I installed the latest version of nvidia/gpu-operator using Helm and customized the values.yaml file (the install command is sketched after the values):
cdi:
  enabled: true
  default: true
driver:
  enabled: false
  rdma:
    enabled: true
    useHostMofed: true
toolkit:
  enabled: false
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
      - action: replace
        sourceLabels:
          - __meta_kubernetes_pod_node_name
        targetLabel: instance
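For reference, the values above were applied with a standard Helm upgrade along these lines (the nvidia repo alias and exact flags are an assumption; the release and namespace names match the helm ls output below):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    -f values.yaml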
My Helm releases:
$ helm ls --all-namespaces
NAME          NAMESPACE     REVISION  UPDATED                               STATUS    CHART                 APP VERSION
gpu-operator  gpu-operator  10        2023-11-06 16:58:33.967677 +0800 CST  deployed  gpu-operator-v23.9.0  v23.9.0
But the relabeling configuration still does not take effect!
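For anyone reproducing this, the two places to compare are the ClusterPolicy spec and the ServiceMonitor the operator renders from it; assuming the default resource names from this chart, something like:

$ kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.dcgmExporter.serviceMonitor}'
$ kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter -o yaml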
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
None.
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs: journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
Help!!!
@bluemiaomiao Can you share the yaml manifest of the rendered dcgm-exporter daemonset?
➜ ~ kubectl get daemonsets -n gpu-operator nvidia-dcgm-exporter -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "3"
    nvidia.com/last-applied-hash: "80895247"
    openshift.io/scc: nvidia-dcgm-exporter
  creationTimestamp: "2023-09-09T08:40:45Z"
  generation: 3
  labels:
    app: nvidia-dcgm-exporter
    app.kubernetes.io/managed-by: gpu-operator
    helm.sh/chart: gpu-operator-v23.9.0
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: nvidia.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterPolicy
    name: cluster-policy
    uid: f1b90f5e-45ba-4270-a048-ab210729fa91
  resourceVersion: "237868434"
  uid: 475fafee-3427-4b7b-8488-042ea3ef82df
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.0
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.6-3.1.9-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for
          nvidia container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.0
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 26
  desiredNumberScheduled: 26
  numberAvailable: 26
  numberMisscheduled: 0
  numberReady: 26
  observedGeneration: 3
  updatedNumberScheduled: 26
@tariq1890
When I manually add the configuration with kubectl edit -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter, my relabeling configuration is removed again after a while.
This is a serious bug that affects our production monitoring and alerting; having the Pod IP as the instance label has no practical value for us.
At present any change made through Lens or kubectl edit is lost again after a few minutes. A workable approach for now is to remove the built-in ServiceMonitor and manage your own (a replacement is sketched after the command below):
kubectl delete -n gpu-operator servicemonitors.monitoring.coreos.com nvidia-dcgm-exporter
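If the built-in ServiceMonitor is disabled or removed, a self-managed one can carry the relabeling instead. A minimal sketch, assuming the nvidia-dcgm-exporter Service keeps the app: nvidia-dcgm-exporter label seen on the daemonset above and that your Prometheus instance selects ServiceMonitors in the gpu-operator namespace:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter-custom   # hypothetical name for the self-managed copy
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter       # assumed to match the operator-created Service
  endpoints:
    - targetPort: 9400                # container port taken from the daemonset above
      interval: 15s
      relabelings:
        - action: replace
          sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: instance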
I just ran into the same problem. I tried using a similar relabeling for the ServiceMonitor. But in my case, the helm chart failed to install when using the example config:
error validating data: [ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "source_labels" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings, ValidationError(ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings[0]): unknown field "target_label" in com.nvidia.v1.ClusterPolicy.spec.dcgmExporter.serviceMonitor.relabelings]
I had to change these to targetLabel and sourceLabels to make the installation work. Yet, I don't see any relabeling taking effect (a way to check where the labels get dropped is sketched after the config below).
relabelings:
  - sourceLabels: [ __meta_kubernetes_pod_node_name ]
    action: replace
    targetLabel: kubernetes_node
  - sourceLabels: [ __meta_kubernetes_pod_container_name ]
    action: replace
    targetLabel: container
  - sourceLabels: [ __meta_kubernetes_namespace ]
    action: replace
    targetLabel: namespace
  - sourceLabels: [ __meta_kubernetes_pod_name ]
    action: replace
    targetLabel: pod
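In case it helps narrow this down, one way to see where the labels get dropped (resource names assume the defaults used earlier in this thread; <prometheus-host> is a placeholder) is to dump the relabelings actually present on the rendered ServiceMonitor and compare them with the labels Prometheus reports for the dcgm targets:

$ kubectl get servicemonitor -n gpu-operator nvidia-dcgm-exporter \
    -o jsonpath='{.spec.endpoints[*].relabelings}'
$ curl -s http://<prometheus-host>:9090/api/v1/targets \
    | jq '.data.activeTargets[] | select(.scrapePool | test("dcgm")) | .labels'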