nvidia-device-plugin process at 100% CPU with MIG enabled on A100

Open shiquan1988 opened this issue 1 month ago • 1 comment

100% CPU

Image

nvidia-smi:

Image

Deployment manifest (DaemonSet):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
      namespace: kube-system
    spec:
      containers:
      - env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        - name: MIG_STRATEGY
          value: mixed
        # Both v0.18.0 and v0.16.2 show the same problem.
        image: my-internal-registry/k8s-device-plugin:v0.16.2
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
      dnsPolicy: ClusterFirst
      nodeSelector:
        GPU: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

golang pprof SVG

Image

shiquan1988, Nov 30 '25 16:11

Thanks @shiquan1988. Would you be able to provide the logs as well?

As a follow-up question: did this start recently, for example after an NVIDIA driver upgrade?

elezar, Dec 02 '25 10:12
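
The logs requested above could be gathered along these lines. This is a minimal sketch, assuming `kubectl` is on the PATH and configured for the affected cluster. The namespace and label selector are taken from the DaemonSet manifest in the report; the function names, the output file name, and the `--tail` value are illustrative choices, not part of the original thread.

```python
import shlex
import subprocess

# Namespace and pod label come from the DaemonSet manifest in this issue.
NAMESPACE = "kube-system"
SELECTOR = "name=nvidia-device-plugin-ds"


def build_logs_command(namespace=NAMESPACE, selector=SELECTOR, tail=500):
    """Return the kubectl argv that fetches recent logs from the plugin pods."""
    return [
        "kubectl", "logs",
        "--namespace", namespace,
        "--selector", selector,
        f"--tail={tail}",   # assumed line count; with a selector kubectl defaults to 10
        "--timestamps",     # prefix each line with its timestamp
        "--prefix",         # prefix each line with the pod and container name
    ]


def collect_logs(path="device-plugin.log"):
    """Run kubectl and write its output to `path` (hypothetical file name)."""
    with open(path, "w") as out:
        subprocess.run(
            build_logs_command(),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=True,
        )


# Print the command so it can be reviewed or pasted into a shell.
print(shlex.join(build_logs_command()))
```

Calling `collect_logs()` on a machine with cluster access would write the output to a file that can be attached to the issue. Printing the command first makes it easy to check the selector against the pods that are actually running.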