
gpu-operator error causes pods on time-sliced H100 node to restart intermittently

Open vinkamath opened this issue 9 months ago • 6 comments

An operator error occurs roughly once a day on our H100 node, which has time-slicing enabled on the mig-1g.10gb instances. This causes the other pods to restart, as seen below:

kgp -n nvidia-gpu-operator | grep "13h"
gpu-feature-discovery-lfkpr                                       2/2     Running     0              13h
nvidia-cuda-validator-82gm5                                       0/1     Completed   0              13h
nvidia-dcgm-exporter-vhvrc                                        1/1     Running     0              13h
nvidia-device-plugin-daemonset-qpn5r                              2/2     Running     0              13h
nvidia-gpu-operator-node-feature-discovery-worker-625qg           1/1     Running     66 (13h ago)   40d
nvidia-mig-manager-sd29c                                          1/1     Running     54 (13h ago)   40d
nvidia-operator-validator-fjk7x                                   1/1     Running     0              13h

This issue does not occur on any other node, all of which have the same configuration.

The issue appears to originate from the nvidia-gpu-operator pod, which reports that the ClusterPolicy is not ready because the following states are not ready: [state-operator-validation state-device-plugin state-dcgm-exporter gpu-feature-discovery state-mig-manager]. Here are the complete logs from when the operator had errors.
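Because the operator writes one JSON object per log entry, a small filter can reduce a dump like this one to the states reported as notReady and the error-level messages. The following is a minimal sketch and not part of the gpu-operator: the function name `summarize` is hypothetical, and the field names (`level`, `ts`, `msg`, `state:`, `status`) are taken from the log entries in this issue. It assumes each entry arrives on its own line, as it would from the raw pod logs; entries that were wrapped or merged when pasted are skipped.

```python
import json
from collections import Counter


def summarize(lines):
    """Return (not_ready_counts, error_entries) from JSON log lines.

    Hypothetical helper: field names follow the log entries in this issue.
    """
    not_ready = Counter()
    errors = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a single JSON object
        if not isinstance(entry, dict):
            continue
        # The state name is logged under the key "state:" (with a trailing colon).
        if entry.get("status") == "notReady" and "state:" in entry:
            not_ready[entry["state:"]] += 1
        if entry.get("level") == "error":
            errors.append((entry.get("ts"), entry.get("msg")))
    return not_ready, errors
```

Passing the log stream to `summarize` (for example, an open file containing the saved output of the logs command below) returns a counter of notReady states per reconcile pass plus the list of error entries, which makes it easier to see whether the same five states fail together every time.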

kl -n nvidia-gpu-operator gpu-operator-9848b47f5-t4bl5 --since=13h

{"level":"info","ts":1741165801.943385,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"} {"level":"info","ts":1741165858.1192892,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"} {"level":"info","ts":1741165858.1197896,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.1198285,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-1","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.119851,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-2","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.1198733,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.119893,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-1","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.119916,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-2","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.119936,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-gpu-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.1199582,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"stage-worker-gpu-0"} {"level":"info","ts":1741165858.1199844,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1} {"level":"info","ts":1741165858.12052,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"} {"level":"info","ts":1741165858.1205838,"logger":"controllers.ClusterPolicy","msg":"Found Resource, 
updating...","RuntimeClass":"nvidia"} {"level":"info","ts":1741165858.1295125,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"} {"level":"info","ts":1741165858.129598,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.1365764,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ServiceMonitor":"gpu-operator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.1423893,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"} {"level":"info","ts":1741165858.1575058,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"disabled"} {"level":"info","ts":1741165858.1707473,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"disabled"} {"level":"info","ts":1741165858.1755848,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.1804347,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.1888537,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.1970594,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2055392,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} 
{"level":"info","ts":1741165858.21039,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator","name":"nvidia-operator-validator"} {"level":"info","ts":1741165858.2104821,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-operator-validator"} {"level":"info","ts":1741165858.2104921,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"} {"level":"info","ts":1741165858.2163568,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.221135,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2292943,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2375562,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2460434,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2546391,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-device-plugin-entrypoint","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2589417,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"nvidia-gpu-operator","name":"nvidia-device-plugin-daemonset"} 
{"level":"info","ts":1741165858.259008,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-device-plugin-daemonset"} {"level":"info","ts":1741165858.2590172,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"} {"level":"info","ts":1741165858.2639828,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.268591,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2767787,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2848763,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2926846,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.2968853,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator","name":"nvidia-device-plugin-mps-control-daemon"} {"level":"info","ts":1741165858.296945,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mps-control-daemon","status":"ready"} {"level":"info","ts":1741165858.3062103,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"} 
{"level":"info","ts":1741165858.3112044,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3164978,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3247092,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3281274,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3354995,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator","name":"nvidia-dcgm-exporter"} {"level":"info","ts":1741165858.3355482,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-dcgm-exporter"} {"level":"info","ts":1741165858.3355608,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm-exporter","status":"notReady"} {"level":"info","ts":1741165858.3401449,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.345285,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3528743,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3607113,"logger":"controllers.ClusterPolicy","msg":"Found Resource, 
updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3687592,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3728805,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"nvidia-gpu-operator","name":"gpu-feature-discovery"} {"level":"info","ts":1741165858.3729382,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"gpu-feature-discovery"} {"level":"info","ts":1741165858.372948,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"} {"level":"info","ts":1741165858.3773358,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3821166,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3897839,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.3976412,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.4053724,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.4087691,"logger":"controllers.ClusterPolicy","msg":"Not creating resource, custom ConfigMap provided: mig-parted-config","ConfigMap":"default-mig-parted-config","Namespace":"nvidia-gpu-operator"} 
{"level":"info","ts":1741165858.4135768,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.4217677,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-mig-manager-entrypoint","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.4256167,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator","name":"nvidia-mig-manager"} {"level":"info","ts":1741165858.4256675,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-mig-manager"} {"level":"info","ts":1741165858.4256763,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"notReady"} {"level":"info","ts":1741165858.4450333,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"} {"level":"info","ts":1741165858.459023,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-manager","status":"disabled"} {"level":"info","ts":1741165858.470655,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"} {"level":"info","ts":1741165858.4843562,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"} {"level":"info","ts":1741165858.5010297,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"} {"level":"info","ts":1741165858.5145643,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"} {"level":"info","ts":1741165858.5323446,"logger":"controllers.ClusterPolicy","msg":"Kata 
Manager disabled, deleting all Kata RuntimeClasses"} {"level":"info","ts":1741165858.5323682,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"} {"level":"info","ts":1741165858.545435,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"} {"level":"error","ts":1741165858.5454733,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is not ready, states not ready: [state-operator-validation state-device-plugin state-dcgm-exporter gpu-feature-discovery state-mig-manager]"} {"level":"error","ts":1741165858.5930266,"logger":"controllers.ClusterPolicy","msg":"Operation cannot be fulfilled on clusterpolicies.nvidia.com "cluster-policy": the object has been modified; please apply your changes to the latest version and try again"} {"level":"info","ts":1741165858.593216,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"} {"level":"info","ts":1741165858.5936053,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-1","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.5936267,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-2","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.5936456,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.5936594,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-1","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.5936713,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-2","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.5936844,"logger":"controllers.ClusterPolicy","msg":"GPU 
workload configuration","NodeName":"stage-worker-gpu-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.593702,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"stage-worker-gpu-0"} {"level":"info","ts":1741165858.5937328,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165858.593749,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1} {"level":"info","ts":1741165858.5940998,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"} {"level":"info","ts":1741165858.594133,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RuntimeClass":"nvidia"} {"level":"info","ts":1741165858.6020684,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"} {"level":"info","ts":1741165858.602127,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6060507,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ServiceMonitor":"gpu-operator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.611506,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"} {"level":"info","ts":1741165858.6256945,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"disabled"} {"level":"info","ts":1741165858.637175,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"disabled"} {"level":"info","ts":1741165858.641484,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping 
update","ServiceAccount":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6463165,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6541426,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6626365,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6703837,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.674431,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator","name":"nvidia-operator-validator"} {"level":"info","ts":1741165858.6744902,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-operator-validator"} {"level":"info","ts":1741165858.6745005,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"} {"level":"info","ts":1741165858.6794107,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6842527,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.6917276,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} 
{"level":"info","ts":1741165858.699169,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7070432,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7147276,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-device-plugin-entrypoint","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.719198,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"nvidia-gpu-operator","name":"nvidia-device-plugin-daemonset"} {"level":"info","ts":1741165858.7192645,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-device-plugin-daemonset"} {"level":"info","ts":1741165858.7192745,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"} {"level":"info","ts":1741165858.723746,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7328882,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7404616,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7482626,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} 
{"level":"info","ts":1741165858.756402,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7606952,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator","name":"nvidia-device-plugin-mps-control-daemon"} {"level":"info","ts":1741165858.7607644,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mps-control-daemon","status":"ready"} {"level":"info","ts":1741165858.7700713,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"} {"level":"info","ts":1741165858.7744193,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7788103,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7870288,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7902489,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.7975512,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator","name":"nvidia-dcgm-exporter"} {"level":"info","ts":1741165858.7975993,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-dcgm-exporter"} {"level":"info","ts":1741165858.7976089,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step 
completed","state:":"state-dcgm-exporter","status":"notReady"} {"level":"info","ts":1741165858.80194,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8063357,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8141086,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.822089,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8292992,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8332913,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"nvidia-gpu-operator","name":"gpu-feature-discovery"} {"level":"info","ts":1741165858.8333452,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"gpu-feature-discovery"} {"level":"info","ts":1741165858.8333547,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"} {"level":"info","ts":1741165858.8381104,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.842631,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} 
{"level":"info","ts":1741165858.8505065,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.858727,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8660645,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8691187,"logger":"controllers.ClusterPolicy","msg":"Not creating resource, custom ConfigMap provided: mig-parted-config","ConfigMap":"default-mig-parted-config","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8736587,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8815873,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-mig-manager-entrypoint","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165858.8856401,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator","name":"nvidia-mig-manager"} {"level":"info","ts":1741165858.8856912,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-mig-manager"} {"level":"info","ts":1741165858.885706,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"notReady"} {"level":"info","ts":1741165858.9049387,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"} {"level":"info","ts":1741165858.921975,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step 
completed","state:":"state-vgpu-manager","status":"disabled"} {"level":"info","ts":1741165858.9333901,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"} {"level":"info","ts":1741165858.9473295,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"} {"level":"info","ts":1741165858.9633162,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"} {"level":"info","ts":1741165858.9769232,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"} {"level":"info","ts":1741165858.9951148,"logger":"controllers.ClusterPolicy","msg":"Kata Manager disabled, deleting all Kata RuntimeClasses"} {"level":"info","ts":1741165858.9951413,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"} {"level":"info","ts":1741165859.0092173,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"} {"level":"error","ts":1741165859.009253,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is not ready, states not ready: [state-operator-validation state-device-plugin state-dcgm-exporter gpu-feature-discovery state-mig-manager]"} {"level":"info","ts":1741165863.5940347,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"} {"level":"info","ts":1741165863.594432,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-2","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165863.5944538,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-gpu-0","GpuWorkloadConfig":"container"} 
{"level":"info","ts":1741165863.5944726,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"stage-worker-gpu-0"} {"level":"info","ts":1741165863.5944915,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165863.5945065,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-1","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165863.5945194,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-control-2","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165863.5945325,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-0","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165863.5945446,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"stage-worker-cpu-1","GpuWorkloadConfig":"container"} {"level":"info","ts":1741165863.5945578,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1} {"level":"info","ts":1741165863.594979,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"} {"level":"info","ts":1741165863.595013,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RuntimeClass":"nvidia"} {"level":"info","ts":1741165863.603993,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"} {"level":"info","ts":1741165863.60407,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6084695,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ServiceMonitor":"gpu-operator","Namespace":"nvidia-gpu-operator"} 
{"level":"info","ts":1741165863.6143007,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"} {"level":"info","ts":1741165863.6285326,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"disabled"} {"level":"info","ts":1741165863.641478,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"disabled"} {"level":"info","ts":1741165863.6460752,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6507356,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6589203,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6669335,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6748853,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6789844,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator","name":"nvidia-operator-validator"} {"level":"info","ts":1741165863.6790571,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-operator-validator"} {"level":"info","ts":1741165863.6790667,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"} 
{"level":"info","ts":1741165863.6838913,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6882865,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.6959238,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7038147,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.71172,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7198384,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-device-plugin-entrypoint","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7237535,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"nvidia-gpu-operator","name":"nvidia-device-plugin-daemonset"} {"level":"info","ts":1741165863.7238388,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-device-plugin-daemonset"} {"level":"info","ts":1741165863.723848,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"} {"level":"info","ts":1741165863.7282703,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7326922,"logger":"controllers.ClusterPolicy","msg":"Found Resource, 
updating...","Role":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.740933,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7492962,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7578511,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7622254,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-mps-control-daemon","Namespace":"nvidia-gpu-operator","name":"nvidia-device-plugin-mps-control-daemon"} {"level":"info","ts":1741165863.7622926,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mps-control-daemon","status":"ready"} {"level":"info","ts":1741165863.7723114,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"} {"level":"info","ts":1741165863.7765563,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.780971,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7886822,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.7921464,"logger":"controllers.ClusterPolicy","msg":"Found Resource, 
updating...","Service":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.799093,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"nvidia-gpu-operator","name":"nvidia-dcgm-exporter"} {"level":"info","ts":1741165863.7991433,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-dcgm-exporter"} {"level":"info","ts":1741165863.799153,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm-exporter","status":"notReady"} {"level":"info","ts":1741165863.803767,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8082604,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8158963,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8236296,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.831239,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8348842,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"nvidia-gpu-operator","name":"gpu-feature-discovery"} {"level":"info","ts":1741165863.8349252,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"gpu-feature-discovery"} 
{"level":"info","ts":1741165863.8349302,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"} {"level":"info","ts":1741165863.839545,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.844004,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8513782,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8589904,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8665333,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.869782,"logger":"controllers.ClusterPolicy","msg":"Not creating resource, custom ConfigMap provided: mig-parted-config","ConfigMap":"default-mig-parted-config","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8744042,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.8820553,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-mig-manager-entrypoint","Namespace":"nvidia-gpu-operator"} {"level":"info","ts":1741165863.885829,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"nvidia-gpu-operator","name":"nvidia-mig-manager"} {"level":"info","ts":1741165863.8858926,"logger":"controllers.ClusterPolicy","msg":"daemonset 
not ready","name":"nvidia-mig-manager"} {"level":"info","ts":1741165863.8859062,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"notReady"} {"level":"info","ts":1741165863.903852,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"} {"level":"info","ts":1741165863.9173665,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-manager","status":"disabled"} {"level":"info","ts":1741165863.9287798,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"} {"level":"info","ts":1741165863.941605,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"} {"level":"info","ts":1741165863.9568462,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"} {"level":"info","ts":1741165863.9710054,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"} {"level":"info","ts":1741165863.9900517,"logger":"controllers.ClusterPolicy","msg":"Kata Manager disabled, deleting all Kata RuntimeClasses"} {"level":"info","ts":1741165863.9900763,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"} {"level":"info","ts":1741165864.0053134,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"} {"level":"error","ts":1741165864.0053554,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is not ready, states not ready: [state-operator-validation state-device-plugin state-dcgm-exporter gpu-feature-discovery state-mig-manager]"}

This section really stuck out to me:

DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"nvidia-gpu-operator","name":"nvidia-operator-validator"
daemonset not ready","name":"nvidia-operator-validator"

The controller finds the nvidia-operator-validator DaemonSet and determines it's "identical" (configuration is up-to-date). However, it then logs "daemonset not ready" for nvidia-operator-validator. Consequently, "state-operator-validation" is marked as "notReady".
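For anyone who wants to pull the same pattern out of a longer log dump, here is a rough sketch of how the notReady steps could be filtered. It assumes the operator logs are JSON objects separated by whitespace, as in the paste above, and that the state name sits under the key `state:` (with the trailing colon, as the logs show). The function names `iter_log_records` and `not_ready_states` are my own and not part of the gpu-operator.

```python
import json


def iter_log_records(text):
    """Yield each JSON object found in text.

    Objects may be separated by spaces or newlines. Fragments that do not
    parse (for example a record truncated at the start of a paste) are skipped.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # Skip whitespace between records.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        try:
            record, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Resynchronise on the next opening brace, if there is one.
            nxt = text.find("{", pos + 1)
            if nxt == -1:
                break
            pos = nxt
            continue
        if isinstance(record, dict):
            yield record


def not_ready_states(text):
    """Return (timestamp, state) pairs for steps that reported notReady."""
    return [
        (rec.get("ts"), rec.get("state:"))
        for rec in iter_log_records(text)
        if rec.get("msg") == "ClusterPolicy step completed"
        and rec.get("status") == "notReady"
    ]


# Two records copied from the logs above.
sample = (
    '{"level":"info","ts":1741165863.6790571,"logger":"controllers.ClusterPolicy",'
    '"msg":"daemonset not ready","name":"nvidia-operator-validator"} '
    '{"level":"info","ts":1741165863.6790667,"logger":"controllers.ClusterPolicy",'
    '"msg":"ClusterPolicy step completed","state:":"state-operator-validation",'
    '"status":"notReady"}'
)

for ts, state in not_ready_states(sample):
    print(ts, state)
```

Running this over the saved operator logs should list every reconcile pass in which a state flipped to notReady, which makes it easier to line the timestamps up against the pod restarts.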

More logs and config info are in the comments.

vinkamath • Mar 05 '25 22:03