containers-roadmap
[EKS] [eks-node-monitoring-agent]: DCGM Error: Message: "failed to initialize DCGM: Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
Issue Overview: eks-node-monitoring-agent cannot connect to DCGM and transitions the node to AcceleratedHardwareReady = False. DCGM is installed and runs fine with the GPU Operator (latest version)
Findings:
- none
Impact: • The node is marked unhealthy and EKS auto repair replaces it
Expected Behavior: • The monitoring agent should be able to connect to DCGM and should not set AcceleratedHardwareReady = False
Requesting Labels: EKS, Amazon Elastic Kubernetes Service
{"level":"error","ts":"2025-03-04T07:57:23Z","msg":"failed to reconcile DCGM state","hostname":"ip-10-10-15-1.eu-west-1.compute.internal","monitor":"nvidia","error":"failed to initialize DCGM: Error connecting to nv-hostengine: Host engine connection invalid/disconnected","stacktrace":"golang.a2z.com/EKSNodeMonitoringAgent/internal/monitor/nvidia.(*NvidiaMonitor).reconcileDcgm\n\t/local/p4clients/pkgbuild-const/workspace/src/EKSNodeMonitoringAgent/internal/monitor/nvidia/monitor.go:62\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:170"}
Thanks for the report. We'll look into improving how this plays with the GPU Operator, but the DCGMError is currently not acted on by EKS auto repair. We'll have this added to the health issue docs to make it clearer.
To make sure we understand correctly, did you observe the node get replaced by auto repair? It may have been due to another unhealthy condition that happened to be present if so.
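For anyone trying to answer that, the node's condition history and events usually show which condition actually tripped repair; a sketch, with <node-name> as a placeholder:

  $ kubectl describe node <node-name> | sed -n '/Conditions:/,/Addresses:/p'
  $ kubectl get events \
      --field-selector involvedObject.kind=Node,involvedObject.name=<node-name> \
      --sort-by=.lastTimestamp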
@ndbaker1 Karpenter certainly terminates a node if the AcceleratedHardwareReady status is set to false. We had to disable auto repair to bring our GPU instances back up. According to the EKS docs, the behavior should be similar with Managed Node Groups.
P.S.: This and the other ticket should be merged.
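For others in the same spot: Karpenter's node repair sits behind a feature gate, so one way to switch it off (assuming your Helm chart version exposes the gate under settings.featureGates.nodeRepair; check your chart's values) is:

  $ helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
      --namespace kube-system --reuse-values \
      --set settings.featureGates.nodeRepair=false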
Same issue here
Any update on this? We are facing the same issue with Karpenter.
Facing the same issue when enabling node auto repair on a G6F node group. Is there any update?
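For managed node groups, auto repair can be toggled per node group; a sketch assuming a recent AWS CLI that supports the nodeRepairConfig field, with my-cluster and gpu-nodes as placeholder names:

  $ aws eks update-nodegroup-config \
      --cluster-name my-cluster \
      --nodegroup-name gpu-nodes \
      --node-repair-config enabled=false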