containers-roadmap
[EKS] [eks-node-monitoring-agent]: DCGM Error: Message: "failed to initialize DCGM: Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
Issue Overview: eks-node-monitoring-agent cannot connect to DCGM and transitions the node to AcceleratedHardwareReady = False. DCGM is installed and runs fine with the GPU Operator (latest version)
Findings:
- none
Impact: • The node is marked unhealthy and EKS auto repair replaces it
Expected Behavior: • The monitoring agent should be able to connect to DCGM and should not set AcceleratedHardwareReady = False
Requesting Labels: EKS, Amazon Elastic Kubernetes Service
{"level":"error","ts":"2025-03-04T07:57:23Z","msg":"failed to reconcile DCGM state","hostname":"ip-10-10-15-1.eu-west-1.compute.internal","monitor":"nvidia","error":"failed to initialize DCGM: Error connecting to nv-hostengine: Host engine connection invalid/disconnected","stacktrace":"golang.a2z.com/EKSNodeMonitoringAgent/internal/monitor/nvidia.(*NvidiaMonitor).reconcileDcgm\n\t/local/p4clients/pkgbuild-const/workspace/src/EKSNodeMonitoringAgent/internal/monitor/nvidia/monitor.go:62\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:204\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:259\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/local/p4clients/pkgbuild-const/workspace/tmp/gomodcache/k8s.io/[email protected]/pkg/util/wait/backoff.go:170"}
Thanks for the report. We'll look into improving how this plays with the GPU Operator, but the DCGMError is currently not acted on by EKS auto repair. We'll have this added to the health issue docs to make it clearer.
To make sure we understand correctly, did you observe the node get replaced by auto repair? It may have been due to another unhealthy condition that happened to be present if so.
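For anyone trying to answer that, the node's condition history and events usually show which condition actually tripped repair; a sketch, with <node-name> as a placeholder:

  $ kubectl describe node <node-name> | sed -n '/Conditions:/,/Addresses:/p'
  $ kubectl get events \
      --field-selector involvedObject.kind=Node,involvedObject.name=<node-name> \
      --sort-by=.lastTimestamp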
@ndbaker1 Karpenter certainly terminates a node if the AcceleratedHardwareReady status is set to false. We had to disable auto repair to bring our GPU instances back up. According to the EKS docs, the behavior should be similar with Managed Node Groups.
P.S.: This and the other ticket should be merged.
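For others in the same spot: Karpenter's node repair sits behind a feature gate, so one way to switch it off (assuming your Helm chart version exposes the gate under settings.featureGates.nodeRepair; check your chart's values) is:

  $ helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
      --namespace kube-system --reuse-values \
      --set settings.featureGates.nodeRepair=false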
Same issue here
Any update on this? We are facing the same issue with Karpenter.
Facing the same issue when enabling node auto repair on a G6F node group. Is there any update?
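For managed node groups, auto repair can be toggled per node group; a sketch assuming a recent AWS CLI that supports the nodeRepairConfig field, with my-cluster and gpu-nodes as placeholder names:

  $ aws eks update-nodegroup-config \
      --cluster-name my-cluster \
      --nodegroup-name gpu-nodes \
      --node-repair-config enabled=false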