consul-k8s
Node Registration fails on image update
Overview of the Issue
When the agent daemonset and the server statefulset are updated simultaneously, I get continuous node registration errors afterwards.
I'm not sure whether this is the old agent failing to leave gracefully or whether it's entirely server side. Re-rolling the agent daemonset fixes the issue.
I'm updating from 1.12.0 (the default in the 0.44.0 Helm release) to 1.12.2 (the latest as of today).
I think this also left a 'ghost' Consul Connect service dangling in the catalog.
I had an extra instance of a service that corresponded to an old pod; it was still marked as healthy and all of its checks in Consul were green.
The agent on the relevant node had been erroring; restarting the pod immediately cleared up both the errors and the ghost service instance.
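For completeness, the workaround looks roughly like this (a sketch only; the `consul` namespace is an assumption, and the daemonset name matches the defaults mentioned below):

```sh
# Restart the Consul client (agent) daemonset to clear the stuck node registrations.
kubectl -n consul rollout restart daemonset/consul-consul-client

# Wait for the new agent pods to come up; the errors stop once they re-register.
kubectl -n consul rollout status daemonset/consul-consul-client
```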
Reproduction Steps
1. Update `global.image` in the Helm chart values.
2. Look for `Syncing node info failed` in the agent logs and `EnsureRegistration failed` in the server logs.
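A minimal reproduction sketch, assuming the chart is installed as release `consul` from the `hashicorp` Helm repo in a `consul` namespace (those names are assumptions; the value and log messages are the ones above):

```sh
# Bump the Consul image for clients and servers in a single upgrade.
helm upgrade consul hashicorp/consul \
  --namespace consul \
  --reuse-values \
  --set global.image=hashicorp/consul:1.12.2

# Check the client (agent) logs for the registration error.
kubectl -n consul logs daemonset/consul-consul-client --tail=200 | grep "Syncing node info failed"

# Check the server logs for the matching error.
kubectl -n consul logs statefulset/consul-consul-server --tail=200 | grep "EnsureRegistration failed"
```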
Logs
The underlying error is the same:
rpc error making call: failed inserting node: Error while renaming Node ID: "70c58ae6-8ffa-c19e-0fa0-81e3252407a1": Node name ip-10-30-1-119.us-west-2.compute.internal is reserved by node 8503f2e3-7ed8-909a-a8f4-26b959e45dde with name ip-10-30-1-119.us-west-2.compute.internal (10.30.1.119)
The agent reports this twice, as `Syncing node info failed` and as `RPC failed to server`.
On the server it's logged as `EnsureRegistration failed`.
Expected behavior
Agents / Nodes correctly rejoin the cluster after an update
Environment details
- consul-k8s: 0.44.0
- Consul: 1.12.0 / 1.12.2
- Kubernetes: EKS v1.21
Hey @hamishforbes
We don't recommend upgrading servers and clients at the same time when using the service mesh. Please see these docs: https://www.consul.io/docs/k8s/upgrade#service-mesh
Hi, this is not a Connect issue. Perhaps I shouldn't have included that in the initial report; it was just the knock-on effect I was investigating.
This issue occurs on clusters with Connect disabled as well.
The Connect issues described on that page are transient anyway; they talk about a short period of unavailability, which I don't really mind in pre-production environments or during a maintenance window, for example. The problem I'm seeing is permanent until a human intervenes (to restart the agent pods).
The instructions in the upgrade docs you linked will specifically trigger this problem:
- Set global.image in your values.yaml to the desired version:
You can also trigger the problem without making any Helm changes, simply by triggering a rollout of the server and agent pods at the same time (e.g. `kubectl rollout restart ds/consul-consul-client sts/consul-consul-server`).
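For example, roughly (the `consul` namespace is an assumption; the resource names are the defaults used above):

```sh
# Restart servers and client agents at the same time to trigger the failure.
kubectl -n consul rollout restart \
  statefulset/consul-consul-server \
  daemonset/consul-consul-client

# Once the rollout settles, the agents start logging the registration error.
kubectl -n consul logs daemonset/consul-consul-client --tail=50 | grep "Syncing node info failed"
```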