consul-k8s
Node Registration fails on image update
Overview of the Issue
When the agent daemonset and the server statefulset are updated simultaneously, I get continuous node registration errors afterwards.
I'm not sure whether this is the old agent failing to leave gracefully or whether it's entirely server side. Re-rolling the agent daemonset fixes the issue.
I'm updating from 1.12.0 (the default in the 0.44.0 Helm release) to 1.12.2 (the latest as of today).
I think this also left a 'ghost' Consul Connect service dangling in the catalog.
I had an extra instance of a service that corresponded to an old pod; it was still marked as healthy and all of its checks in Consul were green.
The agent on the relevant node had been erroring; restarting the pod immediately cleared up both the errors and the ghost service instance.
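For completeness, the workaround looks roughly like this (a sketch only; the `consul` namespace is an assumption, and the daemonset name matches the defaults mentioned below):

```sh
# Restart the Consul client (agent) daemonset to clear the stuck node registrations.
kubectl -n consul rollout restart daemonset/consul-consul-client

# Wait for the new agent pods to come up; the errors stop once they re-register.
kubectl -n consul rollout status daemonset/consul-consul-client
```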
Reproduction Steps
1. Update `global.image` in the Helm chart values.
2. Look for `Syncing node info failed` in the agent logs and `EnsureRegistration failed` in the server logs.
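A minimal reproduction sketch, assuming the chart is installed as release `consul` from the `hashicorp` Helm repo in a `consul` namespace (those names are assumptions; the value and log messages are the ones above):

```sh
# Bump the Consul image for clients and servers in a single upgrade.
helm upgrade consul hashicorp/consul \
  --namespace consul \
  --reuse-values \
  --set global.image=hashicorp/consul:1.12.2

# Check the client (agent) logs for the registration error.
kubectl -n consul logs daemonset/consul-consul-client --tail=200 | grep "Syncing node info failed"

# Check the server logs for the matching error.
kubectl -n consul logs statefulset/consul-consul-server --tail=200 | grep "EnsureRegistration failed"
```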
Logs
The underlying error is the same:
rpc error making call: failed inserting node: Error while renaming Node ID: "70c58ae6-8ffa-c19e-0fa0-81e3252407a1": Node name ip-10-30-1-119.us-west-2.compute.internal is reserved by node 8503f2e3-7ed8-909a-a8f4-26b959e45dde with name ip-10-30-1-119.us-west-2.compute.internal (10.30.1.119)
The agent reports this twice, as `Syncing node info failed` and as `RPC failed to server`.
On the server it's logged as `EnsureRegistration failed`.
Expected behavior
Agents / Nodes correctly rejoin the cluster after an update
Environment details
- consul-k8s: 0.44.0
- Consul: 1.12.0 / 1.12.2
- Kubernetes: EKS v1.21
Hey @hamishforbes
We don't recommend upgrading servers and clients at the same time when using the service mesh. Please see these docs: https://www.consul.io/docs/k8s/upgrade#service-mesh
Hi, this is not a Connect issue. Perhaps I shouldn't have included that in the initial report; it was just the knock-on effect I was investigating.
This issue occurs on clusters with Connect disabled as well.
The Connect issues described on that page are transient anyway; they talk about a short period of unavailability, which I don't really mind in pre-production environments or during a maintenance window, for example. The problem I'm seeing is permanent until a human intervenes (to restart the agent pods).
The instructions in the upgrade docs you linked will specifically trigger this problem:
- Set global.image in your values.yaml to the desired version:
You can also trigger the problem without making any Helm changes, simply by triggering a rollout of the server and agent pods at the same time (e.g. `kubectl rollout restart ds/consul-consul-client sts/consul-consul-server`).
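For example, roughly (the `consul` namespace is an assumption; the resource names are the defaults used above):

```sh
# Restart servers and client agents at the same time to trigger the failure.
kubectl -n consul rollout restart \
  statefulset/consul-consul-server \
  daemonset/consul-consul-client

# Once the rollout settles, the agents start logging the registration error.
kubectl -n consul logs daemonset/consul-consul-client --tail=50 | grep "Syncing node info failed"
```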