Joining cluster with duplicate Node Id causes gossip failures
Overview of the Issue
Consul allows a node to partially join the cluster with a duplicate Node Id. This puts the cluster into a somewhat indeterminate state where the node catalog may be unstable and Consul agents report failures joining the cluster. This can happen when restoring a node from a snapshot while the original node is still alive, or via any mechanism that causes the Node Id to be duplicated.
In brief, the problem appears to be an abstraction issue. Joining the cluster is handled at the Serf/memberlist level, which has no concept of a Node Id (it's just opaque metadata). Consul requires and attempts to enforce the uniqueness of Node Ids, but it has no way to block the join; by the time it detects the conflict, the duplicate ID has already been injected into gossip (see the sketch at the end of this section).
This leads to an inconsistency between the view at the Serf layer (everything is fully joined up and gossiping) and the view at the Consul layer (duplicate Node Ids cause trouble with the catalog).
During this time the cluster is somewhat functional (non-duplicated nodes appear to be part of the cluster, and their catalog entries etc. appear correct). The cluster seems to recover smoothly when one of the duplicates is removed.
This is abstracted from a customer issue: consul-dup-node-id
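To make the abstraction gap concrete, here is a minimal, hypothetical sketch (not Consul's actual wiring) of joining a cluster through Serf directly. From Serf's point of view the node ID is just one more entry in an opaque tag map, which is the same id=... tag visible in the consul members -detailed output further down.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
	"github.com/hashicorp/serf/serf"
)

func main() {
	conf := serf.DefaultConfig()
	conf.NodeName = "dup-client"
	conf.MemberlistConfig = memberlist.DefaultLANConfig()
	conf.Tags = map[string]string{
		// Nothing at the Serf/memberlist level requires this value to be
		// unique; uniqueness is a Consul-level concern.
		"id": "23eb77d9-7114-4ed8-8502-fb9ba6ec7593",
	}

	s, err := serf.Create(conf)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()

	// Joining only needs addresses; the "id" tag is carried along as metadata
	// and gossiped to the rest of the cluster.
	if _, err := s.Join([]string{"consul-server"}, false); err != nil {
		fmt.Println("join failed:", err)
	}
}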
Details
- All agents complain of duplicate node ids, with messages like:
  Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with this agent's ID
  or
  Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with member 'consul-client'
  New clients may report failure to join.
- The members view contains multiple nodes with the same id (see the consul members -detailed output below).
- The consul node catalog shows one of the duplicated nodes (but this may flip-flop over time).
Reproduction Steps
Using the following docker-compose.yml file:
version: '3.7'
services:
  consul-server:
    image: hashicorp/consul:1.12.0
    container_name: consul-server
    networks:
      - consul
    ports:
      - "8500:8500"
    command: "agent -server -bootstrap -node consul-server -client 0.0.0.0"
  consul-client:
    image: hashicorp/consul:1.12.0
    container_name: consul-client
    networks:
      - consul
    command: "agent -node-id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593 -node consul-client -retry-join consul-server"
  dup-client:
    image: hashicorp/consul:1.12.0
    container_name: dup-client
    networks:
      - consul
    command: "agent -node-id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593 -node dup-client -retry-join consul-server"
  new-client:
    image: hashicorp/consul:1.12.0
    container_name: new-client
    networks:
      - consul
    command: "agent -node new-client -retry-join consul-server"
networks:
  consul:
    driver: bridge
Run docker-compose up -d consul-{server,client} to bring up the first two nodes. It's helpful to run docker-compose logs -f in another window.
Run docker-compose up -d dup-client to bring up the conflicting client. Observe that all the agents log duplicate-node-ID errors.
Run docker-compose up -d new-client to bring up a new client. It will also log an error about duplicate nodes in the cluster.
The cluster is now in a strange state.
Running consul members -detailed on the server will show all the nodes:
consul-server 172.27.0.3:8301 alive acls=0,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=dc1,ft_fs=1,ft_si=1,id=86ffd7f1-275b-5fa8-d11b-bb5a66d3a60b,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-client 172.27.0.2:8301 alive ap=default,build=1.12.0:09a8cdb4,dc=dc1,id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
dup-client 172.27.0.4:8301 alive ap=default,build=1.12.0:09a8cdb4,dc=dc1,id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
new-client 172.27.0.5:8301 alive ap=default,build=1.12.0:09a8cdb4,dc=dc1,id=f13aa89a-ad2c-ed0d-a0ec-6c9028cb5cef,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
Note the duplicate id in the meta field.
However, consul-client and dup-client will each omit the other from their members view.
A consul catalog nodes -detailed will list either consul-client or dup-client, but not both, and may shift from one to the other periodically.
Node ID Address DC TaggedAddresses Meta
consul-server 86ffd7f1-275b-5fa8-d11b-bb5a66d3a60b 172.27.0.3 dc1 lan=172.27.0.3, lan_ipv4=172.27.0.3, wan=172.27.0.3, wan_ipv4=172.27.0.3 consul-network-segment=
dup-client 23eb77d9-7114-4ed8-8502-fb9ba6ec7593 172.27.0.4 dc1
new-client f13aa89a-ad2c-ed0d-a0ec-6c9028cb5cef 172.27.0.5 dc1 lan=172.27.0.5, lan_ipv4=172.27.0.5, wan=172.27.0.5, wan_ipv4=172.27.0.5 consul-network-segment=
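To watch the flip-flop without re-running the CLI by hand, a small polling loop against the HTTP API works. This is a rough sketch using the official Go API client; the node ID is the one from this repro, and it assumes CONSUL_HTTP_ADDR points at consul-server.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Default config honours CONSUL_HTTP_ADDR, falling back to 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	const dupID = "23eb77d9-7114-4ed8-8502-fb9ba6ec7593"
	for {
		nodes, _, err := client.Catalog().Nodes(nil)
		if err != nil {
			log.Println("catalog query failed:", err)
		} else {
			for _, n := range nodes {
				if n.ID == dupID {
					// Prints consul-client or dup-client depending on which
					// currently holds the catalog entry.
					fmt.Printf("%s currently owns node ID %s\n", n.Node, n.ID)
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
}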
Consul info for both Client and Server
This can be reproduced in Consul 1.9.x and 1.12.0
Operating system and Environment details
The repro here was done in Docker, but the example came from a live environment.
Log Fragments
consul-server | 2022-05-11T02:39:01.855Z [ERROR] agent.server.memberlist.lan: memberlist: Failed push/pull merge: Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with member 'consul-client' from=172.27.0.5:37878
new-client | 2022-05-11T02:39:01.855Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred:
new-client | * Failed to join 172.27.0.3:8301: Member 'consul-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with member 'dup-client'"
new-client | 2022-05-11T02:39:01.855Z [WARN] agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=<nil>
consul-client | 2022-05-11T02:39:10.196Z [WARN] agent.client.memberlist.lan: memberlist: ignoring alive message for 'dup-client': Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with this agent's ID
dup-client | 2022-05-11T02:39:10.751Z [INFO] agent: Synced node info
dup-client | 2022-05-11T02:39:13.823Z [INFO] agent: (LAN) joining: lan_addresses=[consul-server]
dup-client | 2022-05-11T02:39:13.824Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred:
dup-client | * Failed to join 172.27.0.3:8301: Member 'consul-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with this agent's ID
A bit of analysis for the above.
Fixing this will require some thought. The problem is that we don't have duplicate node names, we have duplicate node ids. Node Id isn't part of the Serf/memberlist model; it's implemented as metadata and is special only to Consul. The actual check for duplicate ids is done via delegation to Consul through NotifyMerge, which isn't on all the possible paths that could introduce duplicate node ids. We don't seem to expose any equivalent delegation points to NotifyMerge in the push/pull implementation in memberlist.
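For context, the delegation point mentioned above is Serf's MergeDelegate hook (NotifyMerge). Below is a simplified sketch, not Consul's actual code, of how a LAN merge delegate can detect a conflicting "id" tag; note that returning an error rejects the entire merge, and the hook only helps on the paths where Serf actually invokes it.

package merge

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

// lanMergeDelegate is a simplified stand-in for Consul's merge delegate.
// Serf calls NotifyMerge before applying a merge; a non-nil error rejects
// the whole batch of members.
type lanMergeDelegate struct {
	selfID   string
	selfName string
	seen     map[string]string // node ID -> node name seen so far (illustrative)
}

func (d *lanMergeDelegate) NotifyMerge(members []*serf.Member) error {
	for _, m := range members {
		id := m.Tags["id"]
		if id == "" {
			continue
		}
		// Conflict with this agent's own ID.
		if id == d.selfID && m.Name != d.selfName {
			return fmt.Errorf("Member '%s' has conflicting node ID '%s' with this agent's ID", m.Name, id)
		}
		// Conflict between two other members.
		if prev, ok := d.seen[id]; ok && prev != m.Name {
			return fmt.Errorf("Member '%s' has conflicting node ID '%s' with member '%s'", m.Name, id, prev)
		}
		d.seen[id] = m.Name
	}
	return nil
}

// The delegate is wired in via the Serf config, roughly:
//
//	conf := serf.DefaultConfig()
//	conf.Merge = &lanMergeDelegate{selfID: myID, selfName: myName, seen: map[string]string{}}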
Any change to prevent this could have subtle cascading effects. The simplest solution would be to extend Serf so that Consul can block joins with duplicate Ids. This would require an API change to memberlist and Serf, adding a new delegate to filter that part of the join process. It would probably need to allow individual nodes to be rejected, since simply rejecting the whole merge would make things worse, blocking all joins at a low level and leading to more splits. It could also slow some recovery scenarios where a replacement node is being added, as the dead node might linger until it was confirmed dead. More sophisticated implementations are possible, but they add complexity and risk.
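To make that concrete, one hypothetical shape for such a delegate is sketched below. Nothing here exists in memberlist or Serf today; the interface name and method are invented purely for illustration.

package filter

import "github.com/hashicorp/serf/serf"

// JoinFilterDelegate is a hypothetical extension point; it does not exist in
// Serf or memberlist today. The idea is that it would be consulted once per
// candidate member during a join or push/pull exchange, before the member is
// admitted to gossip.
type JoinFilterDelegate interface {
	// AcceptMember returns a non-nil error to reject just this member;
	// the rest of the exchange proceeds, avoiding the "reject the whole
	// merge" behaviour that would cause more splits.
	AcceptMember(m *serf.Member) error
}

Consul could then implement AcceptMember with the same node-ID comparison it already performs in NotifyMerge, dropping only the conflicting member instead of failing the join outright.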
Note: this is a variant of https://github.com/hashicorp/consul/issues/7396.
I have the same issue with consul 1.12.2 using the Helm chart. After adding new nodes to the Kubernetes cluster and increasing the consul HA nodes from 1 to 3, I started receiving the error:
failed inserting node: Error while renaming Node ID: "2bda35f9-4b74-84cf-d0c0-fdbdf1a40203": Node name xxx is reserved by node 26979430-a652-e34c-ec76-3b30f2162ced with name yyy
Neither deregistering through the API (as suggested here) nor deleting the node-id (as suggested here) worked.
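For reference, the "deregistering through the API" step mentioned above is usually a catalog deregistration. The sketch below uses the official Go API client with placeholder node name and datacenter; as noted, it did not clear the duplicate-ID state in this case.

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Force-deregister the node from the catalog. "dup-client" and "dc1" are
	// placeholders for the affected node and datacenter.
	if _, err := client.Catalog().Deregister(&api.CatalogDeregistration{
		Datacenter: "dc1",
		Node:       "dup-client",
	}, nil); err != nil {
		log.Fatal(err)
	}
}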
We have the same issue with Consul 1.10.6. After rebalancing we faced this error:
"failed inserting node: Error while renaming Node ID: "1f0a9dbd-655c-e7d8-c934-6f4d5be42491": Node name nodeName is reserved by node c11fbfa2-3144-01fb-44b1-b4ded071e4a7 with name nodeName"
The issue was partially resolved by manually updating the node ID, but without manual steps the problem is reproducible.
We have the same issue. Service A couldn't join the mesh because Service B had duplicate node names due to a configuration bug. This took several hours to diagnose and 30 hours to roll out a fix.
The scope of the issue wasn't predictable, as we noticed other services join the mesh without issue.
Q1: Can Consul fail gracefully and let services join despite the duplicates?
Q2: Is the WARN log classification below accurate, given that the issue impacts the function of the Consul agent?
Service A = 1.10.2, Server Cluster = 1.12.0
Client A logs:
2022-08-18T20:18:58.115Z [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="3 errors occurred:
* Failed to join 96.103.24.225: Member 'cl-apigw-data-0170eb9971fd07944' has conflicting node ID '84b07240-926c-4942-d561-36b9f63d65c4' with member 'cl-apigw-data-0397aa8383528fa54'
* Failed to join 96.103.24.45: Member 'cl-apigw-data-0bcaee7f7d8a0dda6' has conflicting node ID '84b07240-926c-4942-d561-36b9f63d65c4' with member 'cl-apigw-data-033410f1b5cf5ab6e'
* Failed to join 96.103.24.133: Member 'cl-apigw-data-00ef40e9ab9bb9e01' has conflicting node ID '84b07240-926c-4942-d561-36b9f63d65c4' with member 'cl-apigw-data-09ad7db552f2855aa'"
2022-08-18T20:18:58.116Z [WARN] agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=<nil>
# repeats indefinitely