
Joining cluster with duplicate Node Id causes gossip failures

Open markan opened this issue 2 years ago • 4 comments

Overview of the Issue

Consul allows a node to partially join the cluster with a duplicate Node Id. This puts the cluster into a somewhat indeterminate state where the node catalog may be unstable and consul agents report failures joining the cluster. This can happen when restoring a node from a snapshot while the original node is still alive, or via any mechanism that causes the Node Id to be duplicated.

In brief, the problem appears to be an abstraction issue. Joining the cluster is handled at the Serf/memberlist level, which has no concept of a Node Id (it is just opaque metadata there). Consul requires and attempts to enforce the uniqueness of Node Ids, but it cannot block the join before the duplicate ID has already been injected into gossip.

This leads to inconsistency between the view at the Serf layer (everything is fully joined up and gossiping) and the view at the Consul layer (the duplicate Node Ids cause trouble with the catalog).

During this time the cluster is somewhat functional (non-duplicated nodes remain part of the cluster, and their catalog entries, etc. appear correct). The cluster seems to recover smoothly when one of the duplicates is removed.
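For reference, that removal step can be scripted. The following is a minimal sketch using the official Go API client (github.com/hashicorp/consul/api), assuming the agent is reachable at the default address and that the duplicate to drop is the dup-client node from the reproduction below:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local agent (default address 127.0.0.1:8500 assumed).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Ask the cluster to force-leave the duplicate member; "dup-client"
	// matches the reproduction below, substitute the real node name.
	if err := client.Agent().ForceLeave("dup-client"); err != nil {
		log.Fatal(err)
	}

	// Also clear any lingering catalog entry for that node.
	if _, err := client.Catalog().Deregister(&api.CatalogDeregistration{
		Node:       "dup-client",
		Datacenter: "dc1",
	}, nil); err != nil {
		log.Fatal(err)
	}
}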

This is abstracted from a customer issue: consul-dup-node-id

Details
  • All agents complain of duplicate node ids, with messages like: Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with this agent's ID or Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with member 'consul-client'

  • New clients may report failure to join.

  • The members view contains multiple nodes with the same id (see the consul members output under Reproduction Steps below).

  • The consul node catalog shows only one of the duplicated nodes (though which one may flip-flop over time).

Reproduction Steps

Using the following docker-compose.yml file:

version: '3.7'

services:
  
  consul-server:
    image: hashicorp/consul:1.12.0
    container_name: consul-server
    networks:
      - consul
    ports:
      - "8500:8500"
    command: "agent -server -bootstrap -node consul-server -client 0.0.0.0"

  consul-client:
    image: hashicorp/consul:1.12.0
    container_name: consul-client
    networks:
      - consul
    command: "agent -node-id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593 -node consul-client -retry-join consul-server"

  dup-client:
    image: hashicorp/consul:1.12.0
    container_name: dup-client
    networks:
      - consul
    command: "agent -node-id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593 -node dup-client -retry-join consul-server"

  new-client:
    image: hashicorp/consul:1.12.0
    container_name: new-client
    networks:
      - consul
    command: "agent -node new-client -retry-join consul-server"

networks:
  consul:
    driver: bridge

Run docker-compose up -d consul-{server,client} to bring up the first two nodes. It's helpful to run docker-compose logs -f in another window.

Run docker-compose up -d dup-client to bring up the conflicting client. Observe all the agents log duplicate errors.

Run docker-compose up -d new-client to bring up a new client. It will also log an error about duplicate nodes in the cluster.

The cluster is now in a strange state. A consul members -detailed on the server will show all the nodes:

consul-server  172.27.0.3:8301  alive   acls=0,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=dc1,ft_fs=1,ft_si=1,id=86ffd7f1-275b-5fa8-d11b-bb5a66d3a60b,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-client  172.27.0.2:8301  alive   ap=default,build=1.12.0:09a8cdb4,dc=dc1,id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
dup-client     172.27.0.4:8301  alive   ap=default,build=1.12.0:09a8cdb4,dc=dc1,id=23eb77d9-7114-4ed8-8502-fb9ba6ec7593,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
new-client     172.27.0.5:8301  alive   ap=default,build=1.12.0:09a8cdb4,dc=dc1,id=f13aa89a-ad2c-ed0d-a0ec-6c9028cb5cef,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2

Note the duplicate id in the meta field.

But consul-client and dup-client will each omit the other from their own members view.
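To confirm the divergent views programmatically, a small diagnostic along these lines can be run against each agent. This is a sketch using the official Go API client, not part of Consul's tooling; it groups members by the id tag seen in the -detailed output above:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local agent (default address assumed).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LAN members as this agent sees them; the node ID travels as the
	// "id" Serf tag, visible in the consul members -detailed output.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	// Group member names by node ID and report any ID claimed twice.
	byID := map[string][]string{}
	for _, m := range members {
		byID[m.Tags["id"]] = append(byID[m.Tags["id"]], m.Name)
	}
	for id, names := range byID {
		if id != "" && len(names) > 1 {
			fmt.Printf("duplicate node ID %s claimed by %v\n", id, names)
		}
	}
}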

A consul catalog nodes -detailed will list either the consul-client or the dup-client, but not both, and may shift from one to the other periodically.

Node           ID                                    Address     DC   TaggedAddresses                                                           Meta
consul-server  86ffd7f1-275b-5fa8-d11b-bb5a66d3a60b  172.27.0.3  dc1  lan=172.27.0.3, lan_ipv4=172.27.0.3, wan=172.27.0.3, wan_ipv4=172.27.0.3  consul-network-segment=
dup-client     23eb77d9-7114-4ed8-8502-fb9ba6ec7593  172.27.0.4  dc1                                                                            
new-client     f13aa89a-ad2c-ed0d-a0ec-6c9028cb5cef  172.27.0.5  dc1  lan=172.27.0.5, lan_ipv4=172.27.0.5, wan=172.27.0.5, wan_ipv4=172.27.0.5  consul-network-segment=

Consul info for both Client and Server

This can be reproduced in Consul 1.9.x and 1.12.0

Operating system and Environment details

The repro here was done in docker, but the example came from a live environment.

Log Fragments

consul-server    | 2022-05-11T02:39:01.855Z [ERROR] agent.server.memberlist.lan: memberlist: Failed push/pull merge: Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with member 'consul-client' from=172.27.0.5:37878                                                               
new-client       | 2022-05-11T02:39:01.855Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred:                                  
new-client       |      * Failed to join 172.27.0.3:8301: Member 'consul-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with member 'dup-client'"
new-client       | 2022-05-11T02:39:01.855Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=<nil>                      
consul-client    | 2022-05-11T02:39:10.196Z [WARN]  agent.client.memberlist.lan: memberlist: ignoring alive message for 'dup-client': Member 'dup-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with this agent's ID                                                                           
dup-client       | 2022-05-11T02:39:10.751Z [INFO]  agent: Synced node info
dup-client       | 2022-05-11T02:39:13.823Z [INFO]  agent: (LAN) joining: lan_addresses=[consul-server]
dup-client       | 2022-05-11T02:39:13.824Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred:
dup-client       |      * Failed to join 172.27.0.3:8301: Member 'consul-client' has conflicting node ID '23eb77d9-7114-4ed8-8502-fb9ba6ec7593' with this agent's ID

markan · May 11 '22 20:05

A bit of analysis for the above.

Fixing this will require some thinking. The problem is that we don't have duplicate node names, we have duplicate node ids. Node Id isn't part of the Serf/memberlist model; it's implemented as metadata and is special only to Consul. The actual check for duplicate ids is done via delegation to Consul through NotifyMerge, which isn't on all of the possible paths that could introduce duplicate node ids, and we don't seem to expose any equivalent delegation points to NotifyMerge in the push/pull implementation in memberlist.
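For context, the delegation point referred to here is memberlist's MergeDelegate. The following is a simplified sketch of the kind of duplicate-ID check that hook allows, not Consul's actual implementation; the dupIDCheck type and nodeIDFromMeta helper are invented for illustration and the tag decoding is stubbed out:

package sketch

import (
	"fmt"

	"github.com/hashicorp/memberlist"
)

// dupIDCheck illustrates the shape of a NotifyMerge-based duplicate-ID
// check. memberlist hands the merge delegate the remote node list before a
// merge is applied, and returning an error rejects the whole merge.
type dupIDCheck struct {
	localName string
	localID   string
}

// Compile-time check that the sketch satisfies memberlist.MergeDelegate.
var _ memberlist.MergeDelegate = (*dupIDCheck)(nil)

func (d *dupIDCheck) NotifyMerge(peers []*memberlist.Node) error {
	for _, n := range peers {
		// Consul carries the node ID in the Serf tags, which memberlist
		// only sees as opaque Meta bytes; nodeIDFromMeta stands in for
		// Serf's tag decoding and is stubbed out here.
		id := nodeIDFromMeta(n.Meta)
		if id != "" && id == d.localID && n.Name != d.localName {
			return fmt.Errorf("member %q has conflicting node ID %q with this agent's ID", n.Name, id)
		}
	}
	return nil
}

// nodeIDFromMeta is a placeholder for decoding the "id" tag from the
// Serf-encoded member metadata.
func nodeIDFromMeta(meta []byte) string { return "" }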

Any change to prevent this could have subtle cascading effects. The simplest solution would extend Serf to allow Consul to block joins of duplicate Ids. This would require an API change to memberlist and Serf, adding a new delegate to filter that part of the join process. It would probably need to allow individual nodes to be rejected, as simply rejecting the whole merge would make things worse, blocking all joins at a low level and leading to more splits. It might also slow some recovery scenarios where a replacement node is being added, since the dead node could linger until it is confirmed dead. More sophisticated implementations are possible, but they add complexity and risk.
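To make that shape concrete, a hypothetical per-node filter delegate might look like the following. The interface name and method are invented here for illustration and do not exist in memberlist or Serf today:

package sketch

import "github.com/hashicorp/memberlist"

// JoinFilterDelegate is a hypothetical extension point. Unlike
// MergeDelegate, which can only accept or reject an entire merge, it would
// be consulted once per remote node while merging state during a join, so
// that only the conflicting node is dropped.
type JoinFilterDelegate interface {
	// FilterJoin returns a non-nil error to reject just this peer; the
	// rest of the merge proceeds.
	FilterJoin(peer *memberlist.Node) error
}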

markan · May 11 '22 20:05

Note: this is a variant of https://github.com/hashicorp/consul/issues/7396.

markan · May 13 '22 00:05

I have the same issue with Consul 1.12.2 using the Helm chart. After adding new nodes to the Kubernetes cluster and increasing the Consul HA nodes from 1 to 3, I started receiving the error:

failed inserting node: Error while renaming Node ID: "2bda35f9-4b74-84cf-d0c0-fdbdf1a40203": Node name xxx is reserved by node 26979430-a652-e34c-ec76-3b30f2162ced with name yyy

Neither deregistering through the API (like here) nor deleting the node-id (like here) worked.

bobertrublik · Jun 30 '22 07:06

We have the same issue with Consul 1.10.6. After rebalancing we were faced with the error: "failed inserting node: Error while renaming Node ID: "1f0a9dbd-655c-e7d8-c934-6f4d5be42491": Node name nodeName is reserved by node c11fbfa2-3144-01fb-44b1-b4ded071e4a7 with name nodeName"

The issue was partially resolved by manually updating the node ID, but without manual steps the problem remains reproducible.

FilatovM · Aug 01 '22 14:08

We have the same issue. Service A couldn't join the mesh because Service B had duplicate node names due to a configuration bug. This took several hours to diagnose and 30 hours to roll out a fix.

The scope of the issue wasn't predictable as we noticed other services join the mesh without issue.

Q1: Can Consul fail gracefully and let services join despite the duplicates?
Q2: Is the WARN log classification below accurate, given that the issue impacts the function of the Consul agent?

Service A = 1.10.2
Server Cluster = 1.12.0

Client A logs:

2022-08-18T20:18:58.115Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0 error="3 errors occurred:
        * Failed to join 96.103.24.225: Member 'cl-apigw-data-0170eb9971fd07944' has conflicting node ID '84b07240-926c-4942-d561-36b9f63d65c4' with member 'cl-apigw-data-0397aa8383528fa54'
        * Failed to join 96.103.24.45: Member 'cl-apigw-data-0bcaee7f7d8a0dda6' has conflicting node ID '84b07240-926c-4942-d561-36b9f63d65c4' with member 'cl-apigw-data-033410f1b5cf5ab6e'
        * Failed to join 96.103.24.133: Member 'cl-apigw-data-00ef40e9ab9bb9e01' has conflicting node ID '84b07240-926c-4942-d561-36b9f63d65c4' with member 'cl-apigw-data-09ad7db552f2855aa'"

2022-08-18T20:18:58.116Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=<nil>

# repeats indefinitely

GordonMcKinney · Aug 19 '22 18:08