[CURATOR-188] Cannot determine the leader if zookeeper leader fails
Hi,
I'm trying to upgrade the curator framework from 2.6.0 to 2.7.1, but I'm having some problems.
In 2.6.0 almost everything works fine, except for ServiceDiscovery.updateService(), which is already fixed in 2.7.1.
In the 2.7.1 version, when I kill the zookeeper leader, my path for leader election becomes inconsistent.
For instance, I have three apps registered in the leader path (/com/myapp/leader/):
[_c_85089ba7-0819-40a2-90b5-640bcb5e9e68-lock-0000000003, _c_070619f6-539e-4784-8068-bdc66d2a25bc-lock-0000000005, _c_54a126d3-31e8-464f-9216-5e0ad23fad1b-lock-0000000004]
After killing the ZooKeeper leader, what I see in /com/myapp/leader/ is:
[_c_648d5311-a59c-4bc4-bf32-c0605dea9b6a-lock-0000000007, _c_85089ba7-0819-40a2-90b5-640bcb5e9e68-lock-0000000003, _c_f51f9660-3cbf-4ba8-8dba-c1e04ca14a93-lock-0000000008, _c_49696b77-e45a-40b6-8feb-96623c67fd85-lock-0000000006]
Sometimes I get even more nodes (five or six).
I'm aware that Curator removes and re-adds the nodes when a ZooKeeper node fails, but it seems that the previous nodes are not being removed correctly.
Is that the expected behavior?
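For reference, the election is set up roughly like this; a minimal sketch assuming the LeaderSelector recipe (the "-lock-" child names come from it), with the connect string and listener body as placeholders rather than our real code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Each participant creates an ephemeral sequential "-lock-" node under the
        // election path; the node with the lowest sequence number is the leader.
        LeaderSelector selector = new LeaderSelector(client, "/com/myapp/leader",
                new LeaderSelectorListenerAdapter() {
                    @Override
                    public void takeLeadership(CuratorFramework client) throws Exception {
                        // Leadership is held until this method returns.
                        Thread.currentThread().join();
                    }
                });
        selector.autoRequeue();   // rejoin the election after losing leadership
        selector.start();
    }
}
```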
Originally reported by raanogueira, imported from: Cannot determine the leader if zookeeper leader fails
- status: Open
- priority: Major
- resolution: Unresolved
- imported: 2025-01-21
zerd:
We are also experiencing similar issues. After a network problem, no leader is elected. We are using the LeaderSelector recipe, and we get the "reconnected" event, yet there is no leader, because there is still a hanging lock:
[zk: localhost:20101(CONNECTED) 0] ls /app/leader/SR
[_c_eadb5f95-ea3c-4bf5-b7b1-c089df38a2bd-lock-0000000746, _c_3c9fd125-e3ce-4ca3-919f-0f5968c2c12c-lock-0000000745, _c_87358962-171c-4ce2-a34b-92038b400e8d-lock-0000000744]
[zk: localhost:20101(CONNECTED) 1] get /app/leader/SR/_c_eadb5f95-ea3c-4bf5-b7b1-c089df38a2bd-lock-0000000746
10.0.0.148
cZxid = 0x2900012cec
ctime = Sun Mar 29 03:56:17 CEST 2015
mZxid = 0x2900012cec
mtime = Sun Mar 29 03:56:17 CEST 2015
pZxid = 0x2900012cec
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x34c5d99eea20001
dataLength = 10
numChildren = 0
[zk: localhost:20101(CONNECTED) 2] get /app/leader/SR/_c_3c9fd125-e3ce-4ca3-919f-0f5968c2c12c-lock-0000000745
10.0.0.151
cZxid = 0x290000256c
ctime = Sat Mar 28 05:19:43 CET 2015
mZxid = 0x290000256c
mtime = Sat Mar 28 05:19:43 CET 2015
pZxid = 0x290000256c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x14c5d99f0850000
dataLength = 10
numChildren = 0
[zk: localhost:20101(CONNECTED) 3] get /app/leader/SR/_c_87358962-171c-4ce2-a34b-92038b400e8d-lock-0000000744
10.0.0.148
cZxid = 0x29000007bb
ctime = Sat Mar 28 01:24:50 CET 2015
mZxid = 0x29000007bb
mtime = Sat Mar 28 01:24:50 CET 2015
pZxid = 0x29000007bb
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x34c5d99eea20001
dataLength = 10
numChildren = 0
When we stop the node having two locks (10.0.0.148), both locks disappear and the other node is elected leader.
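For context, this is roughly how we observe the connection state; a minimal sketch assuming a started CuratorFramework client and a LeaderSelector like the one sketched in the description above (the "reconnected" event corresponds to ConnectionState.RECONNECTED; names are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class ConnectionWatcher {
    // Log SUSPENDED / LOST / RECONNECTED transitions so leadership loss and
    // recovery are visible in the application logs.
    public static void watch(CuratorFramework client, final LeaderSelector selector) {
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                System.out.println("Connection state changed to " + newState);
                if (newState == ConnectionState.RECONNECTED) {
                    // Rejoin the election after the session comes back; with
                    // autoRequeue() this should also happen once leadership is lost.
                    selector.requeue();
                }
            }
        });
    }
}
```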
curator-framework/2.11.0
Seeing similar behavior. If we stop the LeaderSelector on all 3 systems, the path still exists. When we start one instance, it gets the lock exception "You do not own the lock", and there is no leader.
I did an rmr on the parent path, started the instance, and then it was elected leader and things progressed.
Still researching how it got into this state.
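For anyone hitting the same state, a rough Curator equivalent of the manual rmr workaround; a sketch only, with the connect string and election path as placeholders since the actual parent path isn't shown above:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ClearLeaderPath {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Delete the stale election path (placeholder path) and its children
            // before restarting the LeaderSelector instances.
            if (client.checkExists().forPath("/app/leader/SR") != null) {
                client.delete().deletingChildrenIfNeeded().forPath("/app/leader/SR");
            }
        } finally {
            client.close();
        }
    }
}
```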