[CURATOR-188] Cannot determine the leader if zookeeper leader fails
Hi,
I'm trying to upgrade the curator framework from 2.6.0 to 2.7.1, but I'm having some problems.
In 2.6.0 almost everything works fine, except for ServiceDiscovery.updateService(), which is already fixed in 2.7.1.
In the 2.7.1 version, when I kill the zookeeper leader, my path for leader election becomes inconsistent.
For instance, I have three apps registered in the leader path (/com/myapp/leader/):
[_c_85089ba7-0819-40a2-90b5-640bcb5e9e68-lock-0000000003, _c_070619f6-539e-4784-8068-bdc66d2a25bc-lock-0000000005, _c_54a126d3-31e8-464f-9216-5e0ad23fad1b-lock-0000000004]
After killing the ZooKeeper leader, what I see in /com/myapp/leader/ is:
[_c_648d5311-a59c-4bc4-bf32-c0605dea9b6a-lock-0000000007, _c_85089ba7-0819-40a2-90b5-640bcb5e9e68-lock-0000000003, _c_f51f9660-3cbf-4ba8-8dba-c1e04ca14a93-lock-0000000008, _c_49696b77-e45a-40b6-8feb-96623c67fd85-lock-0000000006]
Sometimes I get even more nodes (five or six).
I'm aware that Curator removes and re-adds the nodes when a ZooKeeper node fails, but it seems that the previous nodes are not being removed correctly.
Is that the expected behavior?
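For reference, the election is set up roughly like this; a minimal sketch assuming the LeaderSelector recipe (the "-lock-" child names come from it), with the connect string and listener body as placeholders rather than our real code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Each participant creates an ephemeral sequential "-lock-" node under the
        // election path; the node with the lowest sequence number is the leader.
        LeaderSelector selector = new LeaderSelector(client, "/com/myapp/leader",
                new LeaderSelectorListenerAdapter() {
                    @Override
                    public void takeLeadership(CuratorFramework client) throws Exception {
                        // Leadership is held until this method returns.
                        Thread.currentThread().join();
                    }
                });
        selector.autoRequeue();   // rejoin the election after losing leadership
        selector.start();
    }
}
```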
Originally reported by raanogueira, imported from: Cannot determine the leader if zookeeper leader fails
- status: Open
- priority: Major
- resolution: Unresolved
- imported: 2025-01-21
zerd:
We are also experiencing similar issues. After a network problem, no leader is elected. We are using the LeaderSelector recipe, and we get the "reconnected" event, yet there is no leader, because there is still a hanging lock:
[zk: localhost:20101(CONNECTED) 0] ls /app/leader/SR
[_c_eadb5f95-ea3c-4bf5-b7b1-c089df38a2bd-lock-0000000746, _c_3c9fd125-e3ce-4ca3-919f-0f5968c2c12c-lock-0000000745, _c_87358962-171c-4ce2-a34b-92038b400e8d-lock-0000000744]
[zk: localhost:20101(CONNECTED) 1] get /app/leader/SR/_c_eadb5f95-ea3c-4bf5-b7b1-c089df38a2bd-lock-0000000746
10.0.0.148
cZxid = 0x2900012cec
ctime = Sun Mar 29 03:56:17 CEST 2015
mZxid = 0x2900012cec
mtime = Sun Mar 29 03:56:17 CEST 2015
pZxid = 0x2900012cec
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x34c5d99eea20001
dataLength = 10
numChildren = 0
[zk: localhost:20101(CONNECTED) 2] get /app/leader/SR/_c_3c9fd125-e3ce-4ca3-919f-0f5968c2c12c-lock-0000000745
10.0.0.151
cZxid = 0x290000256c
ctime = Sat Mar 28 05:19:43 CET 2015
mZxid = 0x290000256c
mtime = Sat Mar 28 05:19:43 CET 2015
pZxid = 0x290000256c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x14c5d99f0850000
dataLength = 10
numChildren = 0
[zk: localhost:20101(CONNECTED) 3] get /app/leader/SR/_c_87358962-171c-4ce2-a34b-92038b400e8d-lock-0000000744
10.0.0.148
cZxid = 0x29000007bb
ctime = Sat Mar 28 01:24:50 CET 2015
mZxid = 0x29000007bb
mtime = Sat Mar 28 01:24:50 CET 2015
pZxid = 0x29000007bb
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x34c5d99eea20001
dataLength = 10
numChildren = 0
When we stop the node having two locks (10.0.0.148), both locks disappear and the other node is elected leader.
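For context, this is roughly how we observe the connection state; a minimal sketch assuming a started CuratorFramework client and a LeaderSelector like the one sketched in the description above (the "reconnected" event corresponds to ConnectionState.RECONNECTED; names are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class ConnectionWatcher {
    // Log SUSPENDED / LOST / RECONNECTED transitions so leadership loss and
    // recovery are visible in the application logs.
    public static void watch(CuratorFramework client, final LeaderSelector selector) {
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                System.out.println("Connection state changed to " + newState);
                if (newState == ConnectionState.RECONNECTED) {
                    // Rejoin the election after the session comes back; with
                    // autoRequeue() this should also happen once leadership is lost.
                    selector.requeue();
                }
            }
        });
    }
}
```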
curator-framework/2.11.0
Seeing similar behavior. If we stop the LeaderSelector on all 3 systems, the path still exists. When we start one instance, it gets the lock exception "You do not own the lock", and there is no leader.
I did an rmr on the parent path, started the instance, and then it was elected leader and things progressed.
Still researching how it got into this state.
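For anyone hitting the same state, a rough Curator equivalent of the manual rmr workaround; a sketch only, with the connect string and election path as placeholders since the actual parent path isn't shown above:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ClearLeaderPath {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Delete the stale election path (placeholder path) and its children
            // before restarting the LeaderSelector instances.
            if (client.checkExists().forPath("/app/leader/SR") != null) {
                client.delete().deletingChildrenIfNeeded().forPath("/app/leader/SR");
            }
        } finally {
            client.close();
        }
    }
}
```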