curator icon indicating copy to clipboard operation
curator copied to clipboard

[CURATOR-346] LeaderSelector getParticipants reports incorrect number of nodes after connection is suspended

Open jira-importer opened this issue 9 years ago • 0 comments

This is potentially a bug in Curator 2.7.1 where LeaderSelector's getParticipants() reports one less participants than the actual number. On the node missing from the result of getParticipants(), I found something strange in the logs below: ClientCnxn reset the zookeeper connection, then the SUSPENDED state was delivered and the leadership was interrupted and the participant recreated a lock node and called Object.wait() to sleep and wait for its turn to become the leader. However, the RECONNECTED state was not delivered until ConnectionState.checkTimeouts() reset the zookeeper connection again, which also had the side effect of deleting the ephemeral lock znode created earlier by the leader selection participant. Now, we are missing the participant permanently since it is sleeping in wait() and watching only the previous lock node.

Callstack

“Curator-LeaderSelector-0" #319 daemon prio=5 os_prio=0 tid=0x00007f892408f800 nid=0x491f in Object.wait() [0x00007f88f4eb2000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.curator.framework.recipes.locks.LockInternals.internalLockLoop(LockInternals.java:307)
- locked  (a org.apache.curator.framework.recipes.locks.LockInternals)
at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:217)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:232)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:386)
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:443)
at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:64)
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:245)
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:239)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Log

2016/08/29 16:46:49.478 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1255.prod.acompany.com:12916)] [searcher] [] Client session timed out, have not heard from server in 26680ms for sessionid 0x255531b1ff4732e, closing socket connection and attempting reconnect
2016/08/29 16:46:49.796 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1215.prod.acompany.com:12916)] [searcher] [] Opening socket connection to server lva1-app1215.prod.acompany.com/10.132.48.90:12916. Will not attempt to authenticate using SASL (unknown error)
2016/08/29 16:46:49.797 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1215.prod.acompany.com:12916)] [searcher] [] Socket connection established to lva1-app1215.prod.acompany.com/10.132.48.90:12916, initiating session
2016/08/29 16:46:49.800 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1215.prod.acompany.com:12916)] [searcher] [] Session establishment complete on server lva1-app1215.prod.acompany.com/10.132.48.90:12916, sessionid = 0x255531b1ff4732e, negotiated timeout = 40000
2016/08/29 16:46:49.801 INFO [ConnectionStateManager] [SomeService STARTING-EventThread] [searcher] [] State change: SUSPENDED
2016/08/29 16:46:49.803 INFO [MinReplicasSemaphore] [Curator-LeaderSelector-0] [searcher] [] I'm no longer the leader .. currently there are 1 permits
2016/08/29 16:47:04.809 ERROR [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection timed out for connection string (lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916) and timeout (15000) / elapsed (15002)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.readValue(SharedValue.java:244) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.access$100(SharedValue.java:44) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue$1.process(SharedValue.java:61) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) ~[curator-framework-2.7.1.jar:?]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
2016/08/29 16:47:20.812 ERROR [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection timed out for connection string (lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916) and timeout (15000) / elapsed (31006)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.readValue(SharedValue.java:244) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.access$100(SharedValue.java:44) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue$1.process(SharedValue.java:61) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) ~[curator-framework-2.7.1.jar:?]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
2016/08/29 16:47:36.815 ERROR [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection timed out for connection string (lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916) and timeout (15000) / elapsed (47008)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:288) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41) ~[curator-framework-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.readValue(SharedValue.java:244) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue.access$100(SharedValue.java:44) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.recipes.shared.SharedValue$1.process(SharedValue.java:61) ~[curator-recipes-2.7.1.jar:?]
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67) ~[curator-framework-2.7.1.jar:?]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
2016/08/29 16:47:53.817 WARN [ConnectionState] [SomeService STARTING-EventThread] [searcher] [] Connection attempt unsuccessful after 64011 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
2016/08/29 16:47:53.819 INFO [ZooKeeper] [SomeService STARTING-EventThread] [searcher] [] Session: 0x255531b1ff4732e closed
2016/08/29 16:47:53.819 INFO [ZooKeeper] [SomeService STARTING-EventThread] [searcher] [] Initiating client connection, connectString=lva1-app1215.prod.acompany.com:12916,lva1-app1255.prod.acompany.com:12916,lva1-app1295.prod.acompany.com:12916,lva1-app1335.prod.acompany.com:12916,lva1-app1375.prod.acompany.com:12916 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@2eee547b
2016/08/29 16:47:53.824 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1295.prod.acompany.com:12916)] [searcher] [] Opening socket connection to server lva1-app1295.prod.acompany.com/10.132.48.218:12916. Will not attempt to authenticate using SASL (unknown error)
2016/08/29 16:47:53.824 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1295.prod.acompany.com:12916)] [searcher] [] Socket connection established to lva1-app1295.prod.acompany.com/10.132.48.218:12916, initiating session
2016/08/29 16:47:53.827 INFO [ClientCnxn] [SomeService STARTING-SendThread(lva1-app1295.prod.acompany.com:12916)] [searcher] [] Session establishment complete on server lva1-app1295.prod.acompany.com/10.132.48.218:12916, sessionid = 0x355531abd0978d1, negotiated timeout = 40000
2016/08/29 16:47:53.827 INFO [ConnectionStateManager] [SomeService STARTING-EventThread] [searcher] [] State change: RECONNECTED
2016/08/29 16:47:53.827 INFO [ClientCnxn] [SomeService STARTING-EventThread] [searcher] [] EventThread shut down

Originally reported by bwzhou, imported from: LeaderSelector getParticipants reports incorrect number of nodes after connection is suspended
  • assignee: randgalt
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer avatar Aug 30 '16 18:08 jira-importer