[CURATOR-325] Background retry falls into infinite loop of SessionExpiredException

Open jira-importer opened this issue 9 years ago • 4 comments

After a long GC pause, one longer than the ZooKeeper session timeout, the ZooKeeper cluster invalidates the session ID held by the client and waits for the client to reconnect. The client, however, treats SessionExpiredException as a retryable exception and puts the operation back on the background queue, so we get the stack trace below over and over.

12:50:54.337 [configuration-0-EventThread] DEBUG org.apache.curator.RetryLoop - Retrying operation
12:50:54.337 [configuration-0-EventThread] DEBUG org.apache.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /dynamic/apps/258741001/DEV
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304) ~[curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293) ~[curator-framework-2.10.0.jar:na]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108) ~[curator-client-2.10.0.jar:na]
at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290) [curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281) [curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:105) [curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:65) [curator-framework-2.10.0.jar:na]
at com.ctrip.flight.configuration.client.AbstractZookeeperClient.getData(AbstractZookeeperClient.java:68) [classes/:na]
at com.ctrip.flight.configuration.client.ZooKeeperConfigurationSource.getPublishNodeValue(ZooKeeperConfigurationSource.java:258) [classes/:na]
at com.ctrip.flight.configuration.client.ZooKeeperConfigurationSource.access$100(ZooKeeperConfigurationSource.java:45) [classes/:na]
at com.ctrip.flight.configuration.client.ZooKeeperConfigurationSource$1.nodeChanged(ZooKeeperConfigurationSource.java:105) [classes/:na]
at org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:310) [curator-recipes-2.10.0.jar:na]
at org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:304) [curator-recipes-2.10.0.jar:na]
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.10.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:310) [guava-19.0.jar:na]
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.recipes.cache.NodeCache.setNewData(NodeCache.java:302) [curator-recipes-2.10.0.jar:na]
at org.apache.curator.framework.recipes.cache.NodeCache.processBackgroundResult(NodeCache.java:269) [curator-recipes-2.10.0.jar:na]
at org.apache.curator.framework.recipes.cache.NodeCache.access$300(NodeCache.java:56) [curator-recipes-2.10.0.jar:na]
at org.apache.curator.framework.recipes.cache.NodeCache$3.processResult(NodeCache.java:122) [curator-recipes-2.10.0.jar:na]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [curator-framework-2.10.0.jar:na]
at org.apache.curator.framework.imps.GetDataBuilderImpl$3.processResult(GetDataBuilderImpl.java:256) [curator-framework-2.10.0.jar:na]
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:561) [zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]


Originally reported by wlongdu, imported from: Background retry falls into infinite loop of SessionExpiredException
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer, May 20 '16 05:05

randgalt:

Can you provide a test case that shows the problem? Session expiration should be treated as retryable. Internally, Curator recreates the ZooKeeper handle when the session expires, so I don't see why this is a problem.

jira-importer, May 21 '16 17:05

wlongdu:

First, let me describe the use case: I use the NodeCache provided by the recipes and handle the node-change event on the thread where Curator runs its background operations. Second, a long GC pause shouldn't happen in the normal case, but it does in extreme cases. I simulated this in debug mode: I set a breakpoint in my onNodeChanged implementation and paused there for a long time. When I stepped over, I saw the stack trace periodically.

jira-importer, May 23 '16 01:05

robiplus:

Jordan Zimmerman clive du

Hi, I seem to have hit a similar problem.

After digging into the code, I found that this problem is caused by reading data (getData or getChildren) with a `RetryForever`-like retry policy in our custom watcher implementation.

As a result, when the session is closed, the EventThread may fall into an infinite retry loop inside the custom watcher. That leaves no chance for Curator's own watcher, `ConnectionState#process`, to run handleExpiredSession and make `ClientCnxn#state` alive again, which is what is needed to break the infinite loop.

This problem can be worked around without modifying ZooKeeper or Curator:

  • don't use a retry-forever policy, so the loop only lasts for "a while"
  • or, as `PathCache` does, hand the task off to another thread after receiving the WatchedEvent
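
A minimal sketch of both workarounds, using only JDK executors as stand-ins for the ZooKeeper EventThread and a worker thread. Nothing here is Curator or ZooKeeper API: `sessionAlive`, `readData`, `readWithBoundedRetry`, and the class name are hypothetical names made up for illustration. The user watcher hands its blocking read to a worker and returns at once, so the framework's recovery task queued behind it on the same event thread still gets to run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class WatcherHandoffSketch {
    // Stand-in for client state that only the framework's own watcher can repair.
    static final AtomicBoolean sessionAlive = new AtomicBoolean(false);

    // Hypothetical read that keeps failing until the session is re-established.
    static String readData() {
        if (!sessionAlive.get()) {
            throw new IllegalStateException("session expired");
        }
        return "value";
    }

    // Workaround 1: bounded retry. Gives up after maxRetries instead of looping forever.
    static String readWithBoundedRetry(int maxRetries, long sleepMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return readData();
            } catch (Exception e) {
                last = e;
                Thread.sleep(sleepMs);
            }
        }
        throw last;
    }

    static String run() throws Exception {
        sessionAlive.set(false);
        // Single thread that delivers all watcher callbacks, like the EventThread.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> result = new CompletableFuture<>();

            // User watcher: workaround 2, hand the blocking read off to the worker
            // and return immediately so the event thread is not held up.
            eventThread.execute(() -> worker.execute(() -> {
                try {
                    result.complete(readWithBoundedRetry(50, 20));
                } catch (Exception e) {
                    result.completeExceptionally(e);
                }
            }));

            // Framework watcher: queued behind the user watcher on the same thread.
            // It can only run once the user watcher has returned.
            eventThread.execute(() -> sessionAlive.set(true));

            return result.get(5, TimeUnit.SECONDS);
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

If the retry loop ran directly inside the first `eventThread.execute` task with no retry limit, the second task would never be scheduled and the loop could not end. In Curator itself, I believe registering the listener with an explicit executor via `getListenable().addListener(listener, executor)` achieves a similar hand-off, but please check that against your version.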

Still, it seems like a gap that a user-defined watcher can block the framework's watcher, when the framework's watcher is vital for the user's watcher to finish its work.

Is there anything Curator could do to improve this? ^ ^

jira-importer, Sep 24 '17 15:09

randgalt:

As I said, we need a test case and, ideally, a Pull Request with the fix.

jira-importer, Oct 06 '17 14:10