
[CURATOR-422] PathChildrenCache is not tolerant to failed connection to ZK on startup

Open jira-importer opened this issue 8 years ago • 0 comments

If PathChildrenCache is started while ZooKeeper is unavailable for long enough to exhaust the operation retries, and the parent node does not exist, then PathChildrenCache no longer watches for changes once the connection to ZooKeeper is restored.
Root cause: PathChildrenCache uses EnsureContainers, which has the following logic:

private synchronized void internalEnsure() throws Exception
{
    if ( ensureNeeded.compareAndSet(true, false) )
    {
        client.createContainers(path);
    }
}

This logic ignores the result of the operation. The flag is cleared before client.createContainers is called, so even if that call throws an exception and the nodes are not created, EnsureContainers will not try to create them again on later calls.
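The failure mode can be illustrated without a ZooKeeper server. The following is a minimal standalone model, not Curator code: `ContainerCreator`, `Ensurer`, `FlakyClient` and the class names are hypothetical, with `ContainerCreator#createContainers` standing in for `CuratorFramework#createContainers`. It contrasts the logic quoted above with one possible fix that clears the flag only after the call succeeds, then simulates a failed attempt while ZooKeeper is down followed by a second attempt after it comes back.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class EnsureContainersSketch {

    // Hypothetical stand-in for CuratorFramework#createContainers(String)
    interface ContainerCreator {
        void createContainers(String path) throws Exception;
    }

    interface Ensurer {
        void ensure() throws Exception;
    }

    // Mirrors the reported logic: the flag is cleared BEFORE the call,
    // so a failed call is never retried
    static class OriginalEnsure implements Ensurer {
        private final AtomicBoolean ensureNeeded = new AtomicBoolean(true);
        private final ContainerCreator client;
        private final String path;

        OriginalEnsure(ContainerCreator client, String path) {
            this.client = client;
            this.path = path;
        }

        @Override
        public synchronized void ensure() throws Exception {
            if (ensureNeeded.compareAndSet(true, false)) {
                client.createContainers(path);
            }
        }
    }

    // Possible fix (assumption, not the official patch): clear the flag
    // only after createContainers returns without throwing
    static class FixedEnsure implements Ensurer {
        private final AtomicBoolean ensureNeeded = new AtomicBoolean(true);
        private final ContainerCreator client;
        private final String path;

        FixedEnsure(ContainerCreator client, String path) {
            this.client = client;
            this.path = path;
        }

        @Override
        public synchronized void ensure() throws Exception {
            if (ensureNeeded.get()) {
                client.createContainers(path);
                ensureNeeded.set(false); // reached only on success
            }
        }
    }

    // Simulated client: throws while "ZooKeeper is down", counts successful creations
    static class FlakyClient implements ContainerCreator {
        boolean available = false;
        int created = 0;

        @Override
        public void createContainers(String path) throws Exception {
            if (!available) {
                throw new Exception("ConnectionLoss for " + path);
            }
            created++;
        }
    }

    static void tryEnsure(Ensurer ensurer) {
        try {
            ensurer.ensure();
        } catch (Exception e) {
            System.out.println("ensure failed: " + e.getMessage());
        }
    }

    // Returns how many times the containers were actually created
    static int simulate(boolean useFix) {
        FlakyClient client = new FlakyClient();
        Ensurer ensurer = useFix ? new FixedEnsure(client, "/test") : new OriginalEnsure(client, "/test");
        tryEnsure(ensurer);      // ZooKeeper unavailable: the call fails
        client.available = true; // ZooKeeper becomes available
        tryEnsure(ensurer);      // should now create the containers
        return client.created;
    }

    public static void main(String[] args) {
        System.out.println("original: created " + simulate(false) + " time(s)");
        System.out.println("fixed:    created " + simulate(true) + " time(s)");
    }
}
```

With the original logic the second attempt is skipped, so the containers are created 0 times; with the fix they are created once. An equivalent variant keeps the `compareAndSet` but resets the flag to `true` in a `catch` block before rethrowing. In both cases the method stays `synchronized`, so concurrent callers cannot both run the creation.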
Example of the exception:

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /test
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
	at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:274)
	at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199)
	at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
	at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190)
	at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175)
	at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194)
	at org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61)
	at org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.refresh(PathChildrenCache.java:490)
	at org.apache.curator.framework.recipes.cache.RefreshOperation.invoke(RefreshOperation.java:35)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

As a result, the watcher registered in org.apache.curator.framework.recipes.cache.PathChildrenCache#refresh is never triggered, and the cache does not pick up children created after ZooKeeper becomes available.

Test to reproduce:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.retry.RetryOneTime;
import org.apache.curator.test.TestingServer;
import org.junit.Test;

public class PathChildrenCacheStartupTest {

    @Test
    public void test() throws Exception {
        // Create the server without starting it (start = false)
        TestingServer zkTestServer = new TestingServer(2181, false);
        CuratorFramework curatorFramework = CuratorFrameworkFactory.newClient(
                zkTestServer.getConnectString(),
                5000, // session timeout, ms
                1000, // connection timeout, ms
                new RetryOneTime(100)
        );
        curatorFramework.start();
        PathChildrenCache cache = new PathChildrenCache(curatorFramework, "/test", true);
        cache.start(PathChildrenCache.StartMode.POST_INITIALIZED_EVENT);

        // Let the initial refresh exhaust its retries while ZooKeeper is down
        Thread.sleep(5000);

        zkTestServer.start();
        curatorFramework.create().creatingParentContainersIfNeeded().forPath("/test/example");

        // Per this report, /test/example never shows up in the cache
        while (true) {
            Thread.sleep(1000);
            System.out.println(cache.getCurrentData());
        }
    }
}


Originally reported by dnk, imported from: PathChildrenCache is not tolerant to failed connection to ZK on startup
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer, Jul 13 '17 20:07