[CURATOR-355] Curator client fails when connecting to read-only ensemble

Open jira-importer opened this issue 9 years ago • 15 comments

ZK is 3.5.1-alpha

I have a 3-node ZK cluster with read-only mode enabled.
Two nodes are down, so the remaining one (QA-E8WIN11) is in read-only mode (verified by using the ZK API manually). All the machines of the ensemble can be pinged from the client.

I'm using this piece of code:

	Builder curatorClientBuilder = CuratorFrameworkFactory.builder()
			.connectString("QA-E8WIN11:2181,QA-E8WIN12:2181")
			.sessionTimeoutMs(45000).connectionTimeoutMs(15000)
			.retryPolicy(new RetryNTimes(3, 5000)).canBeReadOnly(true);
	CuratorFramework client = curatorClientBuilder.build();
	client.start();
	client.getZookeeperClient().blockUntilConnectedOrTimedOut();
	System.out.println("Successfully established the connection with ZooKeeper");

	client.getData().forPath("/");
	System.out.println("Done.");

When Curator picks the host that is up first, the code goes through very quickly. When it picks the host that is down first (QA-E8WIN12), it seems to be stuck at the getData() call for a very long time, and then eventually fails with a ConnectionLossException (see attached log).


Originally reported by benjamin.jaton, imported from: Curator client fails when connecting to read-only ensemble
  • status: Open
  • priority: Critical
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer avatar Oct 10 '16 17:10 jira-importer

randgalt:

Is this Curator 3.x? or 2.x?

jira-importer avatar Oct 10 '16 17:10 jira-importer

benjamin.jaton:

I used Curator 2.11.0.

jira-importer avatar Oct 10 '16 17:10 jira-importer

randgalt:

Firstly, the call to client.getZookeeperClient().blockUntilConnectedOrTimedOut(); is unnecessary as Curator does this internally.

Curator 3.0 has better connection timeout behavior than Curator 2.0. In 2.0, the connection timeout is applied for each iteration of the Retry Policy. So, in this case, you'd expect getData() to wait 15 seconds * 3, plus 5 seconds * 3 for a total of one minute. In my recreation of your test that's exactly what I see:

	System.setProperty("readonlymode.enabled", "true");
	TestingCluster cluster = new TestingCluster(3);
	cluster.getServers().get(0).stop();
	cluster.getServers().get(1).stop();

	CuratorFrameworkFactory.Builder curatorClientBuilder = CuratorFrameworkFactory.builder()
			.connectString(cluster.getConnectString())
			.sessionTimeoutMs(45000).connectionTimeoutMs(15000)
			.retryPolicy(new RetryNTimes(3, 5000)).canBeReadOnly(true);

	CuratorFramework client = curatorClientBuilder.build();
	client.start();
	client.getZookeeperClient().blockUntilConnectedOrTimedOut();
	System.out.println("Successfully established the connection with ZooKeeper");

	client.getData().forPath("/");
	System.out.println("Done.");

With Curator 3.0, the time improves to just 15 seconds * 2, that is, the connection timeout applied twice: once for blockUntilConnectedOrTimedOut() and once for getData(). Note: blockUntilConnectedOrTimedOut() would have returned false in all cases, implying you should not continue.
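
The arithmetic in this comment can be checked with a few lines of Java. This is only a sketch of the numbers quoted above, not Curator's actual retry loop; the class and method names are made up for illustration.

```java
/** Illustration only: reproduces the wait-time arithmetic quoted in this comment. */
final class WorstCaseWait {

    // Curator 2.x as described above: the connection timeout is paid on each
    // retry-policy iteration, plus the sleep between retries.
    static long curator2xMs(long connectionTimeoutMs, int retries, long sleepBetweenRetriesMs) {
        return connectionTimeoutMs * retries + sleepBetweenRetriesMs * retries;
    }

    // Curator 3.x as described above: the connection timeout is paid once per blocking call.
    static long curator3xMs(long connectionTimeoutMs, int blockingCalls) {
        return connectionTimeoutMs * blockingCalls;
    }

    public static void main(String[] args) {
        // connectionTimeoutMs(15000) and RetryNTimes(3, 5000) from the reported code.
        System.out.println("2.x getData(): " + curator2xMs(15000, 3, 5000) / 1000 + " s"); // 60 s
        // Two blocking calls: blockUntilConnectedOrTimedOut() and getData().
        System.out.println("3.x total: " + curator3xMs(15000, 2) / 1000 + " s"); // 30 s
    }
}
```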

jira-importer avatar Oct 10 '16 17:10 jira-importer

benjamin.jaton:

Sounds good, I will check the 3.x release. But I won't be able to use it for existing deployments. Is there any chance of a fix for version 2.x?

jira-importer avatar Oct 10 '16 18:10 jira-importer

randgalt:

It would be hard to backport to 2.x.

jira-importer avatar Oct 10 '16 20:10 jira-importer

benjamin.jaton:

Does it need a backport? Just a fix of the existing mechanism is enough, no need to change the whole thing if possible.

jira-importer avatar Oct 10 '16 21:10 jira-importer

randgalt:

It's one of those things where I can't know if people are depending on it. Maybe we can add a System property to switch to the new behavior. That would be OK.

jira-importer avatar Oct 10 '16 21:10 jira-importer

benjamin.jaton:

Just to clarify, in this case one of the ZK nodes is still running, so the Curator client should successfully connect to it, and blockUntilConnectedOrTimedOut() should return true.

jira-importer avatar Oct 10 '16 21:10 jira-importer

randgalt:

Yes - though it's ZooKeeper doing the actual connecting.

jira-importer avatar Oct 10 '16 21:10 jira-importer

benjamin.jaton:

When I connect using the ZK API directly with sessionTimeout=45000 and it picks the server that is NOT started first, the ZK client takes 22 seconds (45/2?) before it tries the second server, which then works and I get my connection.

In contrast, Curator seems to wait only connectionTimeout=15000 in blockUntilConnectedOrTimedOut(), so it seems to be failing because it stops trying too early.

jira-importer avatar Oct 10 '16 22:10 jira-importer

randgalt:

ConnectionTimeout is a Curator concept. You should set it to whatever you need. ZooKeeper fails a heartbeat after 2/3 of the session timeout, as you have seen.

jira-importer avatar Oct 11 '16 09:10 jira-importer

benjamin.jaton:

Is there documentation somewhere that explains what connectionTimeout means for Curator?
I thought it was the timeout of the connection to a specific node.

Also, I don't think ZK fails after 2/3 of a session. From my tests it seems to fail at (sessionTimeout / nbServersInConnectionString).
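
The two figures discussed here may refer to different timers. The sketch below assumes the ZooKeeper client derives a per-host connect timeout as the session timeout divided by the number of hosts in the connect string, and a read timeout as 2/3 of the session timeout. Both formulas are assumptions to verify against the ZooKeeper version in use (3.5.1-alpha here), and the class and method names are hypothetical.

```java
/** Illustration only: assumed derivation of ZooKeeper client timeouts from the session timeout. */
final class ZkClientTimeouts {

    // Assumption: time allowed to connect to one host before moving on to the next.
    static int connectTimeoutMs(int sessionTimeoutMs, int hostsInConnectString) {
        return sessionTimeoutMs / hostsInConnectString;
    }

    // Assumption: time without a server response before an established connection is considered dead.
    static int readTimeoutMs(int sessionTimeoutMs) {
        return sessionTimeoutMs * 2 / 3;
    }

    public static void main(String[] args) {
        int sessionTimeoutMs = 45000; // from the reported code
        int hosts = 2;                // QA-E8WIN11:2181,QA-E8WIN12:2181
        System.out.println("connect timeout per host: " + connectTimeoutMs(sessionTimeoutMs, hosts) + " ms"); // 22500 ms
        System.out.println("read timeout: " + readTimeoutMs(sessionTimeoutMs) + " ms"); // 30000 ms
    }
}
```

Under these assumptions, the observed 22 seconds matches the per-host connect timeout, which is longer than the connectionTimeout of 15000 ms configured for Curator.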

jira-importer avatar Oct 11 '16 16:10 jira-importer

benjamin.jaton:

Note that the code you provided using the TestingCluster class cannot be used to reproduce the behavior described in this bug: a local connection is actively refused if the port is not open, whereas in the original example the TCP connection times out.
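
The difference between a refused connection and one that times out can be observed with plain sockets. The following is a minimal sketch using only the JDK; the class, enum, and method names are hypothetical, and the mapping of ConnectException to "refused" is a simplification.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Illustration only: classifies how a TCP connection attempt ends. */
final class ConnectProbe {
    enum Outcome { CONNECTED, REFUSED, TIMED_OUT, OTHER_ERROR }

    static Outcome probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return Outcome.CONNECTED;
        } catch (SocketTimeoutException e) {
            return Outcome.TIMED_OUT;   // no answer at all within timeoutMs
        } catch (ConnectException e) {
            return Outcome.REFUSED;     // typically an active refusal (closed port on a reachable host)
        } catch (IOException e) {
            return Outcome.OTHER_ERROR; // e.g. unknown host
        }
    }

    // Probes a local port while something listens on it, then again after the listener is closed.
    static Outcome[] localDemo() {
        try {
            int port;
            Outcome whileListening;
            try (ServerSocket server = new ServerSocket(0)) {
                port = server.getLocalPort();
                whileListening = probe("127.0.0.1", port, 2000);
            }
            Outcome afterClose = probe("127.0.0.1", port, 2000);
            return new Outcome[] { whileListening, afterClose };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Outcome[] outcomes = localDemo();
        System.out.println("listening local port: " + outcomes[0]);
        System.out.println("closed local port:    " + outcomes[1]);
    }
}
```

A closed local port usually fails immediately, which is the TestingCluster situation. To see the slow path from the original report, probe() would have to be pointed at a host that is actually powered off, such as QA-E8WIN12:2181 in the reporter's environment.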

jira-importer avatar Oct 11 '16 21:10 jira-importer

ken.liu.geminidata:

This ticket will not be solved? I am also facing this issue: when one of the ZK nodes is down, I receive a ConnectionLossException and the client does not work normally.

jira-importer avatar Mar 28 '19 21:03 jira-importer

randgalt:

>  This ticket will not be solved?

 We'd need a PR with the solution. I can help you if you take it on.

jira-importer avatar Mar 28 '19 22:03 jira-importer