[CURATOR-439] CuratorFrameworkState STARTED, but ZookeeperClient not connected

Open jira-importer opened this issue 8 years ago • 12 comments

I recently ran into an issue on some of our nodes caused by network issues between a service and Zookeeper. I have been unable to recreate them as of yet, but I'm still trying.

Setup
5x services using Curator 3.2.1 to talk to Zookeeper 3.5.3 cluster (also 5 nodes).

Network issues caused the services to disconnect from Zookeeper.

There's a check in our code to see if the Zookeeper connection is available before sending a request:

    public boolean isConnected() {
        return curatorFramework.getZookeeperClient().isConnected();
    }

After the network issues resolved, we noticed that all calls to Zookeeper from 4 of the services were still failing (the fifth was fine). Checking the logs, we saw that CuratorFramework.getState() was reporting the state as STARTED, but curatorFramework.getZookeeperClient().isConnected() was returning false. Restarting the service fixed everything, but I obviously want to avoid this issue in future.

Problem
I couldn't find any documentation stating whether CuratorZookeeperClient.isConnected() should be used, whether CuratorFramework.getState() == CuratorFrameworkState.STARTED (the functionality of the deprecated CuratorFramework.isStarted()) would be the better check, or whether the two should be equivalent and there's a bug that let one be true while the other was false.

If my own check is wrong, and I shouldn't be using CuratorZookeeperClient.isConnected(), then I can easily fix that. I wanted to check the expected behaviour before diving too deep into this, in case this is normal and I am just using Curator incorrectly.

Edit

This was a misunderstanding on my part. I'm leaving it open so that I can submit a documentation/example update shortly to hopefully clarify things a bit better for others.


Originally reported by Kenco, imported from: CuratorFrameworkState STARTED, but ZookeeperClient not connected
  • status: Open
  • priority: Minor
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer avatar Oct 24 '17 10:10 jira-importer

kenco:

From analysing the log files, it looks like the ConnectionState fluctuated between SUSPENDED and RECONNECTED a few times, and was LOST twice. The first time the connection was LOST, it RECONNECTED again afterwards. After the second time, there were no more ConnectionState changes.

It isn't clear from the documentation, but are we expected to close and restart the Curator instance if the ConnectionState is LOST? After looking through some other public codebases, it seems that this is the approach that others take.

jira-importer avatar Oct 25 '17 08:10 jira-importer

kenco:

Updating the status to major. After a network outage we saw the Curator status was STARTED, but the Zookeeper client was not connected. This resulted in an outage and forced us to bounce the applications.

jira-importer avatar Apr 20 '18 15:04 jira-importer

randgalt:

I don't understand what the bug is here. Curator status "STARTED" does not mean you are connected. You must look at Curator's ConnectionState. Please see details here: http://curator.apache.org/errors.html

No - you do not have to close and restart the curator instances. The only reason to recreate the Curator Instances is if the ip/port of the ZooKeeper instances changes. If they don't change, there is no need to ever re-create the Curator instances. You can see copious tests in the Curator code base where this is proven. 

Also, from this Jira's description, it would be wrong to check isConnected before making requests. This, itself, might cause an outage. Curator internally manages the ZooKeeper connection. If you don't make any requests when isConnected is false you might never get a successful connection. 

You can see from the Curator example code (http://curator.apache.org/curator-examples/index.html) that there's no need for a lot of what you are doing.
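
The advice above (issue the request and let the retry handling deal with connection loss, rather than gating on isConnected) can be illustrated with a small, library-free sketch. It is not Curator's implementation: RetryingCaller and its parameters are hypothetical names invented for this example. In a real application the equivalent behaviour comes from the RetryPolicy passed when building the client (for example ExponentialBackoffRetry given to CuratorFrameworkFactory.newClient).

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a bounded exponential-backoff retry; NOT Curator's code. */
final class RetryingCaller {
    private final int maxRetries;
    private final long baseSleepMs;

    RetryingCaller(int maxRetries, long baseSleepMs) {
        this.maxRetries = maxRetries;
        this.baseSleepMs = baseSleepMs;
    }

    /** Runs the operation and retries on failure; it never checks a "connected" flag first. */
    <T> T call(Callable<T> operation) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Do not retry an interrupted operation; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the failure to the caller
                }
                Thread.sleep(baseSleepMs << attempt); // back off before the next attempt
            }
        }
    }
}
```

The point of the shape is that the attempt itself is what drives recovery: a caller that skips the request whenever a connected flag is false never gives the client a chance to re-establish the connection.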

jira-importer avatar Apr 20 '18 15:04 jira-importer

kenco:

Thanks Jordan Zimmerman - I think the confusion just comes from the lack of good examples or explanation of the behaviour of Curator in different scenarios. We did have a ConnectionStateListener, but the following line in the documentation made us think there was more we should be doing:

Clients can monitor these changes and take appropriate action.

Looking at other libraries (like this), people seemed to be checking that the ZK Client was connected - so we thought that was a good practice. 

If I understand correctly, the following should be true:

  1. ConnectionStateListener does not need to do anything - it can be used purely to log changes in the state of Curator, but no further action is needed. LOST or SUSPENDED connections should automatically RECONNECT when the network is back up.
  2. I should not check getZookeeperClient().isConnected() before any action - just perform the action, and if the client isn't connected, it will connect (if possible).
  3. Should we check if the ConnectionState is CONNECTED, RECONNECTED or READ_ONLY, and only perform actions if the Curator ConnectionState is in one of those?

If I've got this right, then I'll make sure to close this ticket as "Not an Issue".

jira-importer avatar Apr 23 '18 09:04 jira-importer

randgalt:

That's mostly correct. However, it depends a lot on the recipes you use. For example, if you use InterProcessMutex you must have a ConnectionStateListener that interrupts your locks when the connection is lost (we recommend doing this on SUSPENDED; see the error handling section here: http://curator.apache.org/curator-recipes/shared-lock.html). Every recipe has an error handling section.

Maybe you can turn this issue into a doc/example improvement issue. We'd love the docs to be better. 
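
As a rough illustration of the lock handling described above, the sketch below interrupts the thread that currently believes it holds a lock as soon as a SUSPENDED or LOST state arrives. It uses only the JDK: LockHolderGuard and its nested State enum are hypothetical stand-ins that mirror the state names used in this thread. In real code the onStateChanged call would sit inside a Curator ConnectionStateListener, and the recipe's own error handling section remains the authority.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: interrupts the current lock holder when the connection degrades. */
final class LockHolderGuard {
    /** Local stand-in for the connection states discussed in this thread. */
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

    private final AtomicReference<Thread> holder = new AtomicReference<>();

    /** Record the thread that has just acquired the lock. */
    void lockAcquiredBy(Thread thread) {
        holder.set(thread);
    }

    /** Clear the record once the lock has been released normally. */
    void lockReleased() {
        holder.set(null);
    }

    /**
     * Call this from the connection-state callback.
     * Returns true if a lock holder was interrupted.
     */
    boolean onStateChanged(State newState) {
        if (newState == State.SUSPENDED || newState == State.LOST) {
            Thread thread = holder.getAndSet(null);
            if (thread != null) {
                thread.interrupt(); // the holder can no longer assume it owns the lock
                return true;
            }
        }
        return false;
    }
}
```

The interrupted thread is expected to stop its protected work and release or re-acquire the lock; how it does so depends on the application.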

jira-importer avatar Apr 23 '18 10:04 jira-importer

kenco:

That's perfect - thanks Jordan Zimmerman.

Yep, I'll see if I can work up something documentation/example wise and submit it back. It will probably be a week or two before I can get to it though.

jira-importer avatar Apr 23 '18 10:04 jira-importer

randgalt:

P.S. We have Tech Notes too where things can be clarified: https://cwiki.apache.org/confluence/display/CURATOR/Tech+Notes

jira-importer avatar Apr 23 '18 10:04 jira-importer

kenco:

Brilliant. I'll set aside some time and try to get some of those updates in.

jira-importer avatar Apr 23 '18 10:04 jira-importer

kenco:

Jordan Zimmerman - final question - is there an easy way to check Curator's ConnectionState? There doesn't look to be a way to retrieve this from the CuratorFramework - getState() just returns the CuratorFrameworkState.

What I'm currently doing is setting the ConnectionState in the ConnectionStateListener - but are we guaranteed that this is always called in order (i.e. is it ever possible that a SUSPENDED event will hit the listener before a RECONNECTED despite them being sent in the opposite order)?

jira-importer avatar Apr 24 '18 08:04 jira-importer

randgalt:

is there an easy way to check Curator's ConnectionState?

Not currently. 

are we guaranteed that this is always called in order

Yes - there's a single thread that handles this in ConnectionStateManager.java
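
The approach kenco describes (remembering the most recent state inside the listener) could look roughly like the sketch below. It is JDK-only so that it stands alone: ConnectionStateTracker and its nested State enum are hypothetical stand-ins, not Curator classes. In a real application the stateChanged method would be invoked from a ConnectionStateListener registered through client.getConnectionStateListenable().addListener(...). Because listener events are delivered in order from a single thread, a plain "last value wins" holder is enough.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical holder for the most recently reported connection state. */
final class ConnectionStateTracker {
    /** Local stand-in mirroring the state names used in this thread. */
    enum State {
        CONNECTED(true), SUSPENDED(false), RECONNECTED(true), LOST(false), READ_ONLY(true);

        private final boolean connected;

        State(boolean connected) {
            this.connected = connected;
        }

        boolean isConnected() {
            return connected;
        }
    }

    // AtomicReference makes the latest value visible to threads other than the listener thread.
    private final AtomicReference<State> latest = new AtomicReference<>();

    /** Call from the connection-state callback. */
    void stateChanged(State newState) {
        latest.set(newState);
    }

    /** Returns the last reported state, or null if no event has arrived yet. */
    State latest() {
        return latest.get();
    }

    /** True only if the last reported state is CONNECTED, RECONNECTED or READ_ONLY. */
    boolean isConnected() {
        State state = latest.get();
        return state != null && state.isConnected();
    }
}
```

In line with the earlier comments, a value like this is best suited to logging or health reporting; it should not be used to skip requests.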

jira-importer avatar Apr 25 '18 23:04 jira-importer

kenco:

That's perfect then - I was reasonably sure, but just wanted to double check.

jira-importer avatar Apr 26 '18 08:04 jira-importer