curator icon indicating copy to clipboard operation
curator copied to clipboard

[CURATOR-157] Avoid stack traces closing PathChildrenCache followed by closing CuratorFramework

Open jira-importer opened this issue 11 years ago • 17 comments

When closing PathChildrenCache, and immediately afterwards closing CuratorFramework, some ERROR-level stack traces are logged.

This was previously reported on the mailing list: http://curator.markmail.org/thread/bmfr62ekx5p2vv7f

The cause is that the BackgroundCallback defined in PathChildrenCache.refresh() will, when triggered, perform some more ZooKeeper operations.

Thus one can get in sequences such as:

  • operation with BackgroundCallback is submitted
  • processResult of the BackgroundCallback is called
  • PathChildrenCache is closed
  • CuratorFramework is closed
  • processResult, which is running on another thread, comes to the point it does operations on ZooKeeper, which fail because ZooKeeper is closed.

There is no real impact on the application, it is just for log-esthetical reasons that I'd like to avoid it.

In the more common case, the processResult will receive an IllegalStateException, which could be easily catched and ignored in PathChildrenCache if the PathChildrenCache is closed:

14/10/30 11:24:51 ERROR org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
java.lang.IllegalStateException: instance must be started before calling this method
	at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getData(CuratorFrameworkImpl.java:360)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.getDataAndStat(PathChildrenCache.java:545)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.processChildren(PathChildrenCache.java:668)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.access$200(PathChildrenCache.java:68)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache$4.processResult(PathChildrenCache.java:490)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:715)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:502)
	at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:590)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

But sometimes it also fails with other async operations deeper down:

14/10/30 11:24:51 ERROR org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
java.lang.IllegalStateException: Client is not started
	at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474)
	at org.apache.curator.framework.imps.GetDataBuilderImpl.performBackgroundOperation(GetDataBuilderImpl.java:263)
	at org.apache.curator.framework.imps.OperationAndData.callPerformBackgroundOperation(OperationAndData.java:65)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:789)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:487)
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:275)
	at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.getDataAndStat(PathChildrenCache.java:545)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.processChildren(PathChildrenCache.java:668)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache.access$200(PathChildrenCache.java:68)
	at org.apache.curator.framework.recipes.cache.PathChildrenCache$4.processResult(PathChildrenCache.java:490)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:715)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:502)
	at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:166)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:590)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

Therefore I have created a patch where PathChildrenCache.close() will wait until the possibly running BackgroundCallback is finished.

I will also attach a small class that illustrates the problem.


Originally reported by bruno, imported from: Avoid stack traces closing PathChildrenCache followed by closing CuratorFramework
  • assignee: randgalt
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer avatar Oct 30 '14 10:10 jira-importer

bruno:

While adapting my patch for master (I was working on 2.6.0), I see the current code already contains a change that makes the situation somewhat less likely, as it checks this in processResult:

    if (PathChildrenCache.this.state.get().equals(State.CLOSED)) {
// This ship is closed, don't handle the callback
return;
    }

this makes it occure less often, but doesn't avoid the problem entirely.

jira-importer avatar Oct 30 '14 11:10 jira-importer

githubbot:

GitHub user bdumon opened a pull request:

https://github.com/apache/curator/pull/51

CURATOR-157 Avoid stack traces on close

Avoid error-level stack traces closing PathChildrenCache followed by closing CuratorFramework

See details in jira: https://issues.apache.org/jira/browse/CURATOR-157

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bdumon/curator CURATOR-157

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/curator/pull/51.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #51


commit 5ca022d2196f1b763043235fbdec439c1769fa9d
Author: Bruno Dumon
Date: 2014-10-30T11:05:51Z

CURATOR-157 Avoid stack traces on close

Avoid error-level stack traces closing PathChildrenCache followed by closing CuratorFramework


jira-importer avatar Oct 30 '14 11:10 jira-importer

bruno:

Patch submitted via github: https://github.com/apache/curator/pull/51

jira-importer avatar Oct 30 '14 11:10 jira-importer

githubbot:

Github user madrob commented on a diff in the pull request:

https://github.com/apache/curator/pull/51#discussion_r19606266

— Diff: curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/PathChildrenCache.java —
@@ -382,6 +383,11 @@ public void close() throws IOException
connectionStateListener = null;
childrenWatcher = null;
dataWatcher = null;
+
+ synchronized (backgroundTaskMonitor) {
— End diff –

Can you explain why this works in a little bit more detail? My immediate intuition is that the compiler/JIT would optimize an empty synchronized block out of the code.

jira-importer avatar Oct 30 '14 14:10 jira-importer

githubbot:

Github user bdumon commented on a diff in the pull request:

https://github.com/apache/curator/pull/51#discussion_r19611104

— Diff: curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/PathChildrenCache.java —
@@ -382,6 +383,11 @@ public void close() throws IOException
connectionStateListener = null;
childrenWatcher = null;
dataWatcher = null;
+
+ synchronized (backgroundTaskMonitor) {
— End diff –

An empty synchronized block is not like an empty if statement or such, since "getting the lock" is a meaningful operation by itself. This being said, I could rewrite it to use an java.util.concurrent Lock object, or maybe something else?

At a high level, my change will make the close() method wait until a potentially running background refresh operation has finished (and it also assures that no new refresh can start after the close method has returned). I guess this should be fine, though it could cause some delay on the close() in case of very large caches. This could be avoided by extra is-closed checks in the for-loop in processChildren(). I didn't consider this earlier, I can add it to my patch.

jira-importer avatar Oct 30 '14 15:10 jira-importer

githubbot:

Github user madrob commented on a diff in the pull request:

https://github.com/apache/curator/pull/51#discussion_r19614037

— Diff: curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/PathChildrenCache.java —
@@ -382,6 +383,11 @@ public void close() throws IOException
connectionStateListener = null;
childrenWatcher = null;
dataWatcher = null;
+
+ synchronized (backgroundTaskMonitor) {
— End diff –

I'm still not sure I follow the logic, sorry. What is the benefit of making the close method wait before returning? It seems like you would need to acquire the lock earlier in the method to have an impact on control flow.

jira-importer avatar Oct 30 '14 15:10 jira-importer

githubbot:

Github user bdumon commented on a diff in the pull request:

https://github.com/apache/curator/pull/51#discussion_r19621734

— Diff: curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/PathChildrenCache.java —
@@ -382,6 +383,11 @@ public void close() throws IOException
connectionStateListener = null;
childrenWatcher = null;
dataWatcher = null;
+
+ synchronized (backgroundTaskMonitor) {
— End diff –

No problem, there might a problem in my reasoning, which is as follows: a refresh might be going on when PathChildrenCache.close() is called. The refresh runs on a background thread, the ZK event thread if I'm not mistaken. Since the processResult will be running within a synchronized block, the PathChildrenCache.close() method will not be able to return until the processResult is finished, since it tries to acquire the same lock.

The second synchronized block in refresh(), around the statements "if-closed-return, getChildren()..." is for similar reasons: the if-closed in that block might be called before the closed state is set, while the getChildren might be launched afterwards, but by then zookeeper might already be closed as well. By putting it in a synchronized block, we are sure no ZooKeeper operation is going on after returning from the close() method, and CuratorFramework/ZooKeeper can be closed without causing errors to be logged.

jira-importer avatar Oct 30 '14 17:10 jira-importer

githubbot:

Github user madrob commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-73259604

Looks like I neglected to respond to this for a while. Here is why I think the logic is faulty.

T2: `processResult()` method entry
T2: `synchronized (backgroundTaskMonitor)` acquire lock
T2: `if (state.get().equals(State.CLOSED)) { }` evaluates false
T1: `close()` method entry
T1: `if ( state.compareAndSet(State.STARTED, State.CLOSED) )` evaluates true
T1: `listeners.clear()`
T1: `client.getConnectionStateListenable().removeListener(connectionStateListener)`
T1 suspends waiting for the lock
T2: `processBackgroundResult(event)` gets called on a closed connection

I think the solution is to put all of the `close()` method into the `synchronized` block, but that is not a super appealing answer. Maybe it will be fine though.

The other concern is that now we can only process a single background event at a time, which seems like it could be a pretty high price to pay.

jira-importer avatar Feb 06 '15 16:02 jira-importer

githubbot:

Github user bdumon commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-74473597

I follow your reasoning, until the last event:

T2: processBackgroundResult(event) gets called on a closed connection

There is no connection closed by NodeCache or PathChildrenCache. The point of the patch is that after NodeCache/PathChildrenCache.close() returns, they won't be making use of the ZK connection anymore, so that CuratorFramework.close() can be called without causing stacktraces.

jira-importer avatar Feb 16 '15 08:02 jira-importer

githubbot:

Github user madrob commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-74693287

I don't think it matters whether or not they actually use the ZK Connection at that point. The important part is that `getState() == State.CLOSED` (as set by T1) and then when the `processBackgroundResult` call happens it will check for `State.STARTED`, which it will be false.

jira-importer avatar Feb 17 '15 16:02 jira-importer

githubbot:

Github user bdumon commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-74829158

Hmm, I don't see any such check? Is this in NodeCache or PathChildrenCache?

jira-importer avatar Feb 18 '15 08:02 jira-importer

githubbot:

Github user madrob commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-74895065

In the JIRA issue, the example trace came from `GetChildrenBuilderImpl` line 166.

jira-importer avatar Feb 18 '15 16:02 jira-importer

githubbot:

Github user bdumon commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-75036380

In those stack traces, it is CuratorFramework/CuratorZookeeperClient which check if their state is started (which is stil true, since we have not closed them, we have closed the cache), not NodeCache/PathChildrenCache.

Or are you referring to something else?

jira-importer avatar Feb 19 '15 11:02 jira-importer

githubbot:

Github user madrob commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-75078668

Ok, I think I got myself confused between the two close operations.

Can you add a test (probably with an artificial delay to ensure ordering) that would fail before your patch and pass after to make sure that we properly handle this? Sorry to push back on this, but I'm not sure anymore of my own ability to understand the code 100%.

jira-importer avatar Feb 19 '15 16:02 jira-importer

githubbot:

Github user bdumon commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-75529312

For PathChildrenCache, there was an existing test added as part of https://issues.apache.org/jira/browse/CURATOR-141, I've modified it a bit so that it is more likely to reproduce the problem. I also added a similar test for NodeCache.

jira-importer avatar Feb 23 '15 11:02 jira-importer

githubbot:

Github user hdeadman commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-173025262

What's the status on this pull request? It looks like @bdumon added a test. Is the issue that there are branch conflicts? Does the error not happen in 3.0? I am getting it on 2.9.1 if I use a try block that relies on the Closable interface.

jira-importer avatar Jan 19 '16 23:01 jira-importer

githubbot:

Github user cammckenzie commented on the pull request:

https://github.com/apache/curator/pull/51#issuecomment-173026908

@madrob Are you able to take a look at this?

jira-importer avatar Jan 19 '16 23:01 jira-importer