[CURATOR-678] InterProcessMutex#release caused inconsistency between zk node and local cache if encountering zk connection lost
We ran into the following problem:
An InterProcessMutex participant acquired the lock. While release() was running, the ZooKeeper connection was lost, which left an inconsistency at https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessMutex.java#L139
to line 143: the znode deletion threw an exception because of the connection loss, but the entry was still removed from the local `threadData` cache.
As a result, even after the ZooKeeper connection recovered, ALL subsequent acquire() calls failed due to the inconsistency (no entry in the local `threadData`, but the OLD lock znode was still present).
Please help confirm this behavior. I believe it is a bug and Curator should fix the inconsistency; one suggestion is to remove the local data ONLY after the znode deletion succeeds. The same problematic pattern also seems to appear in several other similar recipes, such as `InterProcessSemaphore`.
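For reference, the current pattern at the linked lines looks roughly like this (paraphrased sketch, not the verbatim source):
```
// Paraphrased sketch of InterProcessMutex.release() at the linked lines
// (not the verbatim source): the local cache entry is removed even when
// the znode deletion throws, e.g. on ConnectionLossException.
try
{
    internals.releaseLock(lockData.lockPath);   // deletes the lock znode; may throw on connection loss
}
finally
{
    threadData.remove(currentThread);           // runs regardless of whether the delete succeeded
}
```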
Stacktrace:
```
Failed to release mutex for xxxxxxxxxxxxx
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /xxxx/_c_65fb02ef-9b1d-4c8c-b715-5c97f82ae0d3-lock-0000000000
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.6.3.jar:3.6.3]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.6.3.jar:3.6.3]
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[zookeeper-3.6.3.jar:3.6.3]
    at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:313) ~[curator-framework-5.3.0.jar:5.3.0]
    at org.apache.curator.framework.imps.DeleteBuilderImpl$6.call(DeleteBuilderImpl.java:301) ~[curator-framework-5.3.0.jar:5.3.0]
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) ~[curator-client-5.3.0.jar:?]
    at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:298) ~[curator-framework-5.3.0.jar:5.3.0]
    at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:282) ~[curator-framework-5.3.0.jar:5.3.0]
    at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35) ~[curator-framework-5.3.0.jar:5.3.0]
    at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:347) ~[curator-recipes-5.3.0.jar:5.3.0]
    at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:124) ~[curator-recipes-5.3.0.jar:5.3.0]
    at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) ~[curator-recipes-5.3.0.jar:5.3.0]
    at
... ...
```
Originally reported by rikimberley, imported from: InterProcessMutex#release caused inconsistency between zk node and local cache if encountering zk connection lost
- assignee: eolivelli
- status: Open
- priority: Major
- resolution: Unresolved
- imported: 2025-01-21
LockInternals::deleteOurPath uses Guaranteeable::guaranteed to delete the lock path. If the session is repaired before it expires, the path is supposed to be deleted in the background.
Did you configure a retry policy? CONNECTIONLOSS has to be retriable for this to work.
PS: I gave a similar reply in CURATOR-486.
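For context, the mechanism referred to above is Curator's guaranteed delete, which relies on the client's retry policy. A minimal sketch follows; the connection string, lock path, and class name are placeholders, not values from this report:
```
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class GuaranteedDeleteSketch
{
    public static void main(String[] args) throws Exception
    {
        // A retry policy must be configured; CONNECTIONLOSS is retried under it.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // guaranteed(): if the delete still fails after the retries, Curator records
        // the path and keeps trying to delete it in the background while the client is open.
        client.delete().guaranteed().forPath("/some/lock/path");

        client.close();
    }
}
```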
Kezhu Wang Yes, we have configured a retry policy, but we don't retry forever; we set a limit on the number of retries. The exception was thrown after all retries failed.
> but we don't retry forever; we set a limit on the number of retries
That is OK. `guaranteed` is supposed to ignore the retry limit.
> As a result, even after the ZooKeeper connection recovered, ALL subsequent acquire() calls failed due to the inconsistency (no entry in the local `threadData`, but the OLD lock znode was still present).
Any possibility of a reproducible test case?
> one suggestion is to remove the local data ONLY after the znode deletion succeeds
A client-side "failure" could still be a success on the server side: the delete may have gone through even though the client only saw a connection loss. Keeping the local data in that case would leave this client believing it still holds a lock whose znode is already gone, while another client acquires it. That would introduce a double-leader situation.
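For clarity, the change proposed in the report would look roughly like the following (sketch only, not actual Curator code), with the objection above noted in the comments:
```
// Sketch of the proposed ordering (not actual Curator code): clear the local
// entry only after the znode delete is known to have succeeded.
internals.releaseLock(lockData.lockPath);   // if this throws, the threadData entry is kept
threadData.remove(currentThread);

// Objection raised above: the delete may have succeeded on the server even though
// the client observed a connection loss, so keeping the entry lets this client act
// as if it still holds a lock whose znode is already gone (double-leader risk).
```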