DistributedLock

ZooKeeperNetEx connection loss issue: an acquired lock does not seem to be released

Open · MattGhafouri opened this issue 3 years ago · 55 comments

I've implemented a lock with ZooKeeper using this configuration:

  1. DistributedLock.ZooKeeper - Version="1.0.0"
  2. dotnet version 6.0
  3. Hosted on K8s (one pod, there is no concurrent request)
  4. ZooKeeper server configuration on K8s:

version: "3.9" services: zk1: container_name: zk1 hostname: zk1 image: bitnami/zookeeper:3.8.0-debian-11-r57 ports: - 2181:2181 environment: - ALLOW_ANONYMOUS_LOGIN=yes - ZOO_SERVER_ID=1 - ZOO_SERVERS=0.0.0.0:2888:3888 - ZOO_MAX_CLIENT_CNXNS=500

There are several worker services inside the application, each working with a different lock key. Periodically each one tries to acquire its lock and do some processing. They seem to work without problems, but after a while I get this exception:

Locking failed. Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown. org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

It seems the lock cannot be acquired because it has not been released, although there is no concurrent request for the lock key.

The LockService code in .NET:

```csharp
private TimeSpan _connectionTimeoutInSecond = TimeSpan.FromSeconds(30);
private TimeSpan _waitingForLockInSecond = TimeSpan.FromSeconds(30);

public async Task<LockProcessResult> DoActionWithLockAsync(string lockKey, Func<Task> func)
{
    var processResult = new LockProcessResult();
    try
    {
        var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString, opt =>
        {
            opt.ConnectTimeout(_connectionTimeoutInSecond);
        });

        await using (var handle = await @lock.TryAcquireAsync(timeout: _waitingForLockInSecond))
        {
            if (handle != null)
            {
                // I have the lock
                await func();
            }
            else
            {
                processResult.SetException(new LockAcquisitionFailedException(lockKey));
            }
        }
    }
    catch (Exception ex)
    {
        // I got the exceptions here
        processResult.SetException(ex);
    }

    return processResult;
}
```

I'd appreciate any suggestions.

MattGhafouri (Nov 21 '22)

Have you checked your ZooKeeper? Made sure that there aren't any pod/container restarts, and checked its logs for indications of why it would drop a connection?
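
For anyone following along, checking this on the k8s side looks roughly like the sketch below (`zk1-0` is a placeholder pod name; substitute the actual ZooKeeper pod):

```sh
# Check the RESTARTS column for the ZooKeeper pod
kubectl get pods -o wide

# If it has restarted, see why (events, last state, exit code)
kubectl describe pod zk1-0

# Server-side logs; --previous shows the log of the container instance before a restart
kubectl logs zk1-0
kubectl logs zk1-0 --previous
```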

hoerup (Nov 21 '22)

> Have you checked your ZooKeeper? Made sure that there aren't any pod/container restarts, and checked its logs for indications of why it would drop a connection?

Thanks, I'll check it.

MattGhafouri (Nov 21 '22)

@MajeQafouri

> Have you checked your ZooKeeper? Made sure that there aren't any pod/container restarts, and checked its logs for indications of why it would drop a connection?

+1

> It seems the lock cannot be acquired because it has not been released

What makes you think this is the issue? The exception you reported mentions connection loss.

> after a while, I get this exception

Can you please report the full exception.ToString() including the stack trace? This will allow us to see where inside DistributedLock it is hitting the exception.

madelson (Nov 21 '22)

This is the stack trace:

```
Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 535
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperNodeCreator.cs:line 38
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperNodeCreator.cs:line 54
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 70
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 132
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperDistributedLock.cs:line 97
```

MattGhafouri (Nov 22 '22)

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover, so long as you were catching those errors (e.g. in a web app you might have some failed requests due to the lost connection, but later on a new connection would be established and it would work again). What are you finding?
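
To illustrate the "catching those errors" part, the caller of the `DoActionWithLockAsync` method shown above could treat a failed attempt as transient, log the full `exception.ToString()`, and retry. A minimal sketch, assuming `LockProcessResult` exposes the captured exception and an `ILogger` field named `_logger` is available (both are illustrative, not part of the library):

```csharp
// Sketch only: treat a failed acquisition as transient, log the full exception, and retry.
private async Task DoWorkWithRetryAsync(string lockKey, Func<Task> work)
{
    const int maxAttempts = 3;
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        var result = await DoActionWithLockAsync(lockKey, work);
        if (result.Exception is null) { return; } // lock acquired and work completed

        // exception.ToString() includes the stack trace, which is what we want in the logs.
        _logger.LogWarning("Lock attempt {Attempt} on {Key} failed: {Error}",
            attempt, lockKey, result.Exception.ToString());

        await Task.Delay(TimeSpan.FromSeconds(5)); // back off before trying again
    }
}
```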

madelson (Nov 22 '22)

Actually, after hosting on different environments, I can say it is related to the client hosting environment. Let me make it clear:

- Client host env: local Docker Desktop (Win 10 x64); ZooKeeper: hosted on k8s; result: facing the same exception once in a while.

- Client host env: running in debug mode of Visual Studio 2022 (Win 10 x64); ZooKeeper: hosted on k8s; result: working without problems.

- Client host env: hosted on k8s; ZooKeeper: hosted on k8s; result: facing the same exception once in a while.

But I still don't know what the issue is or how I should fix it.

MattGhafouri (Nov 22 '22)

> @MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?
>
> With something like this I would expect that the application might hit some errors but would ultimately recover, so long as you were catching those errors (e.g. in a web app you might have some failed requests due to the lost connection, but later on a new connection would be established and it would work again). What are you finding?

It seems that in case of an exception the acquired lock is not released, and even during the next tries it cannot be acquired.

MattGhafouri (Nov 22 '22)

> Have you checked your ZooKeeper? Made sure that there aren't any pod/container restarts, and checked its logs for indications of why it would drop a connection?

Based on the different client hosting environments I described, I can say it's not related to pod/container restarts.

MattGhafouri (Nov 22 '22)

> @MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?
>
> With something like this I would expect that the application might hit some errors but would ultimately recover, so long as you were catching those errors (e.g. in a web app you might have some failed requests due to the lost connection, but later on a new connection would be established and it would work again). What are you finding?

I forgot to mention that I don't receive requests from outside. My application is a web application with more than 50 workers, and every five seconds each one tries to do some work (with a different lock key, specific to each worker).
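
For context, each worker has roughly the shape sketched below (type and key names are hypothetical; this is not the real code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Matches the signature of the lock method shown at the top of the thread.
public interface ILockService
{
    Task<LockProcessResult> DoActionWithLockAsync(string lockKey, Func<Task> func);
}

// One of the ~50 workers: every five seconds it tries to do its work under its own lock key.
public class SampleWorker : BackgroundService
{
    private readonly ILockService _lockService;

    public SampleWorker(ILockService lockService) => _lockService = lockService;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Each worker uses its own key, so there should be no cross-worker contention.
            await _lockService.DoActionWithLockAsync("sample-worker-key", DoWorkAsync);
            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }

    private Task DoWorkAsync() => Task.CompletedTask; // placeholder for the real work
}
```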

Is it possible the issue relates to the number of TCP sockets we are opening for each request? Maybe that is why it works without issue on Windows (Visual Studio) but starts to face issues once I containerize the application.

MattGhafouri (Nov 22 '22)

> Is it possible the issue relates to the number of TCP sockets we are opening for each request?

This is possible, but I really would recommend pulling the logs from the zookeeper side and seeing what zookeeper thinks is the reason for connection loss (e.g. is the client timing out, too many connections, etc).
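
To see what the server itself reports, ZooKeeper's four-letter-word admin commands give connection counts and per-client session details. A sketch (on ZooKeeper 3.5+ these commands must be allowed via `4lw.commands.whitelist` in the server config; `zk1` is the hostname from the compose file above):

```sh
# Overall server stats, including the current number of client connections
echo stat | nc zk1 2181

# One line per client connection: IP, session id, last operation, timeouts, etc.
echo cons | nc zk1 2181
```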

madelson (Nov 23 '22)

@MajeQafouri Have you seen this one https://github.com/shayhatsor/zookeeper/issues/45

hoerup (Nov 23 '22)

> @MajeQafouri Have you seen this one shayhatsor/zookeeper#45

Thanks for mentioning the link; I believe the root of the issue is the same. But do you have any solution for it?

MattGhafouri (Nov 23 '22)

I tried to use the ZooKeeper library (https://www.nuget.org/profiles/shayhatsor2) directly and faced the same issue in the dockerized environment. I then traced the NuGet package code and got an exception in the WriteLock method. Maybe the author can help us: @shayhatsor

```
WARNING WriteLock Caught: org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.deleteAsync(String path, Int32 version) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 587
   at org.apache.zookeeper.recipes.lock.WriteLock.DeleteNode.execute() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 120
   at org.apache.zookeeper.recipes.lock.WriteLock.unlock() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 85]
Exc level 0: org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.deleteAsync(String path, Int32 version) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 587
   at org.apache.zookeeper.recipes.lock.WriteLock.DeleteNode.execute() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 120
   at org.apache.zookeeper.recipes.lock.WriteLock.unlock() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 85
```

MattGhafouri (Nov 23 '22)

@MajeQafouri I'm looking into whether we can work around this problem within DistributedLock itself.

madelson (Nov 30 '22)

@MajeQafouri I've published a prerelease version of the DistributedLock.ZooKeeper package here: https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001

This replaces the ZooKeeperNetEx dependency with my initial attempt at resolving the issue (I haven't reproduced the issue locally so we'll see how it goes).

Can you pull down the prerelease and let me know whether that resolves your issue? Thanks!
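
For anyone else who wants to try it, the prerelease can be pulled into a project with the .NET CLI (version string taken from the link above):

```sh
dotnet add package DistributedLock.ZooKeeper --version 1.0.1-alpha001
```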

madelson (Dec 03 '22)

@MajeQafouri did you get a chance to test out the prerelease package?

madelson (Dec 16 '22)

Same issue here! Actually, I just faced this issue with 3 pods trying to use ZooKeeper in k8s. I'll try the alpha package and give feedback here.

devlnull (Dec 19 '22)

> @MajeQafouri did you get a chance to test out the prerelease package?

Sorry for the late answer; unfortunately, I'm packed these days and couldn't test the new package. BTW, thanks for the effort. As soon as I manage to test it, I'll keep you posted.

MattGhafouri (Dec 20 '22)

Hi, we could finally test the alpha package. I don't know if it helps, but we've tested it and compared it with the old package, and the result was acceptable: no connection loss anymore. Good luck.

devlnull (Dec 20 '22)

@devlnull excellent. Thanks for testing! Feel free to use the prerelease version for now and keep me posted on any issues you encounter. The only change is the replacement for ZooKeeperNetEx.

If @MajeQafouri also comes back with a good test result I'll publish it as a stable version.

madelson (Dec 22 '22)

Finally, I tested the new alpha version. Overall it's much better, with just a few connection loss exceptions. The hosting environment:

Client host env: hosted on k8s (3 pods); ZooKeeper: hosted on k8s (1 pod); result: rarely facing some connection loss exceptions.

The exception stack trace:

```
.ZookeeperLockProviderService[0]
---------Locking failed. Key:MyKeyName--.Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in C:\dev\zookeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 542
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 72
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 64
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 27
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 65
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 127
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in /_/DistributedLock.ZooKeeper/ZooKeeperDistributedLock.cs:line 91
   at .ZookeeperLockProviderService.DoActionWithLockAsync[TParameter](String lockKey, TParameter parameter, Func`2 func) in /src/src/Infrastructure//ZookeeperLockProviderService.cs:line 113
.ZookeeperLockProviderService[0]
```

I hope this is helpful.

MattGhafouri (Dec 23 '22)

Thanks @MajeQafouri!

For these few remaining connection losses, do they also appear in the ZooKeeper logs, or are they still "phantom" connection losses? Does the application recover after this happens or not?
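
A quick way to check the server side for matching events (a sketch; the pod name is a placeholder):

```sh
# Scan the ZooKeeper pod's log for session/connection events around the time of the client errors
kubectl logs zk1-0 --since=24h | grep -Ei "expir|closed socket|connection reset|too many connections"
```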

madelson (Dec 23 '22)

In the ZooKeeper log we don't get exception logs, but on the app side we have them, and the application seems to recover itself after the connection loss exception (actually after a SessionExpiredException).

Again, I ran the app on k8s with 3 pods and the ZooKeeper service on one pod. I'm sending you the app logs as well as the ZooKeeper log; you can also check the acquire and release log entries, which include the exceptions. I hope it's helpful.

MyApp_Pod1_Logs.txt MyApp_Pod2_Logs.txt MyApp_Pod3_Logs.txt logs-from-zookeeper.txt

MattGhafouri (Dec 26 '22)

@MajeQafouri I do see a disconnect/session expired log in the zookeeper log, though. Could that be related?

madelson (Jan 01 '23)

I hardly believe so, but I couldn't figure out why we get this error once in a while in the ZooKeeper console: "Connection reset by peer".

MattGhafouri (Jan 02 '23)

Unfortunately we are still getting connection loss sometimes, but it goes away within a minute.

Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

devlnull (Jan 02 '23)

@devlnull are you seeing "connection reset by peer"/"connection expired" in the zookeeper logs like @MajeQafouri is?

madelson (Jan 04 '23)

@madelson Actually I don't have access to the ZooKeeper logs yet, but I will soon. For now I have this stack trace info; it might help.

   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in C:\dev\zookeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 542
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 72
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 64
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 27
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 65
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 127
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in /_/DistributedLock.ZooKeeper/ZooKeeperDistributedLock.cs:line 91
   at Medallion.Threading.Internal.DistributedLockHelpers.ThrowTimeoutIfNull[T](ValueTask`1 task, String object) in /_/DistributedLock.Core/Internal/DistributedLockHelpers.cs:line 142
   at Medallion.Threading.Internal.Helpers.Convert[TDerived,TBase](ValueTask`1 task, ValueTaskConversion _) in /_/DistributedLock.Core/Internal/Helpers.cs:line 24

devlnull (Jan 04 '23)

@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?

madelson (Jan 06 '23)

> @devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?

Good idea, I'll test it.

MattGhafouri (Jan 06 '23)