Etcd client connection issue when etcd restarts

Open satishviswanathan opened this issue 5 years ago • 10 comments

I'm using the etcd client in my Web API application and have implemented a watch, which works fine. However, I'm facing the following issue.

Etcd connection gets disconnected

To Reproduce

Steps to reproduce the behavior:

  1. During Web API bootstrap, initialize the etcd client, read the keys, register a watch, and load the IConfiguration with the keys.
  2. Change the value of any key in etcd.
  3. The callback is received with the updated value.
  4. Stop etcd and start it again.
  5. Change the value of any key.
  6. The watch callback is not received. It seems that the connection was lost when etcd restarted and has not been re-established.

Expected behavior

The application should receive the callback notification with the updated key/value.

satishviswanathan avatar Jun 22 '20 21:06 satishviswanathan

@satishviswanathan Do you get any sort of exception when the etcd server restarts?

shubhamranjan avatar Jul 03 '20 20:07 shubhamranjan

@shubhamranjan I'm not seeing an exception, but it seems the application loses its connection to etcd, and subsequent watch notifications are not received after the restart.

satishviswanathan avatar Jul 10 '20 23:07 satishviswanathan

Noted. Etcd restarts aren't handled yet for etcd watches, nor is there a provision for an exception handler. This area is something we need to improve on.
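Until the library handles this, an application could supervise the watch itself. The following is only a minimal sketch: `WatchSupervisor` and the `runWatch` delegate are hypothetical names, not part of dotnet-etcd, and the sketch assumes the watch can be exposed as an awaitable `Task` that completes or faults when the stream ends. The traces posted in this thread end in `Task.ThrowAsync` on a thread-pool thread, which usually points to an exception escaping an `async void` method, so a wrapper like this may not be able to catch that particular failure.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of dotnet-etcd.
public static class WatchSupervisor
{
    // runWatch: assumed to open the watch and return a Task that completes
    // (or faults) when the underlying stream ends.
    public static async Task RunAsync(
        Func<CancellationToken, Task> runWatch,
        Action<Exception> onError,
        CancellationToken token,
        TimeSpan? initialDelay = null,
        TimeSpan? maxDelay = null)
    {
        var first = initialDelay ?? TimeSpan.FromSeconds(1);
        var max = maxDelay ?? TimeSpan.FromSeconds(30);
        var delay = first;

        while (!token.IsCancellationRequested)
        {
            var failed = false;
            try
            {
                await runWatch(token).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (token.IsCancellationRequested)
            {
                return; // shutdown was requested by the caller
            }
            catch (Exception ex)
            {
                failed = true;
                onError(ex); // surface the failure instead of swallowing it
            }

            if (!failed) delay = first; // the stream ended cleanly: reset the backoff

            try
            {
                await Task.Delay(delay, token).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                return;
            }

            // Exponential backoff after a failure, capped at max.
            if (failed) delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, max.Ticks));
        }
    }
}
```

Reporting each failure through `onError` rather than retrying silently lets the application decide whether its cached configuration can still be trusted after a reconnect.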

@yoricksmeets Did you get a chance to work on the watch manager?

shubhamranjan avatar Jul 15 '20 16:07 shubhamranjan

I connect to an etcd cluster with dotnet-etcd version 3.2.0. The cluster is composed of three etcd nodes. At first I connected to the cluster successfully and watched some data changes. But when I shut down the etcd nodes one by one, the application threw an unhandled exception as follows:

Grpc.Core.RpcException
HResult=0x80131500
Message=Status(StatusCode=Unknown, Detail="Stream removed")
Source=System.Private.CoreLib
StackTrace:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Grpc.Core.Internal.ClientResponseStream`2.<MoveNext>d__5.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<>c__DisplayClass122_1.<<WatchRange>b__0>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<WatchRange>d__122.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

yanhua1012 avatar Nov 17 '20 08:11 yanhua1012

@yanhua1012 Can you explain in a bit more detail how it happens? For example:

  1. One node goes down, one error occurs, what happens to subsequent work, and so on.

shubhamranjan avatar Dec 08 '20 12:12 shubhamranjan

  1. The etcd client connects to the etcd cluster successfully and watches some data.
  2. Shut down the first node of the cluster.
  3. Shut down the second node of the cluster.
  4. Start the second node of the cluster.

Then one unhandled exception occurs (not immediately), as follows:

   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
   at Grpc.Core.Internal.ClientResponseStream`2.<MoveNext>d__5.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<>c__DisplayClass122_1.<<WatchRange>b__0>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<WatchRange>d__122.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

yanhua1012 avatar Dec 14 '20 04:12 yanhua1012

Another case:

  1. The etcd client connects to the etcd cluster successfully and watches some data.
  2. Shut down the first node of the cluster.
  3. Shut down the second node of the cluster.
  4. Shut down the third node of the cluster.

Then one unhandled exception occurs (not immediately), as follows:

   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<WatchRange>d__122.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

yanhua1012 avatar Dec 14 '20 04:12 yanhua1012

@shubhamranjan, just reaching out to see whether this scenario is being considered as a feature for a future release.

satishviswanathan avatar Mar 28 '22 18:03 satishviswanathan

A silent reconnect to a watch may be dangerous. If the watch restarts from a specific revision, callbacks are invoked again from that revision, which breaks the ordering guarantee of the watch stream (https://etcd.io/docs/v3.3/learning/api_guarantees/#consistency). If the watch restarts from the current revision, the reconnect can skip some events.
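One possible way to avoid both outcomes is to remember the last revision that was fully delivered and resume from the next one. The sketch below is an illustration only: `WatchResumeTracker` is a hypothetical helper, not part of dotnet-etcd, and it assumes the caller can read each event's mod revision and can set `start_revision` (inclusive, per the etcd watch API) when re-creating the watch. It filters per response batch because events from the same revision share a mod revision. If the resume revision has already been compacted, etcd cancels the watch, and the application would have to re-read the keys instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of dotnet-etcd.
public sealed class WatchResumeTracker
{
    // Highest revision whose events have all been handed to the application.
    // 0 means nothing has been delivered yet.
    private long _lastCompletedRevision;

    // initialRevision: assumed to be the header revision of the initial read,
    // so that changes made between the read and the first watch are not lost.
    public WatchResumeTracker(long initialRevision = 0)
    {
        _lastCompletedRevision = initialRevision;
    }

    // Value to use as start_revision when (re)creating the watch.
    // 0 leaves the field unset, which etcd treats as "start from now".
    public long NextStartRevision =>
        _lastCompletedRevision == 0 ? 0 : _lastCompletedRevision + 1;

    // Takes the mod revisions of one watch response and returns only those
    // that were not delivered before. Events sharing a revision stay together.
    public List<long> FilterNew(IEnumerable<long> modRevisions)
    {
        var fresh = new List<long>();
        var max = _lastCompletedRevision;
        foreach (var revision in modRevisions)
        {
            if (revision <= _lastCompletedRevision) continue; // replayed event
            fresh.Add(revision);
            if (revision > max) max = revision;
        }
        _lastCompletedRevision = max; // the whole batch is now accounted for
        return fresh;
    }
}
```

With this approach a reconnect neither replays events the callback has already seen nor jumps ahead to the current revision.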

We have sometimes hit both of these problems.

But the root of the problem is that occasionally the connection is not closed, and the watch waits indefinitely. Client code cannot handle this.

This can easily be reproduced by running kubectl port-forward to any etcd in k8s, starting a local etcdctl watch against that etcd, and then stopping the port-forwarding.

The problem may be solved with gRPC keep-alive pings, but they are available only from dotnet6+: https://docs.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-6.0#keep-alive-pings
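For reference, a channel with keep-alive pings enabled looks roughly like the fragment below, following the linked Microsoft documentation. It requires the Grpc.Net.Client package; the address and the timing values are placeholders, and how such a channel or handler would be passed to dotnet-etcd depends on the client version, so that part is deliberately not shown.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using Grpc.Net.Client;

var handler = new SocketsHttpHandler
{
    // Keep idle pooled connections open instead of closing them.
    PooledConnectionIdleTimeout = Timeout.InfiniteTimeSpan,
    // Send an HTTP/2 ping after this much inactivity on the connection.
    KeepAlivePingDelay = TimeSpan.FromSeconds(60),
    // Treat the connection as dead if the ping is not answered in time.
    KeepAlivePingTimeout = TimeSpan.FromSeconds(30),
    EnableMultipleHttp2Connections = true
};

// Placeholder address; point this at an etcd endpoint.
var channel = GrpcChannel.ForAddress("https://localhost:2379", new GrpcChannelOptions
{
    HttpHandler = handler
});
```

With pings enabled, a connection that was dropped without being closed (as in the port-forward reproduction above) should be detected once the ping times out, so the pending watch call fails instead of waiting forever.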

setood avatar Jul 16 '22 10:07 setood

Thanks @setood. I see a lot of great features have been added to the gRPC client. I will try to redesign the whole client.

IMO, it should be fair to say that people should be able to easily upgrade their dotnet versions, considering that there are not that many breaking changes in most areas.

shubhamranjan avatar Jul 16 '22 10:07 shubhamranjan