Etcd client connection issue when etcd restarts
I'm using the etcd client in my web API application and have implemented a watch, which works fine. However, I'm facing the following issue.
The etcd connection gets disconnected.
To Reproduce
Steps to reproduce the behavior:
- During web API bootstrap, initialize the etcd client, read the keys, register a watch, and load the IConfiguration with the keys (a minimal sketch of this setup follows these steps).
- Change the value of any key in etcd.
- We are able to get the callback with the updated value.
- Stop etcd and start it again.
- Change the value of any key.
- We are unable to receive the watch callback. It seems the connection was lost when etcd restarted and was never re-established.
Expected behavior
The application should receive the callback notification with the updated key/value.
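For reference, a minimal sketch of the setup described above, assuming the dotnet-etcd 3.x API surface (`EtcdClient`, `GetVal`, and the `Watch(string, Action<WatchResponse>)` overload; adjust names to your version):

```csharp
using System;
using dotnet_etcd;
using Etcdserverpb;

class Program
{
    static void Main()
    {
        // Connect to a single etcd endpoint (adjust the address for your cluster).
        var client = new EtcdClient("http://localhost:2379");

        // Initial read of a configuration key.
        string value = client.GetVal("config/app/setting");
        Console.WriteLine($"initial value: {value}");

        // Register a watch; the callback fires on every change to the key.
        client.Watch("config/app/setting", (WatchResponse response) =>
        {
            foreach (var ev in response.Events)
            {
                Console.WriteLine($"{ev.Kv.Key.ToStringUtf8()} -> {ev.Kv.Value.ToStringUtf8()}");
            }
        });

        Console.ReadLine(); // keep the process alive so callbacks can arrive
    }
}
```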
@satishviswanathan Do you get any sort of exception on etcd server restart?
@shubhamranjan I'm not seeing an exception, but it seems the application loses its connection to etcd, and subsequent watch notifications are not received after the restart.
Noted. Etcd restarts haven't been handled yet for etcd watches, nor is there a provision for an exception handler. This is an area we need to improve on.
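Until that lands, one possible client-side workaround is to periodically probe the connection and re-register the watch when the probe fails. A rough sketch; `WatchWithRecovery` is a hypothetical helper, not part of dotnet-etcd, and it assumes a failed `GetVal` probe means the watch stream is dead too:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using dotnet_etcd;
using Etcdserverpb;
using Grpc.Core;

public static class WatchRecovery
{
    // Hypothetical helper: re-registers a watch whenever a periodic probe
    // against etcd fails, on the assumption that a failed probe means the
    // underlying channel (and therefore the watch stream) is dead.
    public static async Task WatchWithRecovery(
        EtcdClient client, string key, Action<WatchResponse> onChange,
        CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            try
            {
                client.Watch(key, onChange); // register (or re-register) the watch

                // Probe until the connection breaks, then fall through to
                // the catch block and loop around to re-register.
                while (!ct.IsCancellationRequested)
                {
                    await Task.Delay(TimeSpan.FromSeconds(10), ct);
                    client.GetVal(key); // throws RpcException if the channel is down
                }
            }
            catch (RpcException)
            {
                await Task.Delay(TimeSpan.FromSeconds(5), ct); // back off, then retry
            }
        }
    }
}
```

Note the caveat raised later in this thread: naive re-registration can replay or skip events, so tracking the last delivered revision (see the sketch further down) is needed for correctness.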
@yoricksmeets Did you get a chance to work on the watch manager?
I connect to an etcd cluster with dotnet-etcd version 3.2.0. The cluster is composed of three etcd nodes. At first I connected to the cluster successfully and watched some data changes. But when I shut down the etcd nodes one by one, the application threw an unhandled exception as follows:

```
Grpc.Core.RpcException
HResult=0x80131500
Message=Status(StatusCode=Unknown, Detail="Stream removed")
Source=System.Private.CoreLib
StackTrace:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Grpc.Core.Internal.ClientResponseStream`2.<MoveNext>d__5.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<>c__DisplayClass122_1.<<WatchRange>b__0>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at dotnet_etcd.EtcdClient.<WatchRange>d__122.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
```
@yanhua1012 Can you explain a bit how it happens? For example:
- 1 node goes down, what error occurs, whether subsequent operations work, and so on.
1. The etcd client connects to the etcd cluster successfully and watches some data.
2. Shut down the first node of the cluster.
3. Shut down the second node of the cluster.
4. Start the second node of the cluster.

Then the following unhandled exception occurred (not immediately):
```
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
at Grpc.Core.Internal.ClientResponseStream`2.<MoveNext>d__5.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at dotnet_etcd.EtcdClient.<>c__DisplayClass122_1.<<WatchRange>b__0>d.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at dotnet_etcd.EtcdClient.<WatchRange>d__122.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
```
Another case:

1. The etcd client connects to the etcd cluster successfully and watches some data.
2. Shut down the first node of the cluster.
3. Shut down the second node of the cluster.
4. Shut down the third node of the cluster.

Then the following unhandled exception occurred (not immediately):

```
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at dotnet_etcd.EtcdClient.<WatchRange>d__122.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
```
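Those traces surface via `Task.ThrowAsync` on the thread pool, i.e. the exception escapes a fire-and-forget task inside the client, so a try/catch around the `Watch` call won't see it. As a stopgap, standard .NET process-wide hooks can at least make the failure observable; a sketch (these are framework APIs, not dotnet-etcd ones):

```csharp
using System;
using System.Threading.Tasks;

// Neither hook can resurrect the watch; they only give you a place to log
// the failure and trigger your own reconnect or shutdown logic.
TaskScheduler.UnobservedTaskException += (sender, e) =>
{
    Console.Error.WriteLine($"Unobserved task exception: {e.Exception}");
    e.SetObserved();
};

AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
{
    // Fires for exceptions rethrown onto the thread pool (as in the traces
    // above); the process still terminates afterwards.
    Console.Error.WriteLine($"Unhandled exception: {e.ExceptionObject}");
};
```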
@shubhamranjan Just reaching out to see if this scenario is being considered as a feature in a future release.
Silently reconnecting a watch may be dangerous.
If the watch starts from a specific revision, callbacks are invoked again from that revision, which breaks the watch stream's ordering guarantee:
https://etcd.io/docs/v3.3/learning/api_guarantees/#consistency
If the watch starts from the current revision, a reconnect can skip some events.
We have occasionally hit both of these problems.
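A safer reconnect follows the standard etcd pattern of recording the last revision delivered and resuming the watch from `lastRevision + 1`, so events are neither replayed nor skipped. A sketch, assuming dotnet-etcd's `Watch(WatchRequest, Action<WatchResponse>)` overload and the generated `Etcdserverpb` types:

```csharp
using System;
using dotnet_etcd;
using Etcdserverpb;
using Google.Protobuf;

public static class RevisionTrackingWatch
{
    // Newest revision delivered so far; a reconnect resumes just past it.
    // (Single watcher assumed; add synchronization for concurrent use.)
    private static long _lastRevision;

    public static void Start(EtcdClient client, string key)
    {
        var request = new WatchRequest
        {
            CreateRequest = new WatchCreateRequest
            {
                Key = ByteString.CopyFromUtf8(key),
                // 0 means "from now"; after a reconnect, resume one past
                // the last event we delivered.
                StartRevision = _lastRevision == 0 ? 0 : _lastRevision + 1,
            }
        };

        client.Watch(request, (WatchResponse response) =>
        {
            foreach (var ev in response.Events)
            {
                _lastRevision = ev.Kv.ModRevision;
                Console.WriteLine($"{ev.Kv.Key.ToStringUtf8()} @ rev {_lastRevision}");
            }
        });
    }
}
```

Note that if etcd compacts past the recorded revision while the client is disconnected, the resumed watch fails with a compaction error and a full re-read of the keys is required.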
But the root of the problem is that occasionally the connection is not closed, the watch waits indefinitely, and client code can't handle this.
That is easy to reproduce with kubectl port-forward to any etcd node in k8s: start a local etcdctl watch against the forwarded etcd, then stop the port-forwarding.
The problem may be solved with gRPC keep-alive pings, but they are only available from .NET 6+: https://docs.microsoft.com/en-us/aspnet/core/grpc/performance?view=aspnetcore-6.0#keep-alive-pings
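For reference, a sketch of enabling those pings on .NET 6+ through `Grpc.Net.Client` (the timings are illustrative, and this presumes the client is built on the managed gRPC transport rather than Grpc.Core):

```csharp
using System;
using System.Net.Http;
using Grpc.Net.Client;

// HTTP/2 keep-alive pings detect a dead connection even when the watch
// stream is idle, so a broken channel fails fast instead of hanging.
var handler = new SocketsHttpHandler
{
    KeepAlivePingDelay = TimeSpan.FromSeconds(30),   // ping after 30s of inactivity
    KeepAlivePingTimeout = TimeSpan.FromSeconds(10), // fail if no ack within 10s
    KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
    EnableMultipleHttp2Connections = true,
};

var channel = GrpcChannel.ForAddress("http://localhost:2379",
    new GrpcChannelOptions { HttpHandler = handler });
```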
Thanks @setood. I see a lot of great features have been added to the gRPC client. I will try to redesign the whole client.
IMO, it's fair to say that people should be able to easily upgrade their .NET versions, considering breaking changes are minimal in most areas.