dotnet-operator-sdk
[bug]: LeaderAwareResourceWatcher does not regain leadership after network issues
Describe the bug
I have observed that the LeaderAwareResourceWatcher loses leadership after network issues and never regains it.
To reproduce
- Call LeaderAwareResourceWatcher.StartAsync()
- Wait for LeaderAwareResourceWatcher to be connected
- Create network issue (e.g. reset_peer with toxiproxy)
- "This instance stopped leading, stopping watcher." will be logged
- Resolve network issue
- "This instance started leading, starting watcher." is not logged again
Expected behavior
"This instance started leading, starting watcher." should be logged again after the network issue is resolved.
Screenshots
No response
Additional Context
Version: 8.0.0-pre.29
I think the LeaderElector.RunAsync() API is very confusing and not documented, but it seems like the OnStoppedLeading event is only raised in the finally clause of RunAsync:
https://github.com/kubernetes-client/csharp/blob/15ad5bdfc451debbca2e0d23821cef4393885525/src/KubernetesClient/LeaderElection/LeaderElector.cs#L104-L108
Therefore I guess it is necessary to call RunAsync in a loop until the LeaderAwareResourceWatcher is stopped.
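A minimal sketch of that loop, not the SDK's actual implementation. To keep it runnable without a cluster, the election call is passed in as a delegate and replaced by a fake (FakeRunOnceAsync) that "loses leadership" immediately; the names RunElectionLoopAsync, FakeRunOnceAsync and retryDelay are made up for illustration. In the watcher, the delegate would presumably wrap LeaderElector.RunAsync, assuming it accepts a CancellationToken.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for LeaderElector.RunAsync: each call represents one
// election run that returns once leadership has been lost.
int terms = 0;
using var stopping = new CancellationTokenSource();

Task FakeRunOnceAsync(CancellationToken ct)
{
    terms++;
    Console.WriteLine($"term {terms}: started leading, then stopped leading");
    if (terms == 3)
    {
        stopping.Cancel(); // simulate the watcher being stopped
    }
    return Task.CompletedTask;
}

// Re-enters the election after every run until cancellation is requested.
async Task RunElectionLoopAsync(
    Func<CancellationToken, Task> runOnce,
    TimeSpan retryDelay,
    CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        try
        {
            await runOnce(ct);
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            break; // normal shutdown
        }
        catch (Exception ex)
        {
            // For example a transient network or API server failure.
            Console.WriteLine($"election run failed: {ex.Message}");
        }

        try
        {
            // Short pause so a persistent failure does not cause a busy loop.
            await Task.Delay(retryDelay, ct);
        }
        catch (OperationCanceledException)
        {
            break;
        }
    }
}

await RunElectionLoopAsync(FakeRunOnceAsync, TimeSpan.FromMilliseconds(10), stopping.Token);
Console.WriteLine($"election runs: {terms}");
```

The fake cancels the token during its third run, so the loop re-enters the election twice before exiting. The delay between runs is a design choice, not something the issue prescribes.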