
[bug]: LeaderAwareResourceWatcher does not regain leadership after network issues

Open • PSanetra opened this issue 7 months ago • 1 comment

Describe the bug

I have observed that the LeaderAwareResourceWatcher loses leadership after network issues and never regains it.

To reproduce

  1. Call LeaderAwareResourceWatcher.StartAsync()
  2. Wait for the LeaderAwareResourceWatcher to be connected
  3. Create a network issue (e.g. reset_peer with toxiproxy)
  4. "This instance stopped leading, stopping watcher." is logged
  5. Resolve the network issue
  6. "This instance started leading, starting watcher." is not logged again

Expected behavior

"This instance started leading, starting watcher." should be logged again after the network issue is resolved.

Screenshots

No response

Additional Context

Version: 8.0.0-pre.29

PSanetra avatar Dec 04 '23 17:12 PSanetra

I think the LeaderElector.RunAsync() API is very confusing and undocumented, but it seems that the OnStoppedLeading event is only raised in the finally clause of RunAsync: https://github.com/kubernetes-client/csharp/blob/15ad5bdfc451debbca2e0d23821cef4393885525/src/KubernetesClient/LeaderElection/LeaderElector.cs#L104-L108

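That would mean RunAsync has already returned by the time leadership is reported as lost, so nothing tries to acquire it again. Therefore I guess RunAsync needs to be called in a loop until the LeaderAwareResourceWatcher is stopped.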
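
A minimal sketch of such a loop, not taken from the SDK. The helper name RunUntilCancelledAsync, the retryDelay parameter and the runOnce delegate are all hypothetical. runOnce stands in for a single call to LeaderElector.RunAsync(token), so the sketch compiles without the KubernetesClient package.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LeaderElectionLoop
{
    // Hypothetical helper: repeats one election attempt until the caller cancels.
    // runOnce is an assumption standing in for elector.RunAsync(token).
    public static async Task RunUntilCancelledAsync(
        Func<CancellationToken, Task> runOnce,
        TimeSpan retryDelay,
        CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                // Returns (or throws) once leadership is lost or the attempt fails.
                await runOnce(stoppingToken);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                // The watcher is being stopped: leave the loop.
                break;
            }
            catch (Exception)
            {
                // A failed attempt (e.g. a network error) must not end the loop.
                // Real code would log the exception here.
            }

            try
            {
                // Short pause so a persistent failure does not cause a busy loop.
                await Task.Delay(retryDelay, stoppingToken);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
    }
}
```

Assuming elector is a k8s.LeaderElection.LeaderElector, the watcher could call it roughly as RunUntilCancelledAsync(ct => elector.RunAsync(ct), TimeSpan.FromSeconds(1), stoppingToken), so that a new election attempt starts each time the previous one ends.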

PSanetra avatar Dec 04 '23 17:12 PSanetra