[bug]: Leader election is faulty when network timeout issues are present
Describe the bug
Still related to the recently closed https://github.com/buehler/dotnet-operator-sdk/issues/677: it is possible to get the operator into a state where leader election with high availability does not behave as expected.
To reproduce
- create an operator in Kubernetes with two pods
- let the operator do some work with custom resources in a namespace other than the one the operator is deployed in
- apply a network policy in the operator's namespace that blocks the operator pods' egress to the Kubernetes API server:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-kube-api
  namespace: my-operator-namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.1/32
      ports:
        - protocol: TCP
          port: 443
EOF
- wait for a while, until the timeout errors appear (takes about 10 minutes or so)
- remove the network policy
At this point you will see that either both pods act as leaders and both process resources, or neither pod does any work until one of the pods is restarted...
Expected behavior
When the deny network policy has been applied for 15+ minutes and is then removed, only one pod should continue processing while the other pod stays idle.
It would also be acceptable if the process exited with an error after a number of unsuccessful retries, but that is up for wider discussion.
@exextatic hey, thank you for working on this issue earlier. Could you please have a look if you still see the same behaviour and opine? 🙏
I wasn't able to reproduce this in my environment. Do you have any logs from the operators, and info as to which instance the k8s lease thinks is the leader after step 5?
I'm replicating it and the issue is quite consistent, using two pods with the logs side by side. Initially they start up and only one takes the leader role, which is good.
Then I apply the network policy and wait for both watchers to time out.
Then, when I delete the network policy again, both pods continue processing the resources and keep requeueing every 60 seconds.
It's worth mentioning that after the network policy was deleted, the top pod started processing resources immediately, while the bottom one took some time and then started again.
After restarting one of the pods, the behaviour goes back to what is expected.
Okay, I have a new hypothesis around this, but I will need someone to confirm whether it makes sense.
I ran the same test again, added the deny network policy, and filtered the output to show only the resource watcher logs. The logs showed this:
pod 1
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3227 info: Microsoft.Hosting.Lifetime[14] Now listening on: http://[::]:8282
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3375 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3403 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Production
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3404 info: Microsoft.Hosting.Lifetime[0] Content root path: /usr/src/
03/07/2024, 10:21:37 | 03-07-2024 09:21:37.5003 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance started leading, starting watcher.
03/07/2024, 10:21:37 | 03-07-2024 09:21:37.5006 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Starting resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:21:37 | 03-07-2024 09:21:37.5418 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Started resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:24:53 | 03-07-2024 09:24:53.0865 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance stopped leading, stopping watcher.
03/07/2024, 10:24:53 | 03-07-2024 09:24:53.0886 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopping resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:24:53 | 03-07-2024 09:24:53.1653 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopped resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:28:01 | 03-07-2024 09:28:01.3211 fail: KubeOps.Operator.Queue.EntityRequeueBackgroundService[0] Queued reconciliation for the entity of type V1FinbourneAppsManifestEntity for "FinbourneApps/operator-lusid" failed. System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core\|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken) at k8s.AbstractKubernetes.ICustomObjectsOperations_GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.CustomObjectsOperationsExtensions.GetNamespacedCustomObjectAsync[T](ICustomObjectsOperations operations, String group, String version, String namespaceParameter, String plural, String name, CancellationToken cancellationToken) at KubeOps.KubernetesClient.KubernetesClient.GetAsync[TEntity](String name, String namespace, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.WatchAsync(CancellationToken cancellationToken)
pod 2
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9197 info: Microsoft.Hosting.Lifetime[14] Now listening on: http://[::]:8282
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9307 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9328 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Production
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9328 info: Microsoft.Hosting.Lifetime[0] Content root path: /usr/src/
03/07/2024, 10:19:11 | 03-07-2024 09:19:11.4554 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance started leading, starting watcher.
03/07/2024, 10:19:11 | 03-07-2024 09:19:11.4558 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Starting resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:19:11 | 03-07-2024 09:19:11.8234 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Started resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:21:32 | 03-07-2024 09:21:32.4607 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance stopped leading, stopping watcher.
03/07/2024, 10:21:32 | 03-07-2024 09:21:32.4620 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopping resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:21:32 | 03-07-2024 09:21:32.4645 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopped resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:26:22 | 03-07-2024 09:26:22.0738 fail: KubeOps.Operator.Queue.EntityRequeueBackgroundService[0] Queued reconciliation for the entity of type V1FinbourneAppsManifestEntity for "FinbourneApps/tenant-scheduler" failed. System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core\|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken) at k8s.AbstractKubernetes.ICustomObjectsOperations_GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.CustomObjectsOperationsExtensions.GetNamespacedCustomObjectAsync[T](ICustomObjectsOperations operations, String group, String version, String namespaceParameter, String plural, String name, CancellationToken cancellationToken) at KubeOps.KubernetesClient.KubernetesClient.GetAsync[TEntity](String name, String namespace, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.WatchAsync(CancellationToken cancellationToken)
03/07/2024, 10:28:02 | 03-07-2024 09:28:02.0767 fail: KubeOps.Operator.Queue.EntityRequeueBackgroundService[0] Queued reconciliation for the entity of type V1FinbourneAppsManifestEntity for "FinbourneApps/finbourne-cluster-operator-keepalive" failed. System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core\|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken) at k8s.AbstractKubernetes.ICustomObjectsOperations_GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.CustomObjectsOperationsExtensions.GetNamespacedCustomObjectAsync[T](ICustomObjectsOperations operations, String group, String version, String namespaceParameter, String plural, String name, CancellationToken cancellationToken) at KubeOps.KubernetesClient.KubernetesClient.GetAsync[TEntity](String name, String namespace, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.WatchAsync(CancellationToken cancellationToken)
I have about 8 custom resources, and 2 of them had their corresponding "EntityRequeueBackgroundService" return a timeout/operation cancelled error.
After I deleted the network policy to allow normal operations, the 6 custom resources on pod 1 continued to be processed, BUT the 2 custom resources whose EntityRequeueBackgroundService hit operation-canceled errors are no longer being requeued.
Soon after, pod 2 became the leader and started processing resources alongside pod 1. However, the custom resources on pod 2 whose EntityRequeueBackgroundService also hit task-cancelled exceptions have not started being processed either.
My hypothesis now is that the resources continued to be processed on both pods because, while the instance had stopped leading, the reconciliation async task was still running, and when it finished it triggered another run via a requeue. This caused the instance to keep processing resources indefinitely even though it was no longer leading.
I wonder if we need some way to prevent requeues when the instance is not leading? Perhaps we need to skip reconciliation here if the instance is not leading? https://github.com/buehler/dotnet-operator-sdk/blob/main/src/KubeOps.Operator/Queue/EntityRequeueBackgroundService.cs#L118
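For illustration, here is a rough sketch of what such a guard could look like. These are made-up names, not the actual KubeOps types: `ILeadershipState` stands in for whatever exposes the current lease state, and `RequeueWorker` stands in for the requeue background service.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Made-up abstraction for "does this pod currently hold the leader lease?".
public interface ILeadershipState
{
    bool IsLeader { get; }
}

// Simplified stand-in for the requeue background service.
public sealed class RequeueWorker<TEntity>
{
    private readonly ILeadershipState _leadership;
    private readonly Func<TEntity, CancellationToken, Task> _reconcile;

    public RequeueWorker(ILeadershipState leadership, Func<TEntity, CancellationToken, Task> reconcile)
    {
        _leadership = leadership;
        _reconcile = reconcile;
    }

    public async Task ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken)
    {
        // Guard: a pod that has lost (or never held) the lease drops requeued
        // items instead of reconciling them and re-enqueueing follow-up work.
        if (!_leadership.IsLeader)
        {
            return;
        }

        await _reconcile(queued, cancellationToken);
    }
}
```

The actual wiring would of course have to happen inside EntityRequeueBackgroundService itself; this only shows the shape of the check.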
@exextatic @buehler do you have any thoughts on this?
EDIT:
Also, another thing to add: the 2 resources whose EntityRequeueBackgroundService operation-canceled exceptions were triggered have not been picked up by the operator again, even after the instance started leading again. They stopped being processed until they were deleted and recreated, or the pod was restarted :/ (This means that if network issues last long enough and the EntityRequeueBackgroundService for every entity gets cancelled, then even after a pod starts leading again, none of the resources on either pod will be processed until a pod restart or a deletion/modification happens.) In that case we would need something else to ensure that the cancelled requeues are all picked up again once the instance leads. Even though an "ADDED" event was triggered for these resources after the instance started leading, ReconcileAsync did not trigger. I suppose that is because the generation did not change here: https://github.com/buehler/dotnet-operator-sdk/blob/main/src/KubeOps.Operator/Watcher/ResourceWatcher%7BTEntity%7D.cs#L226. Perhaps when an instance stops leading we should also clear the entity cache?
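To illustrate the cache-clearing idea (again with made-up names rather than the real ResourceWatcher internals): the generation cache could simply be dropped on the stopped-leading transition, so that watch events replayed after re-acquiring the lease are not filtered out as "generation unchanged".

```csharp
using System.Collections.Concurrent;

// Made-up stand-in for the per-entity generation cache used to skip
// watch events whose .metadata.generation has not changed.
public sealed class EntityGenerationCache
{
    private readonly ConcurrentDictionary<string, long> _generations = new();

    public bool IsNewGeneration(string key, long generation) =>
        !_generations.TryGetValue(key, out var cached) || generation > cached;

    public void Store(string key, long generation) => _generations[key] = generation;

    public void Clear() => _generations.Clear();
}

public sealed class LeaderAwareWatcherSketch
{
    private readonly EntityGenerationCache _cache;

    public LeaderAwareWatcherSketch(EntityGenerationCache cache) => _cache = cache;

    // Imagined hook for the "stopped leading, stopping watcher" transition:
    // forgetting the cached generations means the ADDED events seen after the
    // next "started leading" are reconciled even if the generation is unchanged.
    public void OnStoppedLeading() => _cache.Clear();
}
```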
Holy cow. This could actually be the issue. I do not recall implementing such a check for requeues. As such, the error could be that requeued entities are processed regardless of whether the current pod is leading or not.
That would make sense. Any suggestions on how to approach the fix for it?
Hmmm yes, actually, when quitting the watcher (because the connection dropped or the controller went out of scope) we could just clear the queue.
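A minimal sketch of that idea, with hypothetical names rather than the actual queue implementation: the watcher clears the pending requeue items when it stops, so a former leader does not keep working off a stale backlog.

```csharp
using System.Collections.Concurrent;

// Made-up, simplified stand-in for the in-memory requeue backlog.
public sealed class RequeueBacklog<TEntity>
{
    private readonly ConcurrentDictionary<string, TEntity> _pending = new();

    public int Count => _pending.Count;

    public void Enqueue(string key, TEntity entity) => _pending[key] = entity;

    // Called when the watcher shuts down (lost lease, dropped connection,
    // controller going out of scope): discard everything still queued so a
    // former leader does not continue reconciling after losing the lease.
    public void Clear() => _pending.Clear();
}
```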