[bug]: Leader election is faulty when network timeout issues are present
Describe the bug
Still related to the recently closed https://github.com/buehler/dotnet-operator-sdk/issues/677: it is possible to get the operator into a state where leader election with high availability does not behave as expected.
To reproduce
- create an operator in Kubernetes with two pods
- let the operator do some work with custom resources in a namespace other than the one the operator is deployed in
- apply a network policy in the operator's namespace that blocks the operator pods' egress to the Kubernetes API server:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-kube-api
  namespace: my-operator-namespace
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.1/32
      ports:
        - protocol: TCP
          port: 443
EOF
- wait for a while, until the timeout errors appear (takes about 10 minutes or so)
- remove the network policy
At this point you will see that either both pods act as leaders and both process resources, or neither pod does any work until one of the pods is restarted...
Expected behavior
When the deny network policy has been applied for 15+ minutes and is then removed, only one pod should continue processing while the other pod stays idle.
It would also be acceptable if the process exited with an error after a number of unsuccessful retries, but that is up for wider discussion.
@exextatic hey, thank you for working on this issue earlier. Could you please have a look if you still see the same behaviour and opine? 🙏
I wasn't able to reproduce this in my environment. Do you have any logs from the operators, and info as to which instance the k8s lease thinks is the leader after step 5?
I'm replicating it and the issue is quite consistent, using two pods with the logs side by side. Initially they start up and only one takes the leader role, which is good.
Then I apply the network policy and wait for both watchers to time out.
Then, when I delete the network policy again, both pods continue processing the resources and keep requeueing every 60 seconds.
It's worth mentioning that after the network policy was deleted, the top pod started processing resources immediately, while the bottom one took some time and then started again.
After restarting one of the pods, the behaviour goes back to what is expected.
Okay, I have a new hypothesis around this, but I will need someone to confirm whether it makes sense.
I ran the same test again, added the deny network policy, and filtered the output to show only the resource watcher logs. The logs showed this:
pod 1
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3227 info: Microsoft.Hosting.Lifetime[14] Now listening on: http://[::]:8282
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3375 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3403 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Production
03/07/2024, 10:19:00 | 03-07-2024 09:19:00.3404 info: Microsoft.Hosting.Lifetime[0] Content root path: /usr/src/
03/07/2024, 10:21:37 | 03-07-2024 09:21:37.5003 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance started leading, starting watcher.
03/07/2024, 10:21:37 | 03-07-2024 09:21:37.5006 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Starting resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:21:37 | 03-07-2024 09:21:37.5418 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Started resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:24:53 | 03-07-2024 09:24:53.0865 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance stopped leading, stopping watcher.
03/07/2024, 10:24:53 | 03-07-2024 09:24:53.0886 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopping resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:24:53 | 03-07-2024 09:24:53.1653 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopped resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:28:01 | 03-07-2024 09:28:01.3211 fail: KubeOps.Operator.Queue.EntityRequeueBackgroundService[0] Queued reconciliation for the entity of type V1FinbourneAppsManifestEntity for "FinbourneApps/operator-lusid" failed. System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core\|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken) at k8s.AbstractKubernetes.ICustomObjectsOperations_GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.CustomObjectsOperationsExtensions.GetNamespacedCustomObjectAsync[T](ICustomObjectsOperations operations, String group, String version, String namespaceParameter, String plural, String name, CancellationToken cancellationToken) at KubeOps.KubernetesClient.KubernetesClient.GetAsync[TEntity](String name, String namespace, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.WatchAsync(CancellationToken cancellationToken)
pod 2
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9197 info: Microsoft.Hosting.Lifetime[14] Now listening on: http://[::]:8282
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9307 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9328 info: Microsoft.Hosting.Lifetime[0] Hosting environment: Production
03/07/2024, 10:18:54 | 03-07-2024 09:18:54.9328 info: Microsoft.Hosting.Lifetime[0] Content root path: /usr/src/
03/07/2024, 10:19:11 | 03-07-2024 09:19:11.4554 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance started leading, starting watcher.
03/07/2024, 10:19:11 | 03-07-2024 09:19:11.4558 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Starting resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:19:11 | 03-07-2024 09:19:11.8234 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Started resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:21:32 | 03-07-2024 09:21:32.4607 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] This instance stopped leading, stopping watcher.
03/07/2024, 10:21:32 | 03-07-2024 09:21:32.4620 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopping resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:21:32 | 03-07-2024 09:21:32.4645 info: KubeOps.Operator.Watcher.LeaderAwareResourceWatcher[0] Stopped resource watcher for V1FinbourneAppsManifestEntity.
03/07/2024, 10:26:22 | 03-07-2024 09:26:22.0738 fail: KubeOps.Operator.Queue.EntityRequeueBackgroundService[0] Queued reconciliation for the entity of type V1FinbourneAppsManifestEntity for "FinbourneApps/tenant-scheduler" failed. System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core\|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken) at k8s.AbstractKubernetes.ICustomObjectsOperations_GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.CustomObjectsOperationsExtensions.GetNamespacedCustomObjectAsync[T](ICustomObjectsOperations operations, String group, String version, String namespaceParameter, String plural, String name, CancellationToken cancellationToken) at KubeOps.KubernetesClient.KubernetesClient.GetAsync[TEntity](String name, String namespace, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.WatchAsync(CancellationToken cancellationToken)
03/07/2024, 10:28:02 | 03-07-2024 09:28:02.0767 fail: KubeOps.Operator.Queue.EntityRequeueBackgroundService[0] Queued reconciliation for the entity of type V1FinbourneAppsManifestEntity for "FinbourneApps/finbourne-cluster-operator-keepalive" failed. System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts) at System.Net.Http.HttpClient.<SendAsync>g__Core\|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at k8s.Kubernetes.SendRequestRaw(String requestContent, HttpRequestMessage httpRequest, CancellationToken cancellationToken) at k8s.AbstractKubernetes.ICustomObjectsOperations_GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.AbstractKubernetes.k8s.ICustomObjectsOperations.GetNamespacedCustomObjectWithHttpMessagesAsync[T](String group, String version, String namespaceParameter, String plural, String name, IReadOnlyDictionary`2 customHeaders, CancellationToken cancellationToken) at k8s.CustomObjectsOperationsExtensions.GetNamespacedCustomObjectAsync[T](ICustomObjectsOperations operations, String group, String version, String namespaceParameter, String plural, String name, CancellationToken cancellationToken) at KubeOps.KubernetesClient.KubernetesClient.GetAsync[TEntity](String name, String namespace, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken) at KubeOps.Operator.Queue.EntityRequeueBackgroundService`1.WatchAsync(CancellationToken cancellationToken)
I have about 8 custom resources, and 2 of them had their corresponding "EntityRequeueBackgroundService" return a timeout/operation cancelled error.
After I deleted the network policy to allow normal operations, the 6 custom resources on pod 1 continued to be processed, BUT the 2 custom resources whose EntityRequeueBackgroundService hit operation-canceled errors are no longer being requeued.
Soon after, pod 2 became the leader and started processing resources alongside pod 1. However, the custom resources on pod 2 whose EntityRequeueBackgroundService also hit task-cancelled exceptions have not started being processed either.
My hypothesis now is that the resources continued to be processed on both pods because, while the instance had stopped leading, the reconciliation async task was still running, and when it finished it triggered another run via a requeue. This caused the instance to keep processing resources indefinitely even though it was no longer leading.
I wonder if we need some way to prevent requeues when the instance is not leading? Perhaps we need to skip reconciliation here if the instance is not leading? https://github.com/buehler/dotnet-operator-sdk/blob/main/src/KubeOps.Operator/Queue/EntityRequeueBackgroundService.cs#L118
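For illustration, here is a rough sketch of what such a guard could look like. These are made-up names, not the actual KubeOps types: `ILeadershipState` stands in for whatever exposes the current lease state, and `RequeueWorker` stands in for the requeue background service.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Made-up abstraction for "does this pod currently hold the leader lease?".
public interface ILeadershipState
{
    bool IsLeader { get; }
}

// Simplified stand-in for the requeue background service.
public sealed class RequeueWorker<TEntity>
{
    private readonly ILeadershipState _leadership;
    private readonly Func<TEntity, CancellationToken, Task> _reconcile;

    public RequeueWorker(ILeadershipState leadership, Func<TEntity, CancellationToken, Task> reconcile)
    {
        _leadership = leadership;
        _reconcile = reconcile;
    }

    public async Task ReconcileSingleAsync(TEntity queued, CancellationToken cancellationToken)
    {
        // Guard: a pod that has lost (or never held) the lease drops requeued
        // items instead of reconciling them and re-enqueueing follow-up work.
        if (!_leadership.IsLeader)
        {
            return;
        }

        await _reconcile(queued, cancellationToken);
    }
}
```

The actual wiring would of course have to happen inside EntityRequeueBackgroundService itself; this only shows the shape of the check.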
@exextatic @buehler do you have any thoughts on this?
EDIT:
Also, another thing to add: the 2 resources whose EntityRequeueBackgroundService operation-canceled exceptions were triggered have not been picked up by the operator again, even after the instance started leading again. They stopped being processed until they were deleted and recreated, or the pod was restarted :/ (This means that if network issues last long enough and the EntityRequeueBackgroundService for every entity gets cancelled, then even after a pod starts leading again, none of the resources on either pod will be processed until a pod restart or a deletion/modification happens.) In that case we would need something else to ensure that the cancelled requeues are all picked up again once the instance leads. Even though an "ADDED" event was triggered for these resources after the instance started leading, ReconcileAsync did not trigger. I suppose that is because the generation did not change here: https://github.com/buehler/dotnet-operator-sdk/blob/main/src/KubeOps.Operator/Watcher/ResourceWatcher%7BTEntity%7D.cs#L226. Perhaps when an instance stops leading we should also clear the entity cache?
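To illustrate the cache-clearing idea (again with made-up names rather than the real ResourceWatcher internals): the generation cache could simply be dropped on the stopped-leading transition, so that watch events replayed after re-acquiring the lease are not filtered out as "generation unchanged".

```csharp
using System.Collections.Concurrent;

// Made-up stand-in for the per-entity generation cache used to skip
// watch events whose .metadata.generation has not changed.
public sealed class EntityGenerationCache
{
    private readonly ConcurrentDictionary<string, long> _generations = new();

    public bool IsNewGeneration(string key, long generation) =>
        !_generations.TryGetValue(key, out var cached) || generation > cached;

    public void Store(string key, long generation) => _generations[key] = generation;

    public void Clear() => _generations.Clear();
}

public sealed class LeaderAwareWatcherSketch
{
    private readonly EntityGenerationCache _cache;

    public LeaderAwareWatcherSketch(EntityGenerationCache cache) => _cache = cache;

    // Imagined hook for the "stopped leading, stopping watcher" transition:
    // forgetting the cached generations means the ADDED events seen after the
    // next "started leading" are reconciled even if the generation is unchanged.
    public void OnStoppedLeading() => _cache.Clear();
}
```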
Holy cow. This could actually be the issue. I do not recall implementing such a check for requeues. As such, the error could be that requeued entities are processed regardless of whether the current pod is leading or not.
That would make sense. Any suggestions on how to approach the fix for it?
Hmmm yes, actually, when quitting the watcher (because the connection dropped or the controller went out of scope) we could just clear the queue.
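A minimal sketch of that idea, with hypothetical names rather than the actual queue implementation: the watcher clears the pending requeue items when it stops, so a former leader does not keep working off a stale backlog.

```csharp
using System.Collections.Concurrent;

// Made-up, simplified stand-in for the in-memory requeue backlog.
public sealed class RequeueBacklog<TEntity>
{
    private readonly ConcurrentDictionary<string, TEntity> _pending = new();

    public int Count => _pending.Count;

    public void Enqueue(string key, TEntity entity) => _pending[key] = entity;

    // Called when the watcher shuts down (lost lease, dropped connection,
    // controller going out of scope): discard everything still queued so a
    // former leader does not continue reconciling after losing the lease.
    public void Clear() => _pending.Clear();
}
```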