
`TestControllerHA` is flaky

Open dprotaso opened this issue 1 year ago • 12 comments

It seems like TestControllerHA in regular e2e is flaky - I wonder if it's because we haven't bumped this constant and potentially other ones in the e2e scripts

https://github.com/knative/serving/blob/a0a1ac776600486a21b7714a33be84d8be18884e/test/ha/ha.go#L40

Originally posted by @dprotaso in https://github.com/knative/serving/issues/15236#issuecomment-2126054880

dprotaso avatar May 23 '24 14:05 dprotaso

/assign @skonto

dprotaso avatar May 23 '24 14:05 dprotaso

failed run here: https://prow.knative.dev/view/gs/knative-prow/logs/nightly_serving_main_periodic/1793933717098467328

(in case you need logs)

dprotaso avatar May 24 '24 13:05 dprotaso

@dprotaso The failed run prints the following before timing out:

    ha.go:92: WaitForNewLeaders[controller] still see intersection: map[controller-d476b879b-hdtwr:{}]

It seems that there are enough new leaders (first check passes) but an old lease is not deleted, although we delete all the old pods one by one. The initial leader set (old) is:

    controller_test.go:53: Got initial leader set: map[controller-d476b879b-6vpsg:{} controller-d476b879b-8jbrb:{} controller-d476b879b-hdtwr:{} controller-d476b879b-pmlmc:{}]

Regarding the number of controllers: each time we enable a feature related to net-certmanager-reconciler, we make sure we disable it after the corresponding tests are done, and we also restart the controller pod. An exception to that is the https tests, where we run all tests end to end with https enabled (to be precise, there we could increase the number of controllers). Here we care about the regular run, and it does not seem to be related, since we have enough leaders (the lease count is sufficient). I think the question is whether we have old leases staying around (e.g. due to the new reconciler).

:thinking:

skonto avatar May 27 '24 09:05 skonto

It seems that is the case: when I restart the pod, the certificate reconciler's lease is not being garbage collected:

    kubectl -n knative-serving get lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01 -oyaml
    apiVersion: coordination.k8s.io/v1
    kind: Lease
    metadata:
      creationTimestamp: "2024-05-27T09:47:49Z"
      name: controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-01
      namespace: knative-serving
      resourceVersion: "18567"
      uid: 0f7b4bd6-3a67-410d-bfb2-3497f18d1f35
    spec:
      acquireTime: "2024-05-27T09:57:10.364190Z"
      holderIdentity: ""
      leaseDurationSeconds: 1
      leaseTransitions: 0
      renewTime: "2024-05-27T09:57:10.364190Z"

In our tests we have:

		if lease.Spec.HolderIdentity == nil {
			continue
		}

We never check for an empty string though (I verified that an empty string is what our tests see). So we should either delete all related leases after we restart the controller pod, or add that condition to the tests to skip reporting the old lease related to the certificate reconciler.
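
For the second option, a minimal sketch of the extra guard (reusing the loop and `lease` variable from the snippet above):

		// A released lease is not deleted; it sticks around with
		// holderIdentity set to "", so treat the empty string like nil.
		if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
			continue
		}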

skonto avatar May 27 '24 10:05 skonto

@skonto looks like this is still flaking - I think we also need to update the controller count in the test

https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/15277/istio-latest-no-mesh_serving_main/1796264673356550144

dprotaso avatar May 30 '24 21:05 dprotaso

@dprotaso

I can try that (we don't always have 10, btw; it depends on whether we have encryption on), but the failure shows:

    ha.go:93: WaitForNewLeaders[controller] still see intersection: map[controller-6fd686dfd4-glxh4:{}]

So that means not all reconcilers of the new pod are becoming leaders in time?

Could you elaborate on how the reconciler count is related to the failure we see?

The code is the following:

	err := wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true, func(ctx context.Context) (bool, error) {
		currLeaders, err := GetLeaders(ctx, t, client, deploymentName, namespace)
		if err != nil {
			return false, err
		}
		if len(currLeaders) < n {
			t.Logf("WaitForNewLeaders[%s] not enough leaders, got: %d, want: %d", deploymentName, len(currLeaders), n)
			return false, nil
		}
		l := sets.New[string](currLeaders...)
		if previousLeaders.HasAny(currLeaders...) {
			t.Logf("WaitForNewLeaders[%s] still see intersection: %v", deploymentName, previousLeaders.Intersection(l))
			return false, nil
		}
		leaders = l
		return true, nil
	})

n = buckets * number of reconcilers. Since we have 10 buckets and 9 (or 10) reconcilers in the tests, the `if len(currLeaders) < n {` check is going to pass anyway, because at some point we will have 90 or 100 leaders (whether or not the certificate reconciler is on).
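
Concretely (the constant names here are just illustrative, not the actual ones in test/ha/ha.go):

	const buckets = 10            // 10 buckets configured in the e2e setup
	numReconcilers := 9           // 10 when the certificate reconciler is enabled
	n := buckets * numReconcilers // 90 expected leaders (100 with encryption on)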

Now in the test we get the leader list and then, based on that, we delete each pod later on:

controller_test.go:53: Got initial leader set: map[controller-6fd686dfd4-6znfs:{} controller-6fd686dfd4-dfv4q:{} controller-6fd686dfd4-glxh4:{} controller-6fd686dfd4-pdksh:{}]

If controller-6fd686dfd4-glxh4 is deleted, then we should not see any lease with that pod in the holderIdentity field, but maybe sometimes leases are not updated fast enough. Anyway, I will debug further.
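
One way to double check is to list which leases still name the deleted pod. A rough sketch, assuming `client` is a kubernetes.Interface and that the holder identity starts with the pod name (which is what GetLeaders relies on):

	// List every lease in the namespace and flag any whose holder still
	// points at the pod that was just deleted.
	leases, err := client.CoordinationV1().Leases("knative-serving").List(ctx, metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	for _, l := range leases.Items {
		h := l.Spec.HolderIdentity
		if h != nil && strings.HasPrefix(*h, "controller-6fd686dfd4-glxh4") {
			t.Logf("lease %s is still held by the deleted pod (holder %q)", l.Name, *h)
		}
	}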

skonto avatar May 31 '24 09:05 skonto

For the sake of being precise, I am working on setting the reconciler count to 10 (it is tricky, as we need to know if we are running with encryption on), but it does not seem to be the root cause.

skonto avatar May 31 '24 13:05 skonto

@dprotaso it could be that we are facing something similar to https://github.com/kubernetes/client-go/issues/1155: maybe, although we do a graceful shutdown, the lease is not released when we restart the pod (to disable the certificate reconciler), hence the relic of the certificate lease. I will dig into it.

Latest run here: https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/15312/istio-latest-no-mesh_serving_main/1799043483181256704/build-log.txt. One lease is left without an identity update. I am wondering if the stop-leading callback was called for that bucket when the controller pod was killed.

    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.00-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.01-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.02-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.03-of-10 as it has no holder
    ha.go:70: GetLeaders[controller] adding lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.04-of-10 for pod controller-7f689f785c-jtsgf
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.05-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.06-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.07-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.08-of-10 as it has no holder
    ha.go:61: GetLeaders[controller] skipping lease controller.knative.dev.serving.pkg.reconciler.certificate.reconciler.09-of-10 as it has no holder

skonto avatar Jun 07 '24 23:06 skonto

A quick thing to check is to update the client-go libs and see if there is any change - in case they've fixed any bugs.

Generally we should be releasing the leases on a normal shutdown - I recall the library did that in the past.
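
In plain client-go, the release on shutdown only happens when the elector is configured for it; a minimal sketch of that wiring (illustrative only; Knative's actual setup goes through knative.dev/pkg, and `runElector` is a made-up name):

	// Uses k8s.io/client-go/tools/leaderelection and
	// k8s.io/client-go/tools/leaderelection/resourcelock.
	func runElector(ctx context.Context, kubeClient kubernetes.Interface, ns, leaseName, podID string) {
		lock := &resourcelock.LeaseLock{
			LeaseMeta:  metav1.ObjectMeta{Name: leaseName, Namespace: ns},
			Client:     kubeClient.CoordinationV1(),
			LockConfig: resourcelock.ResourceLockConfig{Identity: podID},
		}
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			// On context cancellation the elector writes the lease back with
			// an empty holderIdentity, which matches what the test observes.
			ReleaseOnCancel: true,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) { /* start reconciling this bucket */ },
				OnStoppedLeading: func() { /* stop work for this bucket */ },
			},
		})
	}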

dprotaso avatar Jun 10 '24 13:06 dprotaso

@dprotaso I ran some tests. It seems we are not waiting long enough for the lease cleanup. Check here. With 110 leases, a 30 second graceful shutdown is not enough. Also, as a side note, if you have too many leases you might run into rate-limit issues.
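
If client-side throttling is part of it, bumping the rest.Config limits before building the client might be worth a try; at the default QPS of 5, roughly 110 lease updates alone take over 20 seconds. A sketch (the values are illustrative):

	// k8s.io/client-go/rest defaults to QPS=5 and Burst=10, which throttles
	// the ~110 lease release updates issued during a graceful shutdown.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cfg.QPS = 50    // more sustained requests per second
	cfg.Burst = 100 // and a larger burst for the shutdown spike
	kubeClient := kubernetes.NewForConfigOrDie(cfg)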

skonto avatar Jun 10 '24 14:06 skonto

As @dprotaso pointed out, it could be due to a bug in client-go, such as https://github.com/kubernetes/client-go/commit/f9eba8e8c3be1ac446fc7f8e4a4c4b74f4408fc0. I am exploring why the leases are not being updated.

skonto avatar Jun 14 '24 10:06 skonto

Latest update here: https://github.com/knative/serving/pull/15321#issuecomment-2170415789. The Go client library is not updating all of the leases.

skonto avatar Jun 15 '24 17:06 skonto