Controller in leader election loop after losing network access
My controller, which uses leader election, lost network access to the API Server. After losing the leader election the controller was restarted once, but on restart it still had no network access, so it repeatedly failed to retrieve the lease and ended up stuck in a loop of "connection refused" errors.
In my case the controller was running with `replicas=1`, so the error halted reconciliation of resources until someone noticed and manually restarted the pod.
It would be useful if the controller exited when stuck in this state, so users would have a clear signal about what's wrong.
To reproduce
- Start a controller with leader election
- Disrupt network access to the API Server:
  ```sh
  nsenter -t $PID -n iptables -A OUTPUT -p tcp --dport 443 -j DROP
  ```
- Observe the controller restarting once and then getting stuck in a loop with logs like:
  ```
  E0822 08:45:07.492385 1 leaderelection.go:436] error retrieving resource lock ....
  E0822 08:45:09.903736 1 leaderelection.go:436] error retrieving resource lock ....
  E0822 08:45:12.148482 1 leaderelection.go:436] error retrieving resource lock ....
  ```
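For context, the first repro step assumes a manager with leader election enabled. A minimal controller-runtime setup along those lines might look like the sketch below; the lock ID and namespace are placeholders, not values from this issue:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Leader election is opted into via the manager options.
	// "example-controller-lock" is a placeholder ID.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-controller-lock",
		LeaderElectionNamespace: "default",
	})
	if err != nil {
		os.Exit(1)
	}
	// When the lease is lost, Start returns and the process exits,
	// which is what triggers the single restart described above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```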
In my case I'm trying out a custom health check for the liveness probe, which causes the kubelet to restart the container when it can't reach the API server:
```go
func APIConnectionCheck(mgr ctrl.Manager) func(req *http.Request) error {
	httpClient := mgr.GetHTTPClient()
	healthzURL := strings.TrimSuffix(mgr.GetConfig().Host, "/") + "/healthz"
	return func(req *http.Request) error {
		ctx, cancel := context.WithTimeout(req.Context(), 3*time.Second)
		defer cancel()
		httpReq, err := http.NewRequestWithContext(ctx, "GET", healthzURL, nil)
		if err != nil {
			return fmt.Errorf("failed to create request: %w", err)
		}
		// The HTTP client already has auth configured.
		resp, err := httpClient.Do(httpReq)
		if err != nil {
			return fmt.Errorf("API server unreachable: %w", err)
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("API server returned status %d", resp.StatusCode)
		}
		return nil
	}
}
```
```go
// Setting up the healthcheck with the controller manager
if err := mgr.AddHealthzCheck("healthz", APIConnectionCheck(mgr)); err != nil {
	setupLog.Error(err, "unable to set up health check")
	os.Exit(1)
}
```
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale