Controller in leader election loop after losing network access
My controller, which uses leader election, lost network access to the API Server. After losing the leader election the controller was restarted once, but on restart it still had no network access, so it repeatedly failed to retrieve the lease and ended up stuck in a loop of "connection refused" errors.
In my case the controller was running with `replicas=1`, so the error halted reconciliation of resources until someone noticed and manually restarted the pod.
It would be useful if the controller exited when stuck in this state, so users would have a clear signal about what's wrong.
To reproduce
- Start a controller with leader election
- Disrupt network access to the API Server:
  ```sh
  nsenter -t $PID -n iptables -A OUTPUT -p tcp --dport 443 -j DROP
  ```
- Observe the controller restarting once and then getting stuck in a loop with logs like:
  ```
  E0822 08:45:07.492385 1 leaderelection.go:436] error retrieving resource lock ....
  E0822 08:45:09.903736 1 leaderelection.go:436] error retrieving resource lock ....
  E0822 08:45:12.148482 1 leaderelection.go:436] error retrieving resource lock ....
  ```
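For context, the first repro step assumes a manager with leader election enabled. A minimal controller-runtime setup along those lines might look like the sketch below; the lock ID and namespace are placeholders, not values from this issue:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Leader election is opted into via the manager options.
	// "example-controller-lock" is a placeholder ID.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-controller-lock",
		LeaderElectionNamespace: "default",
	})
	if err != nil {
		os.Exit(1)
	}
	// When the lease is lost, Start returns and the process exits,
	// which is what triggers the single restart described above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```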
In my case I'm trying out a custom health check for the liveness probe, which causes the kubelet to restart the container when it can't reach the API server:
```go
func APIConnectionCheck(mgr ctrl.Manager) func(req *http.Request) error {
	httpClient := mgr.GetHTTPClient()
	healthzURL := strings.TrimSuffix(mgr.GetConfig().Host, "/") + "/healthz"
	return func(req *http.Request) error {
		ctx, cancel := context.WithTimeout(req.Context(), 3*time.Second)
		defer cancel()
		httpReq, err := http.NewRequestWithContext(ctx, "GET", healthzURL, nil)
		if err != nil {
			return fmt.Errorf("failed to create request: %w", err)
		}
		// The HTTP client already has auth configured.
		resp, err := httpClient.Do(httpReq)
		if err != nil {
			return fmt.Errorf("API server unreachable: %w", err)
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("API server returned status %d", resp.StatusCode)
		}
		return nil
	}
}
```
```go
// Setting up the healthcheck with the controller manager
if err := mgr.AddHealthzCheck("healthz", APIConnectionCheck(mgr)); err != nil {
	setupLog.Error(err, "unable to set up health check")
	os.Exit(1)
}
```
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale