
autohttp's `secret-sync` container restarting leads to unready pod and disruption of network traffic

Open matthew-brett opened this issue 4 years ago • 3 comments

My z2jh GKE cluster stalled badly under fairly mild load today, giving "service refused" errors.

The autohttps-.... pod reported many restarts, and `kubectl describe` showed that these were entirely due to restarts in the `secret-sync` container, with no restarts in the `traefik` container.

`kubectl logs --previous autohttps-9fdcfc86c-9jdwx secret-sync` started with these lines:

```
2021-05-10 09:29:19,247 INFO /usr/local/bin/acme-secret-sync.py watch-save --label=app=jupyterhub --label=release=jhub --label=chart=jupyterhub-0.11.1 --label=heritage=secret-sync proxy-public-tls-acme acme.json /etc/acme/acme.json
2021-05-10 09:30:24,876 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff5883de2b0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/jhub/secrets/proxy-public-tls-acme
```

I noticed similar errors, and some restarts in the hub pod:

```
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f258683d370>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/jhub/pods?fieldSelector=&labelSelector=component%3Dsingleuser-server
```

This led to an error in the hub, and the user-scheduler pods also showed restarts and errors:

```
E0510 09:27:20.654039       1 leaderelection.go:325] error retrieving resource lock jhub/user-scheduler-lock: Get "https://10.92.0.1:443/api/v1/namespaces/jhub/endpoints/user-scheduler-lock?timeout=10s": dial tcp 10.92.0.1:443: connect: connection refused
```

Traefik image reported as `traefik:v2.3.7`.

Helm chart is 0.11.1.

Erik Sundell commented over on Gitter:

> Note that it may not be a problem that the container restarts in practice.
>
> It is a problem if the pod isn't ready during that process, though.
>
> Then no network traffic is accepted.
>
> But the container is only relevant on startup.

matthew-brett avatar May 11 '21 16:05 matthew-brett

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] avatar May 11 '21 16:05 welcome[bot]

@matthew-brett the root cause seems to be unreliable access to the k8s api-server, either because of networking or because the api-server is struggling. Solving the root cause is out of scope for the z2jh helm chart development, but I think it is in scope to make the secret-sync sidecar container in the autohttps pod not crash and restart when k8s api-server requests fail. Instead, the secret-sync container should log the failure, wait, and retry.
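As a rough illustration of the kind of error handling meant here, the sketch below wraps a Secret update in a log-wait-retry loop. It assumes the official `kubernetes` Python client and in-cluster configuration; the function and variable names (and the `acme.json` Secret data key) are illustrative and not taken from the real `acme-secret-sync.py`.

```python
# Hypothetical sketch of log-wait-retry error handling for the Secret update.
# Not the actual acme-secret-sync.py implementation.
import base64
import logging
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException
from urllib3.exceptions import HTTPError

logger = logging.getLogger(__name__)


def patch_secret_with_retry(v1, namespace, name, body, attempts=5, delay=5):
    """Patch a Secret, logging and retrying instead of crashing on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return v1.patch_namespaced_secret(name, namespace, body)
        except (ApiException, HTTPError) as e:
            logger.warning(
                "Failed to update Secret %s/%s (attempt %d/%d): %s",
                namespace, name, attempt, attempts, e,
            )
            time.sleep(delay)
    raise RuntimeError(f"Giving up on updating Secret {namespace}/{name}")


if __name__ == "__main__":
    config.load_incluster_config()  # assumes this runs inside the pod
    v1 = client.CoreV1Api()
    with open("/etc/acme/acme.json", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    # "acme.json" as the Secret data key is an assumption for this sketch
    patch_secret_with_retry(
        v1, "jhub", "proxy-public-tls-acme", {"data": {"acme.json": encoded}}
    )
```

With something like this, a temporarily unreachable api-server produces warnings in the logs while the container stays up, instead of the crash loop that currently makes the whole pod unready.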

Action point to mitigate root cause

  • [ ] Make the autohttps pod's secret-sync container include error handling for failed k8s api-requests.

Background info about secret-sync

The secret-sync container is responsible for stashing away the TLS certificate that Traefik acquires and stores in a local file, copying it into a k8s Secret. It is also responsible for loading that Secret back into the file on startup, which it does as well when run separately as an initContainer.
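For readers unfamiliar with that flow, here is a much-simplified sketch of the two directions, again assuming the `kubernetes` Python client; the function names, Secret data key, and error handling are illustrative rather than copied from `images/secret-sync/acme-secret-sync.py`.

```python
# Simplified, illustrative sketch of the two sync directions; not the
# actual acme-secret-sync.py implementation.
import base64
import os
import time

from kubernetes.client.rest import ApiException


def load_from_secret(v1, namespace, secret_name, key, path):
    """On startup: restore the local cert file from the Secret, if present."""
    try:
        secret = v1.read_namespaced_secret(secret_name, namespace)
    except ApiException as e:
        if e.status == 404:
            return  # no Secret yet; Traefik will acquire a fresh certificate
        raise
    data = (secret.data or {}).get(key)
    if data:
        with open(path, "wb") as f:
            f.write(base64.b64decode(data))


def watch_and_save(v1, namespace, secret_name, key, path, interval=30):
    """Afterwards: whenever the file changes, copy it into the Secret."""
    last_mtime = None
    while True:
        if os.path.exists(path):
            mtime = os.path.getmtime(path)
            if mtime != last_mtime:
                with open(path, "rb") as f:
                    encoded = base64.b64encode(f.read()).decode("ascii")
                # assumes the Secret already exists and can be patched
                v1.patch_namespaced_secret(
                    secret_name, namespace, {"data": {key: encoded}}
                )
                last_mtime = mtime
        time.sleep(interval)


if __name__ == "__main__":
    from kubernetes import client, config

    config.load_incluster_config()
    api = client.CoreV1Api()
    args = ("jhub", "proxy-public-tls-acme", "acme.json", "/etc/acme/acme.json")
    load_from_secret(api, *args)
    watch_and_save(api, *args)
```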

Quick input on practical workarounds to root cause

On GKE, you can have a single k8s api-server (zonal cluster) or three running in parallel (regional cluster). If the issue is that the k8s api-server is not responding due to load, then perhaps having a regional cluster makes sense.

The failure could also be related to networking, for example a CNI (Calico or other) that is acting up, or Istio routing network traffic and misbehaving for some reason.

I'll consider this out of scope for further discussion in this GitHub issue though.

consideRatio avatar May 11 '21 16:05 consideRatio

A PR to update https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/main/images/secret-sync/acme-secret-sync.py to be more reliable, providing warnings on failures instead of crashing, would be appreciated.

consideRatio avatar May 16 '22 17:05 consideRatio