Concourse gets into a restart loop if the web nodes take too long to start up
In our large-scale environment, the web nodes get into a restart loop whenever we upgrade or simply restart the web nodes.
In our case this usually happens on upgrades: the upgraded web nodes show a status of CrashLoopBackOff, briefly switch to Running, and then eventually fall back to CrashLoopBackOff. There is usually one web node that stays up and running, which we assume is the old node kept around for the rolling deploy.
The failure we see on the crashing web nodes is Liveness probe failed: Get http://<ip>:80/api/v1/info: dial tcp <ip>:80: connect: connection refused, which led us to think that because the web nodes were taking so long to come up (possibly due to migrations), the liveness probe started before the API was serving and killed the pod for not responding. We eventually configured initialDelaySeconds on the liveness probe so that it only starts checking the health of the web node after 5 minutes, and this fixed the crashing web nodes.
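For reference, this is roughly the workaround we applied. It is a minimal sketch, assuming the chart lets us override the web container's liveness probe fields via values; the periodSeconds and failureThreshold shown here are illustrative, only the 5-minute initial delay is the part that mattered for us:

```yaml
# Liveness probe on the web container, with a long initial delay so the
# probe does not start firing while migrations are still running.
livenessProbe:
  httpGet:
    path: /api/v1/info
    port: 80
  initialDelaySeconds: 300   # wait 5 minutes before the first check
  periodSeconds: 15          # illustrative
  failureThreshold: 5        # illustrative
```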
Having to configure initialDelaySeconds on the liveness probe isn't an optimal solution to the problem of slow migrations. In most cases the web nodes start fairly quickly, so a long initial delay means k8s takes much longer to determine whether the web nodes are healthy after they start up. Maybe we can look into configuring a default startupProbe (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe) that allows for slower startup of web nodes due to slow migrations? Reading the docs (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes), it seems we can configure failureThreshold * periodSeconds to be long enough to cover the worst-case startup time.
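Something along these lines is what I have in mind; the numbers are illustrative and would need to be tuned to the worst-case migration time. While the startupProbe is failing, the liveness probe is held off, and once the startupProbe succeeds once, the liveness probe takes over with its normal short timings:

```yaml
# Sketch of a startupProbe that tolerates slow migrations: with
# failureThreshold * periodSeconds = 60 * 10 = 600s, the web node has up
# to 10 minutes to come up before Kubernetes restarts it.
startupProbe:
  httpGet:
    path: /api/v1/info
    port: 80
  periodSeconds: 10
  failureThreshold: 60
# Once the startupProbe has succeeded, the liveness probe runs with its
# usual tight settings, so unhealthy web nodes are still caught quickly.
livenessProbe:
  httpGet:
    path: /api/v1/info
    port: 80
  periodSeconds: 15
  failureThreshold: 5
```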
A possible solution for this may be to run the concourse migration command in an initContainer on the web deployment. Init containers don't support any of the probes; per the Kubernetes docs, "init containers do not support lifecycle, livenessProbe, readinessProbe, or startupProbe because they must run to completion before the Pod can be ready." That means the migration can take as long as it needs, and we don't run into this issue where our web pods end up in a CrashLoop while trying to apply the most recent migration.
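A rough sketch of what that could look like on the web pod spec, assuming the web image ships a concourse migrate subcommand that reads the same CONCOURSE_POSTGRES_* settings as concourse web (that is an assumption; the exact arguments are version-dependent and should be checked against concourse migrate --help):

```yaml
initContainers:
  - name: migrate-db
    # Same image/tag as the web container so the migration version matches.
    image: concourse/concourse
    # Exact arguments are version-dependent; check `concourse migrate --help`.
    command: ["concourse", "migrate"]
    env:
      # Assumed to mirror the database settings already passed to `concourse web`.
      - name: CONCOURSE_POSTGRES_HOST
        value: concourse-postgresql            # hypothetical service name
      - name: CONCOURSE_POSTGRES_USER
        value: concourse
      - name: CONCOURSE_POSTGRES_PASSWORD
        valueFrom:
          secretKeyRef:
            name: concourse-postgresql         # hypothetical secret
            key: postgresql-password
```

One caveat with this approach: every web pod would run the init container on every restart, so the migration step still sits on the startup path of each pod, it just can't be killed by a probe.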
I think a Helm hook is the answer to this: the chart could have a Job that runs the migration as a "pre-upgrade" hook.
https://helm.sh/docs/topics/charts_hooks/
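A sketch of what that Job could look like, under the same assumptions as the initContainer idea above (the concourse migrate arguments and the concourse-web-secrets secret holding the CONCOURSE_POSTGRES_* settings are placeholders). Because it is a pre-upgrade hook, Helm waits for the Job to succeed before applying the upgraded web Deployment, so the new web pods never start against an un-migrated database:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: concourse-migrate-db
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    # Clean up old hook Jobs and remove this one once it succeeds.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate-db
          image: concourse/concourse
          # Exact arguments are version-dependent; see `concourse migrate --help`.
          command: ["concourse", "migrate"]
          envFrom:
            # Hypothetical secret holding the CONCOURSE_POSTGRES_* settings.
            - secretRef:
                name: concourse-web-secrets
```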