Increase start_period for default healthcheck
Background
During zero-downtime deployments, the service sometimes fails to start:
[09-Jun-2025 17:16:27] NOTICE: fpm is running, pid 131
[09-Jun-2025 17:16:27] NOTICE: ready to handle connections
curl: (7) Failed to connect to localhost port 8080 after 1 ms: Could not connect to server
❌ There seems to be a failure in checking the NGINX + PHP-FPM.
curl: (7) Failed to connect to localhost port 8080 after 1 ms: Could not connect to server
HTTP Status Code: 000
::1 - - [09/Jun/2025:17:16:30 +0000] "GET /up HTTP/2.0" 200 1936 "-" "curl/8.12.1" "-"
✅ NGINX + PHP-FPM is running correctly.
::1 - - [09/Jun/2025:17:16:33 +0000] "GET /up HTTP/2.0" 200 1936 "-" "curl/8.12.1" "-"
::1 - - [09/Jun/2025:17:16:38 +0000] "GET /up HTTP/2.0" 200 1936 "-" "curl/8.12.1" "-"
This appears to come mainly from the Laravel auto-runs.
From a user on Discord:
did more digging, finally got a deployment to work, but now I've run into an old issue where subsequent deployments will always fail and then get rolled back
the container typically stays alive for about 30 seconds, gets through most of the Laravel automations, and sometimes gives the FPM error I showed above before being reported as a failure and rolled back
I am still at a loss as to what's causing this; at first I thought it could be fpm printing to stderr, but that wasn't the case
it even sometimes starts giving successful health checks before it gets rolled back
docker inspect implies the container is exiting with code 137 (SIGKILL)
Problem
- Our containers are using the default start_period of 0s, which could create a lot of headaches
Proposed solution
- Set the start_period to 30s
From the official docs:
start period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries. However, if a health check succeeds during the start period, the container is considered started and all consecutive failures will be counted towards the maximum number of retries.
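As a rough sketch, the image's HEALTHCHECK instruction would pick up the extra flag along these lines (the interval/timeout/retries values and the curl command below are assumptions for illustration, not necessarily the image's actual defaults):

HEALTHCHECK --interval=5s --timeout=3s --retries=3 --start-period=30s \
  CMD curl --fail http://localhost:8080/up || exit 1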
Outcome
What did you expect?
- If AUTORUN_ENABLED is true, we should give enough time for these services to start.
What happened instead?
- The container shuts down
Affected Docker Images
All except CLI
Anything else?
Related Discord Message
https://discord.com/channels/910287105714954251/910299290230997003/1381686793522380850
Setting a non-zero value for the start_period would be nice to have.
It's hard to say what would be a good value to use from the get-go. Would this be something where we'd have an environment variable such as HEALTHCHECK_START_PERIOD to customise it? Would that even be customisable once the image is built?
Alternatively, it might require some playing around with the other healthcheck settings too.
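For what it's worth, the baked-in healthcheck can be overridden at runtime without rebuilding the image, so per-deployment tuning is possible even without a dedicated environment variable. A minimal sketch with docker run (the image name and check command are placeholders):

docker run -d \
  --health-cmd "curl --fail http://localhost:8080/up || exit 1" \
  --health-interval 10s \
  --health-timeout 3s \
  --health-retries 5 \
  --health-start-period 30s \
  my-laravel-app:latest

The same fields can be set under healthcheck: in a Compose file. An environment variable like HEALTHCHECK_START_PERIOD would still be nicer for people who just want the default behaviour adjusted.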
One suggestion for new defaults could perhaps be:
HEALTHCHECK --interval=10s --timeout=3s --retries=5 --start-period=10s
This gives a 10s grace period, keeps the 3s timeout the same, but also stretches out the check interval to every 10 seconds. Additionally, there would now be 5 attempts to verify container health instead of 3. I think this would be a happy in-between: a short and snappy start period, while still making sure a slower startup has time to actually finish.
It would, however, mean that a container is only considered unhealthy after 5 consecutive failed checks, which would end up being around 1 minute total. That means if you have any alerts based on container health, you might have a 1 minute wait by default before that alert comes through, instead of 15-20 seconds.
EDIT: another good flag to include here explicitly is --start-interval=DURATION. This is the interval between checks during the startup grace period of a container. You could potentially set your parameters like so:
HEALTHCHECK --start-period=60s --start-interval=3s --interval=10s --timeout=3s --retries=3
We can set a longer start period while keeping the startup check interval short, so once the container is live and accepting requests, it'll be marked as healthy within 3 seconds. Knowing this, we can keep our standard healthcheck interval at 10 seconds and set our retries back to 3. This would mean around 30 seconds before a previously healthy container gets marked as unhealthy, and a container that never comes up has ~90s from boot before it is marked as unhealthy.
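If that direction is taken, the deploy-time equivalent could presumably look like the Compose sketch below (the test command is a placeholder; note that start_interval is a newer option and, as far as I know, needs Docker Engine 25.0+ and a recent Compose version):

healthcheck:
  test: ["CMD", "curl", "--fail", "http://localhost:8080/up"]
  start_period: 60s
  start_interval: 3s
  interval: 10s
  timeout: 3s
  retries: 3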
The solution proposed in #547 has been merged and will be made available in v4.0 🥳
- #283