Increase start_period for default healthcheck

Open jaydrogers opened this issue 6 months ago • 1 comments

Background

During zero-downtime deployments, the service sometimes fails to start:

[09-Jun-2025 17:16:27] NOTICE: fpm is running, pid 131
[09-Jun-2025 17:16:27] NOTICE: ready to handle connections
curl: (7) Failed to connect to localhost port 8080 after 1 ms: Could not connect to server
❌ There seems to be a failure in checking the NGINX + PHP-FPM.
curl: (7) Failed to connect to localhost port 8080 after 1 ms: Could not connect to server
HTTP Status Code: 000
::1 - - [09/Jun/2025:17:16:30 +0000] "GET /up HTTP/2.0" 200 1936 "-" "curl/8.12.1" "-"
✅ NGINX + PHP-FPM is running correctly.
::1 - - [09/Jun/2025:17:16:33 +0000] "GET /up HTTP/2.0" 200 1936 "-" "curl/8.12.1" "-"
::1 - - [09/Jun/2025:17:16:38 +0000] "GET /up HTTP/2.0" 200 1936 "-" "curl/8.12.1" "-"

This is most likely caused by the Laravel auto-runs, which can delay the point at which the container is ready to answer health checks.

From a user on Discord:

did more digging, finally got a deployment to work, but now I've ran into an old issue that subsequent deployments will always fail and then get rolled back

the container typically stays alive for about 30 seconds, gets through most of the Laravel automations, and sometimes gives the FPM error I shown above before being reported as a failure and being rolled back

I am still at a loss as to what's causing this, at first I thought it could be fpm printing to stderr, but that wasn't the case

it even sometimes starts giving successful health checks before it gets rolled back

docker inspect implies the container is exiting with code 137 (SIGTERM?)

Problem

Proposed solution

  • Set the start_period to 30s

From the official docs:

start period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries. However, if a health check succeeds during the start period, the container is considered started and all consecutive failures will be counted towards the maximum number of retries.
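The rule quoted above can be sketched as a small state machine. This is only a simplified model of the documented behaviour, not Docker's actual implementation; the names `Probe` and `health_status` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    at: float  # seconds since container start (illustrative)
    ok: bool   # True if the health command exited with 0


def health_status(probes, start_period, retries):
    """Replay probes in order and return 'starting', 'healthy' or 'unhealthy'."""
    status = "starting"
    failures = 0      # current streak of counted failures
    started = False   # set by the first successful probe

    for probe in probes:
        if probe.ok:
            failures = 0
            started = True
            status = "healthy"
            continue
        # Failures inside the start period are ignored, but only until
        # the first success: after that the container counts as started.
        if not started and probe.at < start_period:
            continue
        failures += 1
        if failures >= retries:
            status = "unhealthy"
    return status


slow_boot = [Probe(5, False), Probe(10, False), Probe(15, False)]
print(health_status(slow_boot, start_period=0, retries=3))   # unhealthy
print(health_status(slow_boot, start_period=30, retries=3))  # starting
```

With no start period, three early failures are enough to mark the container unhealthy, which matches the rollback described above. With a 30s start period the same failures are ignored.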

Outcome

What did you expect?

  • If AUTORUN_ENABLED is true, we should give enough time for these services to start.

What happened instead?

  • The container shuts down

Affected Docker Images

All except CLI

Anything else?

Related Discord Message

https://discord.com/channels/910287105714954251/910299290230997003/1381686793522380850

jaydrogers, Jun 10 '25 19:06

Setting a non-zero value for the start_period would be nice to have.

It's hard to say what a good value would be from the get-go. Could this be exposed through an environment variable such as HEALTHCHECK_START_PERIOD so it can be customised? Would it even be customisable once the image is built?
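On the last question: as far as I know, the options on a HEALTHCHECK instruction are stored in the image metadata at build time, so an environment variable set at runtime would not change them by itself. They can be overridden when the container is created, though, for example with `docker run --health-start-period` or the `healthcheck.start_period` key in Compose. The sketch below maps the HEALTHCHECK_START_PERIOD variable suggested above onto that flag. It only builds the command line. The helper name and the image name are placeholders, and this is not necessarily how the project would implement it.

```python
import os


def docker_run_args(image, env=None):
    """Build a `docker run` argument list, overriding the image's start period
    when HEALTHCHECK_START_PERIOD (hypothetical variable) is set."""
    env = os.environ if env is None else env
    args = ["docker", "run", "--detach"]

    start_period = env.get("HEALTHCHECK_START_PERIOD")
    if start_period:
        # --health-start-period overrides the value baked into the image.
        args.append(f"--health-start-period={start_period}")

    args.append(image)
    return args


# "example/php-app:latest" is a placeholder image name.
print(docker_run_args("example/php-app:latest",
                      env={"HEALTHCHECK_START_PERIOD": "30s"}))
```

The returned list can be passed to `subprocess.run` as-is. When the variable is unset, the image's built-in healthcheck settings stay in effect.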

Alternatively, it might require some playing around with the other healthcheck settings too.

One suggestion for new defaults could perhaps be:

HEALTHCHECK --interval=10s --timeout=3s --retries=5 --start-period=10s

This gives a 10s grace period, keeps the 3s timeout the same, and stretches the check interval out to every 10 seconds. Additionally, it would now take 5 attempts to verify container health instead of 3. I think this would be a happy in-between: a short and snappy start period, while a slower startup still has time to actually start.

It would, however, mean that a container is only considered unhealthy after 5 consecutive failed checks, which would take around 1 minute in total. So if you have any alerts based on container health, you might wait about 1 minute by default before an alert comes through, instead of 15-20 seconds.
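The arithmetic behind those figures can be sketched roughly as follows. It assumes one probe per interval and that each failing probe takes somewhere between no time and the full timeout, and it ignores scheduling details. `time_to_unhealthy` is an illustrative helper, and the 5s interval for the current default is an assumption inferred from the "15-20 seconds" figure, not something stated in this thread.

```python
def time_to_unhealthy(interval, retries, timeout=0):
    """Rough (fastest, slowest) seconds until `retries` consecutive failures."""
    fastest = retries * interval              # every probe fails instantly
    slowest = retries * (interval + timeout)  # every probe runs into the timeout
    return fastest, slowest


# Proposed defaults: --interval=10s --timeout=3s --retries=5
print(time_to_unhealthy(interval=10, retries=5, timeout=3))  # (50, 65)

# Assumed current defaults: 5s interval, 3s timeout, 3 retries
print(time_to_unhealthy(interval=5, retries=3, timeout=3))   # (15, 24)
```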

EDIT: another good option to set explicitly here is --start-interval=DURATION. This is the interval between checks during the startup grace period of a container. You could potentially set your parameters like so:

HEALTHCHECK --start-period=60s --start-interval=3s --interval=10s --timeout=3s --retries=3

We can set a longer start period while keeping the startup check interval short, so once the container is live and accepting requests, it'll be marked as healthy within 3 seconds. Knowing this, we can keep our standard healthcheck interval at 10 seconds and set our retries back to 3. This would mean around 30 seconds before a previously healthy container gets marked as unhealthy, and a container that never becomes healthy would be marked as unhealthy roughly 90s after boot.
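A rough sketch of that timeline, assuming probes return instantly and fire exactly on schedule (the function names are illustrative). Note that, as far as I know, --start-interval needs Docker Engine 25.0 or later.

```python
import math


def marked_healthy_at(ready_at, start_interval):
    """Time of the first startup probe at or after the app becomes ready.

    Assumes ready_at falls inside the start period, where probes run
    every start_interval seconds."""
    ticks = max(1, math.ceil(ready_at / start_interval))
    return ticks * start_interval


def unhealthy_from_boot(start_period, interval, retries):
    """Approximate time until a container that never becomes healthy is
    marked unhealthy: failures only count once the start period is over."""
    return start_period + retries * interval


# --start-period=60s --start-interval=3s --interval=10s --retries=3
print(marked_healthy_at(ready_at=20, start_interval=3))               # 21
print(unhealthy_from_boot(start_period=60, interval=10, retries=3))   # 90
```

An app that is ready 20s after boot would be picked up by the probe at 21s, so the wait is never more than one start interval.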

aSeriousDeveloper, Jun 10 '25 20:06

The solution proposed in #547 has been merged and will be made available in v4.0 🥳

  • #283

jaydrogers, Oct 02 '25 15:10