[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy
Description
It seems that occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait fails with an unhealthy error message. This happens when docker compose up -d --wait is run in parallel for several projects and the service uses the restart: unless-stopped policy. Note that this happens only occasionally, not every time.
I would hope that even if the container is unhealthy and crashes on start, --wait would account for this, since the container eventually becomes healthy after restarting itself, as long as that happens within the timeout period.
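To make that expectation concrete, here is a rough shell sketch of the tolerance I have in mind (the container name relay-relay-1 and the 60-second deadline are just placeholders): keep polling the health status and only give up once the deadline passes, even if the container restarted in between.
deadline=$((SECONDS + 60))   # placeholder deadline
status=starting
while [ "$SECONDS" -lt "$deadline" ]; do
  # ask the engine for the current health status; tolerate the container briefly not existing mid-restart
  status=$(docker inspect --format '{{.State.Health.Status}}' relay-relay-1 2>/dev/null || echo unknown)
  [ "$status" = healthy ] && break
  sleep 2
done
echo "final health status: $status"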
Steps To Reproduce
I have 3 config files like so:
docker-compose-redis:
services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:
docker-compose-kafka:
services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1001@127.0.0.1:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
docker-compose-relay:
services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:
When I run the following:
# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!
# Wait for all up commands to complete
wait $kafka_pid $redis_pid $relay_pid
Relay sometimes fails to come up with the --wait flag, even though the container's Docker status is technically healthy.
Logs:
Container relay-relay-1 Creating
Container relay-relay-1 Created
Container relay-relay-1 Starting
Container relay-relay-1 Started
Container relay-relay-1 Waiting
container relay-relay-1 is unhealthy
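A quick way to confirm that the container really did recover (assuming the default container name relay-relay-1) is to inspect its health status and restart count right after the failed --wait:
docker inspect --format 'health={{.State.Health.Status}} restarts={{.RestartCount}}' relay-relay-1
# when the bug hits, this prints something like: health=healthy restarts=1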
Compose Version
2.29.7
Docker Environment
Client:
 Version: 27.2.0
 Context: colima
Anything else?
Let me know if there is anything else I can add to help reproduce the issue. The contents of the relay configs can be found here: https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
Compose polls the Engine API to check that the container reaches the "healthy" state. But if it detects a container crash, I would not expect it to silently ignore that and let the container restart. IMHO the bug you describe should get the opposite fix: Compose should always detect that a container crashed and then at least warn the user, or stop.
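For illustration, a crash can already be detected from the CLI regardless of the container's current health (the container name and time window below are only examples):
# list 'die' events for the container over the last 10 minutes
docker events --since 10m --until "$(date +%s)" --filter 'container=relay-relay-1' --filter 'event=die'
# or simply check how often the restart policy kicked in
docker inspect --format '{{.RestartCount}}' relay-relay-1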