[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy
Description
It seems that occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait fails with an unhealthy error message. This happens when docker compose up -d --wait is run in parallel for several projects and the service uses the restart: unless-stopped policy. Note that this happens only occasionally, not every time.
I would hope that even if the container is unhealthy and crashes on start, --wait would account for this, since the container eventually becomes healthy after restarting itself, as long as that happens within the timeout period.
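To make that expectation concrete, here is a rough shell sketch of the tolerance I have in mind (the container name relay-relay-1 and the 60-second deadline are just placeholders): keep polling the health status and only give up once the deadline passes, even if the container restarted in between.
deadline=$((SECONDS + 60))   # placeholder deadline
status=starting
while [ "$SECONDS" -lt "$deadline" ]; do
  # ask the engine for the current health status; tolerate the container briefly not existing mid-restart
  status=$(docker inspect --format '{{.State.Health.Status}}' relay-relay-1 2>/dev/null || echo unknown)
  [ "$status" = healthy ] && break
  sleep 2
done
echo "final health status: $status"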
Steps To Reproduce
I have 3 config files like so:
docker-compose-redis:
services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:
docker-compose-kafka:
services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1001@127.0.0.1:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
docker-compose-relay:
services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:
When I run the following:
# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!
# Wait for all up commands to complete
wait $kafka_pid $redis_pid $relay_pid
Relay sometimes fails to come up with the --wait flag, even though the container's Docker status is technically healthy.
Logs:
Container relay-relay-1 Creating
Container relay-relay-1 Created
Container relay-relay-1 Starting
Container relay-relay-1 Started
Container relay-relay-1 Waiting
container relay-relay-1 is unhealthy
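A quick way to confirm that the container really did recover (assuming the default container name relay-relay-1) is to inspect its health status and restart count right after the failed --wait:
docker inspect --format 'health={{.State.Health.Status}} restarts={{.RestartCount}}' relay-relay-1
# when the bug hits, this prints something like: health=healthy restarts=1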
Compose Version
2.29.7
Docker Environment
Client:
 Version: 27.2.0
 Context: colima
Anything else?
Let me know if there is anything else I can add to help reproduce the issue. The contents of the relay configs can be found here: https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
Compose polls the Engine API to check that the container reaches the "healthy" state. But if it detects a container crash, I would not expect it to silently ignore that and let the container restart. IMHO the bug you describe should get the opposite fix: Compose should always detect that a container crashed and then at least warn the user, or stop.
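For illustration, a crash can already be detected from the CLI regardless of the container's current health (the container name and time window below are only examples):
# list 'die' events for the container over the last 10 minutes
docker events --since 10m --until "$(date +%s)" --filter 'container=relay-relay-1' --filter 'event=die'
# or simply check how often the restart policy kicked in
docker inspect --format '{{.RestartCount}}' relay-relay-1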