
Potential regression due to healthcheck wait strategy change?

Open jalaziz opened this issue 1 year ago • 16 comments

Expected Behaviour

We have a docker compose stack that we manage through testcontainers. Prior to v10.11.0, the stack came up normally.

Actual Behaviour

With v10.11.0, the stack fails to come up with a health check timeout.

Testcontainer Logs

Error: Health check not healthy after 60000ms

    at Timeout._onTimeout (/home/runner/work/monorepo/monorepo/node_modules/.pnpm/testcontainers@10.11.0/node_modules/testcontainers/src/wait-strategies/health-check-wait-strategy.ts:18:16)

Steps to Reproduce

  1. Setup a docker compose file with health checks.
  2. Call await new DockerComposeEnvironment(composeFilePath, 'docker-compose.yml').withWaitStrategy('foo-1', Wait.forHealthCheck()).up(); (see the sketch after these steps).
  3. Observe the timeout.
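
For clarity, here is a self-contained sketch of step 2. The compose file path and the 'foo-1' service/container name are placeholders for whatever your compose file defines:

import { DockerComposeEnvironment, Wait } from 'testcontainers';

const composeFilePath = '.'; // directory containing docker-compose.yml (placeholder)
const environment = await new DockerComposeEnvironment(composeFilePath, 'docker-compose.yml')
  .withWaitStrategy('foo-1', Wait.forHealthCheck())
  .up();

// ... use the stack ...
await environment.down();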

Environment Information

  • Operating System:
  • Docker Version:
  • Node version:
  • Testcontainers version: v10.11.0

jalaziz avatar Jul 30 '24 18:07 jalaziz

I'm still investigating this, but filing this a bit prematurely in case it's a known issue.

jalaziz avatar Jul 30 '24 18:07 jalaziz

Hi @jalaziz, thanks for raising.

I am guessing that the health event emitted isn't in an expected format. Which container runtime and version are you using?

Could you please run docker events --filter 'event=health_status' and then run a test that uses the health check wait strategy and share the results here? Here is an example output for me when using Docker Desktop:

2024-07-30T19:48:59.597026589+01:00 container health_status: healthy 87d2f1f9bf9d989e9558eb8ca0e31d6bfe962a8824b71be94218fb84ad749fa3 (image=localhost/d6cef552d629:ed5344c0de7a, name=gallant_fermat, org.testcontainers=true, org.testcontainers.lang=node, org.testcontainers.session-id=dd904877cd97, org.testcontainers.version=10.10.4)

Note specifically health_status: healthy.

cristianrgreco avatar Jul 30 '24 18:07 cristianrgreco

Sorry for the delay. What's interesting is that I can't reproduce this issue locally, but I can on GitHub Actions.

I compared docker events --filter 'event=health_status' both locally and from the workflow output and they are nearly identical....

Local example:

2024-07-31T01:42:05.349003000-07:00 container health_status: healthy ab974410cb1bf9f9021fd1c359b56386df7a237e62727a01f388123e1950c074 (com.docker.compose.config-hash=39cf13a37d278ef5d5ecc309b03da7f6bc3bbf96cfee3f488d3dffd944a89031, com.docker.compose.container-number=1, com.docker.compose.depends_on=kafka:service_healthy:false,redis:service_healthy:false, com.docker.compose.image=sha256:82fc7c31927579d70f3f1cdd935dd5ef49917c24ce85b43131131e323fa60000, com.docker.compose.oneoff=False, com.docker.compose.project=testcontainers-adabb482d4d7, com.docker.compose.project.config_files=/Users/jalaziz/Development/foo/monorepo/docker-compose.yml, com.docker.compose.project.working_dir=/Users/jalaziz/Development/foo/monorepo, com.docker.compose.service=risk, com.docker.compose.version=2.29.1, image=testcontainers-adabb482d4d7-risk, name=testcontainers-adabb482d4d7-risk-1)

GitHub Actions example:

2024-08-01T19:02:29.234771961Z container health_status: healthy c3cb620da7e3e2282b2e94576e1cacdafebad2c27cc0b7c18f18d9e47c1ccf61 (com.docker.compose.config-hash=39cf13a37d278ef5d5ecc309b03da7f6bc3bbf96cfee3f488d3dffd944a89031, com.docker.compose.container-number=1, com.docker.compose.depends_on=kafka:service_healthy:false,redis:service_healthy:false, com.docker.compose.image=sha256:82fc7c31927579d70f3f1cdd935dd5ef49917c24ce85b43131131e323fa60000, com.docker.compose.oneoff=False, com.docker.compose.project=testcontainers-2f677b00ed09, com.docker.compose.project.config_files=/home/runner/work/monorepo/monorepo/docker-compose.yml, com.docker.compose.project.working_dir=/home/runner/work/monorepo/monorepo, com.docker.compose.service=risk, com.docker.compose.version=2.27.2, image=testcontainers-2f677b00ed09-risk, name=testcontainers-2f677b00ed09-risk-1)

Here's the runtime info from GHA:

Client: Docker Engine - Community
   Version:           26.1.3
   API version:       1.45
   Go version:        go1.21.10
   Git commit:        b72abbb
   Built:             Thu May 16 08:33:29 2024
   OS/Arch:           linux/amd64
   Context:           default

jalaziz avatar Aug 02 '24 08:08 jalaziz

This is still a problem with the latest testcontainers release.

Doing some more debugging, it looks like a race condition: the containers become healthy well before testcontainers begins checking the event stream.

Here's the relevant logs with timestamps:

2024-09-24T06:23:14.500Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Creating
2024-09-24T06:23:14.534Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Created
2024-09-24T06:23:26.682Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Starting
2024-09-24T06:23:26.995Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Started
2024-09-24T06:23:26.996Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Waiting
2024-09-24T06:23:26.996Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Waiting
2024-09-24T06:23:28.496Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Healthy
2024-09-24T06:23:28.497Z testcontainers:compose Container testcontainers-353d31fc9278-risk-1  Healthy
2024-09-24T06:23:38.555Z testcontainers [DEBUG] [dd5521f308bf] Getting container by ID...
2024-09-24T06:23:38.555Z testcontainers [DEBUG] [dd5521f308bf] Got container by ID
2024-09-24T06:23:38.555Z testcontainers [DEBUG] [dd5521f308bf] Inspecting container...
2024-09-24T06:23:38.568Z testcontainers [DEBUG] [dd5521f308bf] Inspected container
2024-09-24T06:23:38.568Z testcontainers [DEBUG] [dd5521f308bf] Fetching container logs...
2024-09-24T06:23:38.570Z testcontainers [DEBUG] [dd5521f308bf] Demuxing stream...
2024-09-24T06:23:38.570Z testcontainers [DEBUG] [dd5521f308bf] Demuxed stream
2024-09-24T06:23:38.570Z testcontainers [DEBUG] [dd5521f308bf] Fetched container logs
2024-09-24T06:23:38.570Z testcontainers [DEBUG] [dd5521f308bf] Waiting for container to be ready...
2024-09-24T06:23:38.570Z testcontainers [DEBUG] [dd5521f308bf] Waiting for health check...
2024-09-24T06:23:38.570Z testcontainers [DEBUG] [dd5521f308bf] Fetching event stream...
2024-09-24T06:23:38.668Z testcontainers [DEBUG] [dd5521f308bf] Fetched event stream...
2024-09-24T06:24:38.669Z testcontainers [ERROR] [dd5521f308bf] Health check not healthy after 60000ms
2024-09-24T06:24:38.669Z testcontainers [ERROR] [dd5521f308bf] Container failed to be ready: Error: Health check not healthy after 60000ms
2024-09-24T06:24:38.669Z testcontainers [DEBUG] [dd5521f308bf] Stopping container...
2024-09-24T06:24:39.041Z testcontainers [DEBUG] [dd5521f308bf] Stopped container
2024-09-24T06:24:39.041Z testcontainers [DEBUG] [dd5521f308bf] Removing container...
2024-09-24T06:24:39.054Z testcontainers [DEBUG] [dd5521f308bf] Removed container

jalaziz avatar Sep 24 '24 23:09 jalaziz

It gets even more annoying. Docker documentation suggests that there is a backlog of 1000 events, but in reality, the backlog is only 256 events.

If testcontainers doesn't subscribe in time, it will miss the healthcheck events.

It seems that using events for healthchecks with docker compose can only work if you subscribe to the event stream before spinning up the containers.

The current logic makes this tricky, as it requires a container ID. The API does support filtering events by container name, so it's theoretically possible to subscribe before starting the containers, but it seems like it would be a pretty big lift to restructure the health check wait strategy to subscribe to events before containers are started.
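
As a rough sketch of what subscribing by name up-front could look like (using dockerode directly; the container name is a made-up placeholder, and this is not how testcontainers is structured today):

import Docker from 'dockerode';

const docker = new Docker();

// Open the event stream *before* `docker compose up`, filtered by container name,
// so the healthy event can't be missed even if it fires immediately.
const stream = await docker.getEvents({
  filters: JSON.stringify({
    type: ['container'],
    event: ['health_status'],
    container: ['testcontainers-xxxx-risk-1'], // placeholder name
  }),
});

stream.on('data', (chunk) => {
  // Naive parsing for brevity; a real implementation would handle partial/multiple JSON objects per chunk.
  const event = JSON.parse(chunk.toString());
  if (event.status === 'health_status: healthy') {
    // resolve the wait strategy here
  }
});

// ...then start the compose stack.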

jalaziz avatar Sep 25 '24 00:09 jalaziz

@cristianrgreco sorry for pinging you, just raising this to your attention because it does seem like a regression based on my findings so far.

jalaziz avatar Sep 26 '24 15:09 jalaziz

Could you provide a repro, @jalaziz? If the issue is that the event stream starts after the container is already healthy, it could likely be fixed by providing an earlier since, as documented here: https://docs.docker.com/reference/api/engine/version/v1.37/#tag/System/operation/SystemEvents. But without a repro I can't confirm that would be the solution. I've run many builds that use the updated health check wait strategy and not seen any issue, both in CI and locally on my Mac.
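
To illustrate the idea, a hypothetical sketch with dockerode (not the actual testcontainers code):

import Docker from 'dockerode';

const docker = new Docker();

// Ask the daemon to replay events from shortly before the container was started,
// so a health_status event that already fired is still delivered to a late subscriber.
const since = Math.floor(Date.now() / 1000) - 60; // 60s lookback is an arbitrary assumption
const stream = await docker.getEvents({
  since,
  filters: JSON.stringify({ type: ['container'], event: ['health_status'] }),
});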

cristianrgreco avatar Sep 26 '24 17:09 cristianrgreco

I will try to work on a repro, but so far it's only consistent on GitHub Actions.

An earlier since won't help, unfortunately. Events are stored in a rotating buffer that only holds 256 events at a time, so if you subscribe late and many events have been generated in the meantime, you will never see the one you care about. Put another way: unless you subscribe to events before the containers start, you always risk missing events on a system with many containers.

When using a docker compose file containing many images, it's not unusual for many events to be generated, which is what I believe is going on here.

jalaziz avatar Sep 26 '24 17:09 jalaziz

I think I can replicate this same issue using bun.sh as a runtime (possibly because of how much faster it is than node?) while trying to use testcontainers in a test setup.

I'm simply running this file with bun, and the container setup fails on the second or third run (once the image has been fetched). Running it with DEBUG=testcontainers* reveals the same issue: it cannot find the Started! log, even though I can verify it manually.

import { PostgreSqlContainer } from '@testcontainers/postgresql';

await new PostgreSqlContainer('postgres:17.0').start();

effervescentia avatar Oct 01 '24 02:10 effervescentia

Instead of observing the event stream for the health event, we can query the health status of the container. That is how we implemented DockerHealthcheckWaitStrategy in tc-java.
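
A rough sketch of what that polling could look like in testcontainers-node using dockerode (names and the poll interval are assumptions, not the actual tc-java code):

import Docker from 'dockerode';

const docker = new Docker();

// Poll the container's inspect data until State.Health.Status reports "healthy".
async function waitForHealthy(containerId: string, timeoutMs = 60_000): Promise<void> {
  const container = docker.getContainer(containerId);
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const info = await container.inspect();
    const status = info.State.Health?.Status;
    if (status === 'healthy') return;
    if (status === 'unhealthy') throw new Error('Container became unhealthy');
    await new Promise((resolve) => setTimeout(resolve, 500)); // poll every 500ms (assumption)
  }
  throw new Error(`Health check not healthy after ${timeoutMs}ms`);
}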

kiview avatar Oct 01 '24 12:10 kiview

It was previously implemented like that, @kiview. I was hoping that instead of polling we could get notified when it's ready, but it looks like we can't make any guarantees based on the event stream. I'll likely have to roll back this change.

cristianrgreco avatar Oct 01 '24 13:10 cristianrgreco

I'm seeing the same behavior in a Bun project - here's a simple way to reproduce the issue:

import { beforeAll } from "bun:test";
import { type StartedTestContainer } from "testcontainers";
import { PostgreSqlContainer } from "@testcontainers/postgresql";

let postgresContainer: StartedTestContainer;

beforeAll(async () => {
  const postgresDefinition = new PostgreSqlContainer("postgres");
  postgresContainer = await postgresDefinition.start();
})

// Tests using the container here

It looks like Bun Test exacerbates the issue, since it's also very fast (compared to the Jest testing framework most JS/TS projects use).

Does the health check wait strategy suffer from the same issue of querying the event stream? Or does it run on an interval (similar to Compose), so it's independent of container startup timing? If the health check works independently of runtime speed, then it should be possible to mitigate the flakiness of these tests by using a health check wait strategy in the meantime 😄

andersfischernielsen avatar Oct 11 '24 07:10 andersfischernielsen

@andersfischernielsen It seems I just encountered a similar issue when executing a very basic script through Bun:

import { PostgreSqlContainer } from "@testcontainers/postgresql";

const container = await new PostgreSqlContainer().start();
console.log(container.getHost(), container.getPort());
console.log(container.getConnectionUri());

Will fail with:

error: Log message "/.+ Started!/" not received after 60000ms
      at /Users/aaroncordova/repos/tcc/node_modules/testcontainers/build/wait-strategies/log-wait-strategy.js:27:24

But that seems to be a different regression from the one regarding the healthcheck implementation (since this is about the log wait strategy used for Ryuk).

kiview avatar Oct 11 '24 12:10 kiview

That error looks very similar to what I saw.

I eventually resolved the not received after 60000ms error with this, in case it helps anyone:

import { GenericContainer, Wait } from "testcontainers";

const postgresDefinition = new GenericContainer("postgres")
    .withHealthCheck({
      test: ["CMD-SHELL", "pg_isready -U postgres"],
      interval: 1000,
      timeout: 3000,
      retries: 5,
    })
    .withWaitStrategy(Wait.forHealthCheck());

andersfischernielsen avatar Oct 13 '24 20:10 andersfischernielsen

@andersfischernielsen Interesting, since I did not think that this error came from the actual Postgres container (it was never created for me), but from Ryuk.

If it is indeed Ryuk, disabling it can also help:

TESTCONTAINERS_RYUK_DISABLED=true

kiview avatar Oct 14 '24 14:10 kiview

@eddumelendez found this behavior in Bun, which sounds very related: https://github.com/oven-sh/bun/issues/13696

kiview avatar Oct 14 '24 15:10 kiview

Hello, do you plan to roll back to the polling version soon? We are experiencing the same issue as @jalaziz in our GitLab CI/CD pipeline.

vborza avatar Nov 04 '24 14:11 vborza

> @andersfischernielsen Interesting, since I did not think that this error came from the actual Postgres container (it was never created for me), but from Ryuk.
>
> If it is indeed Ryuk, disabling it can also help:
>
> TESTCONTAINERS_RYUK_DISABLED=true

I just found this very issue while running tests in Bun; disabling Ryuk seems to fix the issue for me.

lacion avatar Nov 14 '24 11:11 lacion