moby-ryuk
moby-ryuk copied to clipboard
Add HEALTHCHECK to Linux Docker image.
Summary
I ran across an issue using testcontainers in golang, specifically when testcontainers are used by multiple test packages in a single project. Go builds a test binary for each package and runs them simultaneously.
Example project tree, where test binaries for packages a
, b
, c
, and d
will run simultaneously, attempting to use a testcontainer:
.
├── a
│ └── a_test.go
├── b
│ └── b_test.go
├── c
│ └── c_test.go
├── d
│ └── d_test.go
├── go.mod
└── go.sum
One of the test binaries will win the race to create the ryuk container, and it becomes present in the list of docker containers returned by the Docker API.
The problem is that it the ryuk container can be exposed by the Docker API before the container is ready to accept connections. A testcontainer client can attempt to connect to the container before it is accepting connections.
This results in error messages tests like:
2024/04/21 22:25:10 🔥 Reaper obtained from Docker for this test session 99998b629eea57023f49aefc9eda5406c84ce4c91d0c6005c1b8c00dcbbce0d7
2024/04/21 22:25:10 failed to start container: dial tcp 127.0.0.1:32917: connect: connection refused: Connecting to Ryuk on localhost:32917 failed: connecting to reaper failed: failed to create container
Clone this test case repo and run go test -count=1 -v ./...
and observe the errors.
Run go test -count=1 -p 1 -v ./...
and tests should pass.
Prior issues
Race conditions (and the connection refused
error message) have been the topic of previous issues, c.f.
- https://github.com/testcontainers/testcontainers-go/issues/149
- https://github.com/testcontainers/testcontainers-go/issues/764
- https://github.com/testcontainers/testcontainers-go/pull/782
- PR #782 specifically addressed a race condition to create the container.
But this PR addresses the situation described here by adding a simple healthcheck using netstat
(already present in the alpine image).
Future Work
Once healthcheck information is exposed by the ryuk
container, the lookupReaperContainer func can retry until the status is healthy.
I'm not on a Windows platform, and I'm not familiar with nanoserver and whether it comes with Powershell, and whether a command like netstat -an | Select-String 8080
will work. This PR only adds a HEALTHCHECK
for the linux platform.
Unintended Consequences
Adding a healthcheck to ryuk may increase its perceived startup time.
Example Healthcheck Output
Before healthcheck (current state, i.e. main
):
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 25660,
"ExitCode": 0,
"Error": "",
"StartedAt": "2024-04-22T01:27:19.276032615Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
After healthcheck (this PR):
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 1703,
"ExitCode": 0,
"Error": "",
"StartedAt": "2024-04-22T01:46:01.038259025Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": {
"Status": "healthy",
"FailingStreak": 0,
"Log": [
{
"Start": "2024-04-22T01:46:03.039126401Z",
"End": "2024-04-22T01:46:03.102456526Z",
"ExitCode": 0,
"Output": ""
},
{
"Start": "2024-04-22T01:46:05.107285902Z",
"End": "2024-04-22T01:46:05.151179319Z",
"ExitCode": 0,
"Output": ""
},
{
"Start": "2024-04-22T01:46:07.154951153Z",
"End": "2024-04-22T01:46:07.243256403Z",
"ExitCode": 0,
"Output": ""
},
{
"Start": "2024-04-22T01:46:09.247204821Z",
"End": "2024-04-22T01:46:09.319083404Z",
"ExitCode": 0,
"Output": ""
}
]
}
},
Does Go create a single Ryuk instance for all packages?
Yes, exactly.
Does one package connect successfully and the others fail? Does it retry to connect if it fails?
One package typically connects successfully. With the other packages, it's a race: if they get the container before it's ready to accept connections, they fail.
In my testing, all remaining packages fail. But I can imagine in a larger codebase, some packages would succeed.
How does .NET work: is it a single ryuk instance? For example, can you launch two tests in parallel and re-create the behavior (issue)?
With respect to the languages/platforms supported by testcontainers
:
If other platforms are affected by this same issue, then they'll need a patch similar to https://github.com/testcontainers/testcontainers-go/pull/2508, which actually uses the results of the healthcheck before returning a (healthy) ryuk container.
How does .NET work: is it a single ryuk instance? For example, can you launch two tests in parallel and re-create the behavior (issue)?
Thanks for the quick explanation. .NET behaves different here. It creates a Ryuk instance for each test assembly (test project). Test assemblies are kind of isolated from each other. We do not need to deal with synchronization here.
I did encounter some things that I can't explain during my investigation, so I wanted to add them here for full transparency.
TL;DR: I failed to identify which component was denying the connection.
I have an overly-simple mental model of how Docker API+Docker Engine work together with their containers, so these issues could just be my ignorance.
The error message indicates the connection was denied, not a timeout.
So who denied the connection? The Docker engine? It wasn't the ryuk
container (see below).
I did not observe the denied TCP connection
When I was troubleshooting, I used tcpdump
- on the docker vm itself - to obtain traffic destined for ryuk. I could see successful connections/traffic, but I never saw a three-way handshake attempt that was denied.
Out of curiosity and to understand and compare the behavior (issue) with the .NET implementation: Does Go create a single Ryuk instance for all packages? Does one package connect successfully and the others fail? Does it retry to connect if it fails?
hey @HofmeisterAn, yes, Go creates one Ryuk container per test session, which implies some synchronisation logic to check if the ryuk container is in the running state before creating a new one.
@emetsger for the repro repository, I cloned it and it works for me with both values of -p
. I tried running it with Docker Desktop and Testcontainers Cloud, with the same result:
Using all processors
=== RUN TestA
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker:
Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
API Version: 1.43
Operating System: Ubuntu 22.04.4 LTS
Total Memory: 15779 MB
Resolved Docker Host: tcp://127.0.0.1:62301
Resolved Docker Socket Path: /var/run/docker.sock
Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
Test ProcessID: f234fbe5-4793-4cf1-923b-46edffa8668b
2024/05/09 07:32:17 🐳 Creating container for image testcontainers/ryuk:0.7.0
2024/05/09 07:32:17 ✅ Container created: 783b9d4e2406
2024/05/09 07:32:17 🐳 Starting container: 783b9d4e2406
2024/05/09 07:32:18 ✅ Container started: 783b9d4e2406
2024/05/09 07:32:18 🚧 Waiting for container id 783b9d4e2406 image: testcontainers/ryuk:0.7.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/05/09 07:32:18 🔔 Container is ready: 783b9d4e2406
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: e0cb0246b173
2024/05/09 07:32:18 🐳 Starting container: e0cb0246b173
2024/05/09 07:32:19 ✅ Container started: e0cb0246b173
2024/05/09 07:32:19 🚧 Waiting for container id e0cb0246b173 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003615b8 Strategies:[0x1400037d9b0]}
2024/05/09 07:32:21 🔔 Container is ready: e0cb0246b173
2024/05/09 07:32:21 🐳 Terminating container: e0cb0246b173
2024/05/09 07:32:21 🚫 Container terminated: e0cb0246b173
--- PASS: TestA (4.10s)
PASS
ok tc-debug/a 4.281s
=== RUN TestB
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker:
Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
API Version: 1.43
Operating System: Ubuntu 22.04.4 LTS
Total Memory: 15779 MB
Resolved Docker Host: tcp://127.0.0.1:62301
Resolved Docker Socket Path: /var/run/docker.sock
Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
Test ProcessID: 947a4325-eb63-4695-840c-87c38885c7ba
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: 8dba32be8360
2024/05/09 07:32:18 🐳 Starting container: 8dba32be8360
2024/05/09 07:32:18 ✅ Container started: 8dba32be8360
2024/05/09 07:32:18 🚧 Waiting for container id 8dba32be8360 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003d15d0 Strategies:[0x140003ed9e0]}
2024/05/09 07:32:21 🔔 Container is ready: 8dba32be8360
2024/05/09 07:32:21 🐳 Terminating container: 8dba32be8360
2024/05/09 07:32:21 🚫 Container terminated: 8dba32be8360
--- PASS: TestB (3.79s)
PASS
ok tc-debug/b 4.237s
=== RUN TestC
2024/05/09 07:32:17 github.com/testcontainers/testcontainers-go - Connected to docker:
Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
API Version: 1.43
Operating System: Ubuntu 22.04.4 LTS
Total Memory: 15779 MB
Resolved Docker Host: tcp://127.0.0.1:62301
Resolved Docker Socket Path: /var/run/docker.sock
Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
Test ProcessID: f38b47e6-f595-404a-8cb0-2534fce5916a
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: 614b2727db5c
2024/05/09 07:32:18 🐳 Starting container: 614b2727db5c
2024/05/09 07:32:18 ✅ Container started: 614b2727db5c
2024/05/09 07:32:18 🚧 Waiting for container id 614b2727db5c image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x14000393698 Strategies:[0x140003b39b0]}
2024/05/09 07:32:21 🔔 Container is ready: 614b2727db5c
2024/05/09 07:32:21 🐳 Terminating container: 614b2727db5c
2024/05/09 07:32:21 🚫 Container terminated: 614b2727db5c
--- PASS: TestC (3.88s)
PASS
ok tc-debug/c 4.201s
=== RUN TestD
2024/05/09 07:32:18 github.com/testcontainers/testcontainers-go - Connected to docker:
Server Version: 78+testcontainerscloud (via Testcontainers Desktop 1.11.0)
API Version: 1.43
Operating System: Ubuntu 22.04.4 LTS
Total Memory: 15779 MB
Resolved Docker Host: tcp://127.0.0.1:62301
Resolved Docker Socket Path: /var/run/docker.sock
Test SessionID: af1b8c124505cd4b386767859edf7b742c07386a7d37352650522cfcb2f4e6de
Test ProcessID: 5d324800-b68a-4687-90d1-b1216078c8f9
2024/05/09 07:32:18 🔥 Reaper obtained from Docker for this test session 783b9d4e24063bf41b22795afe79f38754769d21c5852532f25708dc849f72ea
2024/05/09 07:32:18 🐳 Creating container for image docker.io/postgres:16-alpine
2024/05/09 07:32:18 ✅ Container created: a533d8ad9249
2024/05/09 07:32:18 🐳 Starting container: a533d8ad9249
2024/05/09 07:32:18 ✅ Container started: a533d8ad9249
2024/05/09 07:32:18 🚧 Waiting for container id a533d8ad9249 image: docker.io/postgres:16-alpine. Waiting for: &{timeout:<nil> deadline:0x140003f95d0 Strategies:[0x140004659e0]}
2024/05/09 07:32:21 🔔 Container is ready: a533d8ad9249
2024/05/09 07:32:21 🐳 Terminating container: a533d8ad9249
2024/05/09 07:32:21 🚫 Container terminated: a533d8ad9249
--- PASS: TestD (3.83s)
PASS
ok tc-debug/d 4.425s
for the repro repository, I cloned it and it works for me with both values of -p. I tried running it with Docker Desktop and Testcontainers Cloud, with the same result:
@mdelapenya I have the same result with testcontainers-go and Docker Desktop, but when I run Docker with Colima I randomly have the issue when TestContainers wants to connect to Ryuk (95% of the time it triggers the issue)
Pinging @eddumelendez and @kiview as this is a cross-lang feature.
Can we decouple the switch to using a HEALTHCHECK
from fixing the race condition? While I am not against using a HEALTHCHECK
here per-se, I would prefer to first fix the race condition (this should be possible with the current waiting logic, right?). Also, I am fairly certain, that HEALTHCHECK
is not supported by all potential container runtimes (e.g. Podman). While we generally aim for first-class support of Docker, adding a substantial regression for alternative container runtimes might send a bad signal to the community, especially if it can be avoided. And last, we should benchmark the performance impact of switching to HEALTCHECK
as well (if we decide to use it).
@HofmeisterAn I guess we didn't fully sync up on the current approach to test session semantics (and technical measures to achieve them) across languages, back when we introduced this in Go (and AFAIK in Node as well). From a technical PoV, Go and .NET should be in a similar situation.
Closing as per https://github.com/testcontainers/testcontainers-go/pull/2508#issuecomment-2168475784
Thanks!