testcontainers-go
fix: logging deadlock
What does this PR do?
Fix a logging deadlock which was causing lots of test timeouts.
This refactors how logging shutdown is handled: it eliminates unnecessary captures, uses an idiomatic wait group to signal processor completion, and removes unnecessary nil initialisation.
Fix a race condition in the log tests, which were reading `Msg` while the processor was still running.
Switch to checking the `GITHUB_RUN_ID` environment variable to detect GitHub, as `XDG_RUNTIME_DIR` can be present in other situations.
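For reference, a minimal sketch of the wait-group shutdown pattern described above (illustrative only, not the PR's actual code; the `logProcessor` name and the string channel are made up):

```go
package main

import (
	"fmt"
	"sync"
)

// logProcessor is a hypothetical stand-in for the container log processor:
// it drains the logs channel and signals completion through the WaitGroup,
// so shutdown is simply close(logs) followed by wg.Wait().
func logProcessor(wg *sync.WaitGroup, logs <-chan string) {
	defer wg.Done() // signal completion on every exit path
	for line := range logs {
		fmt.Println("log:", line)
	}
}

func main() {
	logs := make(chan string)

	var wg sync.WaitGroup
	wg.Add(1)
	go logProcessor(&wg, logs)

	logs <- "container started"

	// Shutdown: closing the channel ends the processor's range loop and
	// Wait blocks until it has fully exited -- no unbuffered stop channel
	// that could deadlock when nobody is listening.
	close(logs)
	wg.Wait()
}
```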
Why is it important?
Tests and runtime are currently flaky due to a race condition during start and stop of logging.
How to test this PR
Testing is hard as this fixes a random race condition, but the tests should now be more reliable.
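For anyone trying to reproduce it locally, repeatedly running the log consumer tests with the race detector enabled is the usual approach (the test filter and count below are just examples):

```
go test -race -count=20 -run 'LogConsumer' ./...
```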
Deploy Preview for testcontainers-go ready!
| Name | Link |
|---|---|
| Latest commit | 35cb59d38cf6bc741bd2f6d80eea0f68206f09d9 |
| Latest deploy log | https://app.netlify.com/sites/testcontainers-go/deploys/660da0524323120008414c0e |
| Deploy Preview | https://deploy-preview-2346--testcontainers-go.netlify.app |
Looks like 95% of tests are in the green. `Test with reaper off` seems to be a common failure, but unlike other runs it's now just an issue with dangling containers; the tests themselves are passing. Is that a known issue?
I've taken a deep look into this error and, as you pointed out, it seems consistent for the reaper-off mode: there are containers that are not manually cleaned up by the tests. The test that leaks the container builds it from a Dockerfile, which makes me think the culprit is `Test_MultiContainerLogConsumer_CancelledContext` not killing the first container that is created.
Could you please take a look? 🙏
@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.
The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.
Here is the flow:
- https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met. We hit the GOTO.
- https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches, we return. I don't have the error off hand but I believe it was `context cancelled`.
- https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 The defer runs.
- https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop, so this deadlocks, as that channel is unbuffered (a reduced sketch of this is below).
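To illustrate that last step: the producer goroutine has already returned, so a later send on the unbuffered stop channel blocks forever. A minimal standalone reproduction of the pattern (names and structure are made up, not the library's actual code; running it ends in Go's deadlock panic by design):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startProducer is a stand-in for the old log producer: on a context error
// it returns early and never reads from stop again.
func startProducer(ctx context.Context, stop <-chan struct{}) {
	go func() {
		select {
		case <-ctx.Done():
			return // producer exits; nobody will ever receive from stop
		case <-stop:
			return
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	stop := make(chan struct{}) // unbuffered

	startProducer(ctx, stop)
	cancel()                           // the test cancels the context
	time.Sleep(100 * time.Millisecond) // let the producer goroutine exit

	fmt.Println("stopping...")
	stop <- struct{}{} // deadlock: no receiver left and the channel is unbuffered
	fmt.Println("never reached")
}
```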
That seems to be an analysis of the old code; it shouldn't be possible with the new version?
@stevenh Probably not, I was more or less replying to https://github.com/testcontainers/testcontainers-go/pull/2346#issuecomment-2018211902.
I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.
If you like, I can commit my reduced test case where I was able to replicate this (at least with high probability), then test against that?
The `Test with reaper off` failure looks like a secondary issue.
What's happening is the test cancels the main context, causing the logger to stop. This returns a context cancelled error, which is then returned by `stopLogProduction` via `terminatingHook` during `Test_MultiContainerLogConsumer_CancelledContext`'s deferred calls to `c.Terminate` and `c2.Terminate`, and that prevents `Terminate` from cleaning up properly because it returns early.
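In test-shaped form, the problematic pattern looks roughly like this (a sketch of the scenario only, not the actual `Test_MultiContainerLogConsumer_CancelledContext`; the image, command, and test name are placeholders, and the log consumer wiring that triggers the terminating hook is omitted for brevity):

```go
package testcontainers_test

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	"github.com/testcontainers/testcontainers-go"
)

func TestCancelledContextCleanup(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	c, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
		ContainerRequest: testcontainers.ContainerRequest{
			Image: "alpine:latest",
			Cmd:   []string{"sleep", "60"},
		},
		Started: true,
	})
	require.NoError(t, err)

	// With the old behaviour, once ctx is cancelled the terminating hook
	// returns a context cancelled error and Terminate returns early,
	// leaving the container dangling.
	defer c.Terminate(ctx)

	cancel() // the test cancels the main context part-way through
}
```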
There are a few ways to fix this:

1. Don't return context cancelled errors from `stopLogProduction`.
2. Don't return context cancelled errors from `defaultLogConsumersHook.PreTerminates`.
3. Ignore context cancelled errors from hooks in `Terminate`.
4. Perform best-effort clean up in `Terminate` and return a multi error if needed.
It feels to me like option 4 would be the best solution, as it puts the caller in control of the behaviour.
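As a rough illustration of option 4 (a sketch under assumed step names, not the actual patch), `Terminate` would run every cleanup step and join any failures instead of returning on the first one:

```go
import (
	"context"
	"errors"
)

// terminate runs every cleanup step (stop log production, stop and remove the
// container, remove volumes, ...) even if earlier steps fail, then reports all
// failures at once so the caller stays in control of how to handle them.
func terminate(ctx context.Context, steps ...func(context.Context) error) error {
	var errs []error
	for _, step := range steps {
		if err := step(ctx); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every step succeeded
}
```

`errors.Join` (Go 1.20+) returns nil when no errors were collected, so callers only see a multi error when something genuinely failed.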
A tertiary issue then becomes: should all `lifecycleHooks` iterators, e.g. `terminatedHook` and hence `containerHookFn`, also be best effort?
Thoughts?
If you have a reproduction and want to try this branch plus the attached patch, which makes lifecycle hooks and `Terminate` best effort, that would be helpful to confirm it fixes your case too.
I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the `Terminate` method, you want to continue with the terminated hook even if removing the container failed, right?
Yes, that makes sense to me. I have a version of this from last night that I'll tweak and put up to see what folks think.