
fix: logging deadlock

Open stevenh opened this issue 1 year ago • 2 comments

What does this PR do?

Fix a logging deadlock that was causing lots of test timeouts.

This refactors how logging shutdown is handled: it eliminates unnecessary captures, uses an idiomatic wait group to signal processor completion, and removes unnecessary nil initialisation.
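
As a rough illustration of the wait-group pattern, here is a minimal, self-contained sketch; the logProducer type and its methods are hypothetical, not the actual testcontainers-go code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// logProducer is a hypothetical stand-in for the container log processor.
type logProducer struct {
	wg   sync.WaitGroup
	stop chan struct{}
}

// Start launches the processing goroutine. Done is deferred so completion
// is always signalled, even if the goroutine returns early on an error.
func (p *logProducer) Start() {
	p.stop = make(chan struct{})
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		for {
			select {
			case <-p.stop:
				return
			case <-time.After(10 * time.Millisecond):
				// read and dispatch log lines here
			}
		}
	}()
}

// Stop asks the goroutine to exit and waits for it. Unlike a send on an
// unbuffered ack channel, Wait cannot block forever if the goroutine has
// already returned.
func (p *logProducer) Stop() {
	close(p.stop)
	p.wg.Wait()
}

func main() {
	p := &logProducer{}
	p.Start()
	time.Sleep(50 * time.Millisecond)
	p.Stop()
	fmt.Println("log producer stopped cleanly")
}
```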

Fix a race condition in the log tests, which were reading Msg while the processor was still running.

Switch to checking the GITHUB_RUN_ID environment variable to detect GitHub Actions, as XDG_RUNTIME_DIR can be present in other situations.
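
For reference, a minimal sketch of that kind of detection (GITHUB_RUN_ID is set by GitHub Actions for every workflow run); this is an illustration, not the exact check in the PR:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// GITHUB_RUN_ID is set by GitHub Actions for every workflow run,
	// whereas XDG_RUNTIME_DIR can also be present in ordinary Linux sessions.
	if _, ok := os.LookupEnv("GITHUB_RUN_ID"); ok {
		fmt.Println("running inside GitHub Actions")
	}
}
```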

Why is it important?

Tests and runtime are currently flaky due to a race condition during the start and stop of logging.

How to test this PR

Testing is hard as this fixes a random race condition, but the tests should now be more reliable.

stevenh avatar Mar 12 '24 23:03 stevenh

Deploy Preview for testcontainers-go ready!

Latest commit: 35cb59d38cf6bc741bd2f6d80eea0f68206f09d9
Latest deploy log: https://app.netlify.com/sites/testcontainers-go/deploys/660da0524323120008414c0e
Deploy Preview: https://deploy-preview-2346--testcontainers-go.netlify.app

netlify[bot] avatar Mar 12 '24 23:03 netlify[bot]

Looks like 95% of the tests are green. Test with reaper off seems to be a common failure, but unlike other runs it's now just an issue with dangling containers; the tests themselves are passing. Is that a known issue?

stevenh avatar Mar 13 '24 09:03 stevenh

Looks like 95% of the tests are green. Test with reaper off seems to be a common failure, but unlike other runs it's now just an issue with dangling containers; the tests themselves are passing. Is that a known issue?

I've taken a deep look into this error and, as you pointed out, it seems consistent for the reaper-off mode: there are containers that are not manually cleaned up by the tests, and the test that leaks the container builds it from a Dockerfile. That makes me think the culprit is Test_MultiContainerLogConsumer_CancelledContext not killing the first container it creates.

Could you please take a look? 🙏

mdelapenya avatar Mar 25 '24 15:03 mdelapenya

@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.

The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.

Here is the flow

  1. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met, so we hit the goto.
  2. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches and we return. I don't have the error to hand, but I believe it was context cancelled.
  3. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 The defer runs.
  4. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop, so this deadlocks here, as that channel is unbuffered (see the sketch below).
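
To make the deadlock concrete, here is a minimal, runnable sketch of the pattern described in steps 1-4; the names (producer, run, Stop) are hypothetical and this is not the library's actual code:

```go
package main

import (
	"context"
	"time"
)

// producer is a hypothetical stand-in for the old log producer.
type producer struct {
	stop chan struct{} // unbuffered: a send blocks until someone receives
}

// run is the processing goroutine; on context cancellation it returns
// without ever receiving from p.stop (steps 1-3 above).
func (p *producer) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-p.stop:
			return
		}
	}
}

// Stop signals shutdown by sending on the unbuffered channel (step 4).
// If run has already returned, this send blocks forever.
func (p *producer) Stop() {
	p.stop <- struct{}{}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	p := &producer{stop: make(chan struct{})}
	go p.run(ctx)

	cancel()                          // goroutine exits on its own
	time.Sleep(10 * time.Millisecond) // give it time to return
	p.Stop()                          // deadlock: no receiver left
}
```

With the wait-group approach in this PR, Stop waits for the goroutine to finish instead of sending on a channel, so an early return cannot strand the sender.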

bearrito avatar Mar 25 '24 16:03 bearrito

@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.

The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.

Here is the flow

  1. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met, so we hit the goto.
  2. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches and we return. I don't have the error to hand, but I believe it was context cancelled.
  3. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 The defer runs.
  4. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop, so this deadlocks here, as that channel is unbuffered.

That seems to be an analysis of the old code; that shouldn't be possible with the new version?

stevenh avatar Apr 02 '24 21:04 stevenh

@stevenh Probably not, I was more or less replying to https://github.com/testcontainers/testcontainers-go/pull/2346#issuecomment-2018211902.

I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.

If you like, I can commit my reduced test case where I was able to replicate this (at least with high probability), then test against that?

bearrito avatar Apr 02 '24 21:04 bearrito

The Test with reaper off failure looks like a secondary issue.

What's happening is that the test cancels the main context, causing the logger to stop. This returns a context cancelled error, which is subsequently returned by stopLogProduction in terminatingHook during Test_MultiContainerLogConsumer_CancelledContext's deferred calls to c.Terminate and c2.Terminate, which prevents Terminate from cleaning up properly as it returns early.

There are a few ways to fix this:

  1. Don't return context cancelled errors from stopLogProduction.
  2. Don't return context cancelled errors from defaultLogConsumersHook.PreTerminates.
  3. Ignore context cancelled errors from hooks in Terminate.
  4. Perform best-effort clean-up in Terminate and return a multi-error if needed.

It feels to me like option 4 would be the best solution, as it puts the caller in control of the behaviour.
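
A rough sketch of what option 4 could look like using errors.Join (Go 1.20+); the types and hook names here are placeholders, not the actual testcontainers-go API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// container and its hooks are hypothetical stand-ins for the real types.
type hook func(ctx context.Context) error

type container struct {
	terminatingHooks []hook
	terminatedHooks  []hook
	remove           func(ctx context.Context) error
}

// Terminate is best effort: every step runs, errors are collected and
// returned joined, so a cancelled context from a pre-terminate hook no
// longer stops the container from being removed.
func (c *container) Terminate(ctx context.Context) error {
	var errs []error

	for _, h := range c.terminatingHooks {
		if err := h(ctx); err != nil {
			errs = append(errs, fmt.Errorf("terminating hook: %w", err))
		}
	}

	if err := c.remove(ctx); err != nil {
		errs = append(errs, fmt.Errorf("remove container: %w", err))
	}

	for _, h := range c.terminatedHooks {
		if err := h(ctx); err != nil {
			errs = append(errs, fmt.Errorf("terminated hook: %w", err))
		}
	}

	return errors.Join(errs...) // nil when no errors were collected
}

func main() {
	c := &container{
		terminatingHooks: []hook{func(context.Context) error { return context.Canceled }},
		terminatedHooks:  []hook{func(context.Context) error { return nil }},
		remove:           func(context.Context) error { return nil },
	}
	fmt.Println("terminate error:", c.Terminate(context.Background()))
}
```

This keeps the caller in control: Terminate always attempts every step, and the caller decides how to handle the combined error.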

A tertiary issue then becomes: should all lifecycleHooks iterators, e.g. terminatedHook and hence containerHookFn, also be best effort?

Thoughts?

stevenh avatar Apr 02 '24 22:04 stevenh

@stevenh Probably not, I was more or less replying to #2346 (comment).

I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.

best-effort.patch

If you like, I can commit my reduced test case where I was able to replicate this (at least with high probability), then test against that?

If you have a reproduction and want to try this branch plus the attached patch, which makes lifecycles and Terminate best effort, that would be helpful to confirm it fixes your case too.

stevenh avatar Apr 02 '24 23:04 stevenh

I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the terminate method, you want to continue with the terminated hook even if the removal of the container failed, right?

mdelapenya avatar Apr 03 '24 11:04 mdelapenya

I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the terminate method, you want to continue with the terminated hook even if the removal of the container failed, right?

Yes, that makes sense to me. I have a version of this from last night that I'll tweak and put up to see what folks think.

stevenh avatar Apr 03 '24 15:04 stevenh