
fix: logging deadlock

Open stevenh opened this issue 1 year ago • 2 comments

What does this PR do?

Fix a logging deadlock that was causing lots of test timeouts.

This refactors how logging shutdown is handled: it eliminates unnecessary captures, uses an idiomatic wait group to signal processor completion, and removes unnecessary nil initialisation.
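
As a rough illustration of the wait-group pattern, here is a minimal, self-contained sketch; the logProducer type and its methods are hypothetical, not the actual testcontainers-go code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// logProducer is a hypothetical stand-in for the container log processor.
type logProducer struct {
	wg   sync.WaitGroup
	stop chan struct{}
}

// Start launches the processing goroutine. Done is deferred so completion
// is always signalled, even if the goroutine returns early on an error.
func (p *logProducer) Start() {
	p.stop = make(chan struct{})
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		for {
			select {
			case <-p.stop:
				return
			case <-time.After(10 * time.Millisecond):
				// read and dispatch log lines here
			}
		}
	}()
}

// Stop asks the goroutine to exit and waits for it. Unlike a send on an
// unbuffered ack channel, Wait cannot block forever if the goroutine has
// already returned.
func (p *logProducer) Stop() {
	close(p.stop)
	p.wg.Wait()
}

func main() {
	p := &logProducer{}
	p.Start()
	time.Sleep(50 * time.Millisecond)
	p.Stop()
	fmt.Println("log producer stopped cleanly")
}
```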

Fix a race condition in the log tests, which were reading Msg while the processor was still running.

Switch to checking the GITHUB_RUN_ID environment variable to detect GitHub Actions, as XDG_RUNTIME_DIR can be present in other situations.
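
For reference, a minimal sketch of that kind of detection (GITHUB_RUN_ID is set by GitHub Actions for every workflow run); this is an illustration, not the exact check in the PR:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// GITHUB_RUN_ID is set by GitHub Actions for every workflow run,
	// whereas XDG_RUNTIME_DIR can also be present in ordinary Linux sessions.
	if _, ok := os.LookupEnv("GITHUB_RUN_ID"); ok {
		fmt.Println("running inside GitHub Actions")
	}
}
```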

Why is it important?

Tests and runtime are currently flaky due to a race condition during the start and stop of logging.

How to test this PR

Testing is hard as this fixes a random race condition, but the tests should now be more reliable.

stevenh avatar Mar 12 '24 23:03 stevenh

Deploy Preview for testcontainers-go ready!

Latest commit: 35cb59d38cf6bc741bd2f6d80eea0f68206f09d9
Latest deploy log: https://app.netlify.com/sites/testcontainers-go/deploys/660da0524323120008414c0e
Deploy Preview: https://deploy-preview-2346--testcontainers-go.netlify.app

netlify[bot] avatar Mar 12 '24 23:03 netlify[bot]

Looks like 95% of the tests are green. Test with reaper off seems to be a common failure, but unlike other runs it's now just an issue with dangling containers; the tests themselves are passing. Is that a known issue?

stevenh avatar Mar 13 '24 09:03 stevenh

Looks like 95% of the tests are green. Test with reaper off seems to be a common failure, but unlike other runs it's now just an issue with dangling containers; the tests themselves are passing. Is that a known issue?

I've taken a deep look into this error and, as you pointed out, it seems consistent for the reaper-off mode: there are containers that are not manually cleaned up by the tests, and the test that leaks the container builds it from a Dockerfile. That makes me think the culprit is Test_MultiContainerLogConsumer_CancelledContext not killing the first container it creates.

Could you please take a look? 🙏

mdelapenya avatar Mar 25 '24 15:03 mdelapenya

@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.

The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.

Here is the flow

  1. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met, so we hit the goto.
  2. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches and we return. I don't have the error to hand, but I believe it was context cancelled.
  3. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 The defer runs.
  4. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop, so this deadlocks here, as that channel is unbuffered (see the sketch below).
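
To make the deadlock concrete, here is a minimal, runnable sketch of the pattern described in steps 1-4; the names (producer, run, Stop) are hypothetical and this is not the library's actual code:

```go
package main

import (
	"context"
	"time"
)

// producer is a hypothetical stand-in for the old log producer.
type producer struct {
	stop chan struct{} // unbuffered: a send blocks until someone receives
}

// run is the processing goroutine; on context cancellation it returns
// without ever receiving from p.stop (steps 1-3 above).
func (p *producer) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-p.stop:
			return
		}
	}
}

// Stop signals shutdown by sending on the unbuffered channel (step 4).
// If run has already returned, this send blocks forever.
func (p *producer) Stop() {
	p.stop <- struct{}{}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	p := &producer{stop: make(chan struct{})}
	go p.run(ctx)

	cancel()                          // goroutine exits on its own
	time.Sleep(10 * time.Millisecond) // give it time to return
	p.Stop()                          // deadlock: no receiver left
}
```

With the wait-group approach in this PR, Stop waits for the goroutine to finish instead of sending on a channel, so an early return cannot strand the sender.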

bearrito avatar Mar 25 '24 16:03 bearrito

@mdelapenya @stevenh Not sure if this is helpful, but I ran into this issue (linked above) and debugged it.

The multi-consumer aspect didn't have anything to do with it. I could replicate this with a single consumer. I can force the bug about 1/15 times.

Here is the flow

  1. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L726 This condition is met, so we hit the goto.
  2. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L710 This condition matches and we return. I don't have the error to hand, but I believe it was context cancelled.
  3. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L686 The defer runs.
  4. https://github.com/testcontainers/testcontainers-go/blob/main/docker.go#L791 There is no longer a consumer for stop, so this deadlocks here, as that channel is unbuffered.

That seems to be an analysis of the old code; that shouldn't be possible with the new version?

stevenh avatar Apr 02 '24 21:04 stevenh

@stevenh Probably not, I was more or less replying to https://github.com/testcontainers/testcontainers-go/pull/2346#issuecomment-2018211902.

I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.

If you like, I can commit my reduced test case where I was able to replicate this (at least with high probability), then test against that?

bearrito avatar Apr 02 '24 21:04 bearrito

The Test with reaper off failure looks like a secondary issue.

What's happening is that the test cancels the main context, causing the logger to stop. This returns a context cancelled error, which is subsequently returned by stopLogProduction in terminatingHook during Test_MultiContainerLogConsumer_CancelledContext's deferred calls to c.Terminate and c2.Terminate, which prevents Terminate from cleaning up properly as it returns early.

There are a few ways to fix this:

  1. Don't return context cancelled errors from stopLogProduction.
  2. Don't return context cancelled errors from defaultLogConsumersHook.PreTerminates.
  3. Ignore context cancelled errors from hooks in Terminate.
  4. Perform best-effort clean-up in Terminate and return a multi-error if needed.

It feels to me like option 4 would be the best solution, as it puts the caller in control of the behaviour.
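
A rough sketch of what option 4 could look like using errors.Join (Go 1.20+); the types and hook names here are placeholders, not the actual testcontainers-go API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// container and its hooks are hypothetical stand-ins for the real types.
type hook func(ctx context.Context) error

type container struct {
	terminatingHooks []hook
	terminatedHooks  []hook
	remove           func(ctx context.Context) error
}

// Terminate is best effort: every step runs, errors are collected and
// returned joined, so a cancelled context from a pre-terminate hook no
// longer stops the container from being removed.
func (c *container) Terminate(ctx context.Context) error {
	var errs []error

	for _, h := range c.terminatingHooks {
		if err := h(ctx); err != nil {
			errs = append(errs, fmt.Errorf("terminating hook: %w", err))
		}
	}

	if err := c.remove(ctx); err != nil {
		errs = append(errs, fmt.Errorf("remove container: %w", err))
	}

	for _, h := range c.terminatedHooks {
		if err := h(ctx); err != nil {
			errs = append(errs, fmt.Errorf("terminated hook: %w", err))
		}
	}

	return errors.Join(errs...) // nil when no errors were collected
}

func main() {
	c := &container{
		terminatingHooks: []hook{func(context.Context) error { return context.Canceled }},
		terminatedHooks:  []hook{func(context.Context) error { return nil }},
		remove:           func(context.Context) error { return nil },
	}
	fmt.Println("terminate error:", c.Terminate(context.Background()))
}
```

This keeps the caller in control: Terminate always attempts every step, and the caller decides how to handle the combined error.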

A tertiary issue then becomes: should all lifecycleHooks iterators, e.g. terminatedHook and hence containerHookFn, also be best effort?

Thoughts?

stevenh avatar Apr 02 '24 22:04 stevenh

@stevenh Probably not, I was more or less replying to #2346 (comment).

I was also hitting this in CI and wondered what was happening. I think your waitGroup approach should work.

best-effort.patch

If you like, I can commit my reduced test case where I was able to replicate this (at least with high probability), then test against that?

If you have a reproduction and want to try this branch plus the attached patch, which makes lifecycles and Terminate best effort, that would be helpful to confirm it fixes your case too.

stevenh avatar Apr 02 '24 23:04 stevenh

I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the terminate method, you want to continue with the terminated hook even if the removal of the container failed, right?

mdelapenya avatar Apr 03 '24 11:04 mdelapenya

I like the best-effort patch for the lifecycle, which will make the potential errors more evident. For the terminate method, you want to continue with the terminated hook even if the removal of the container failed, right?

Yes, that makes sense to me. I have a version of this from last night that I'll tweak and put up to see what folks think.

stevenh avatar Apr 03 '24 15:04 stevenh