dd-trace-go icon indicating copy to clipboard operation
dd-trace-go copied to clipboard

internal/globalconfig: move mutex un/lock from PushStat to StatsCarrier

Open darccio opened this issue 3 months ago • 6 comments

What does this PR do?

Rearranges globalconfig locking operations in PushStat, that only calls a function that pushes into a channel, to avoid deadlocks with SetStatsCarrier. If PushStat locks while adding a stat into StatsCarrier's channel, it locks any further function using the same globalconfig mutex.

It's been tested locally. It no longer fails.

Motivation

Avoid failing tests due to deadlocks:

goroutine 738 [sync.RWMutex.Lock]:
sync.runtime_SemacquireRWMutex(0x12addca40?, 0x18?, 0x18?)
	/Users/dario.castane/sdk/go1.21.5/src/runtime/sema.go:87 +0x28
sync.(*RWMutex).Lock(0x14000481eb8?)
	/Users/dario.castane/sdk/go1.21.5/src/sync/rwmutex.go:152 +0xf8
gopkg.in/DataDog/dd-trace-go.v1/internal/globalconfig.SetServiceName({0x1050f2c11, 0xa})
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/internal/globalconfig/globalconfig.go:60 +0x3c
gopkg.in/DataDog/dd-trace-go.v1/contrib/internal/namingschematest.withDDService({0x1050f2c11, 0xa})
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/internal/namingschematest/namingschematest.go:131 +0x80
gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql.TestNamingSchema.func4.NewServiceNameTest.func3.1(0x0?)
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/internal/namingschematest/namingschematest.go:88 +0x4c
testing.tRunner(0x14000c501a0, 0x14001db8ac8)
	/Users/dario.castane/sdk/go1.21.5/src/testing/testing.go:1595 +0xe8
created by testing.(*T).Run in goroutine 702
	/Users/dario.castane/sdk/go1.21.5/src/testing/testing.go:1648 +0x33c

[...]

goroutine 566 [chan send]:
gopkg.in/DataDog/dd-trace-go.v1/internal.(*StatsCarrier).Add(...)
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/internal/statsd.go:219
gopkg.in/DataDog/dd-trace-go.v1/internal/globalconfig.PushStat({0x10543fea8?, 0x14001a0a7c0?})
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/internal/globalconfig/globalconfig.go:114 +0xe0
gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql.pollDBStats(0x140000af040, {0x14000ae1140, 0x1, 0x1})
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/database/sql/metrics.go:44 +0x194
created by gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql.OpenDB in goroutine 564
	/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/database/sql/sql.go:215 +0x2dc

Reviewer's Checklist

  • [x] Changed code has unit tests for its functionality at or near 100% coverage.
  • [ ] System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • [ ] There is a benchmark for any new code, or changes to existing code.
  • [ ] If this interacts with the agent in a new way, a system test has been added.
  • [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
  • [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

For Datadog employees:

  • [ ] If this PR touches code that handles credentials of any kind, such as Datadog API keys, I've requested a review from @DataDog/security-design-and-guidance.
  • [x] This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

darccio avatar Mar 05 '24 11:03 darccio

Benchmarks

Benchmark execution time: 2024-03-05 16:09:42

Comparing candidate commit dc64820c060e3d77fc3def6f99f7676bd260a903 in PR branch dario.castane/fix-database-sql-metrics-deadlock with baseline commit 47f33037d379214bebcacbd68b8b76bbbf6d5606 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 40 metrics, 1 unstable metrics.

pr-commenter[bot] avatar Mar 05 '24 11:03 pr-commenter[bot]

Sorry for the comment spam, just trying to put the pieces together here so we can understand the issue, and in particular whether this is a user-facing problem or just a testing problem.

  • We see the contrib/database/sql test suite timing out while running TestNamingSchema
  • TestNamingSchema in turn dispatches to contrib/internal/namingschematest/NewServiceNameTest
  • NewServiceNameTest is blocked in withDDService trying to acquire a write lock to set the global service name (link)
  • The same lock is held while pollDBStats attempts to send a value to a stats channel for the global stats client (link)
  • The channel send will block unless there is a goroutine running StatsCarrier.run to receive from the same channel. Since we don't see that function in the crash output, we conclude that it's not running and that the send will block indefinitely.
  • The global stats carrier is updated in this test in the contrib/database/sql package:
    • https://github.com/DataDog/dd-trace-go/blob/63d7047cfd4d8e3d078dc5460fb385e544c2eb1d/contrib/database/sql/sql_test.go#L281-L291
  • Note that the stats carrier will be stopped at the end of that test case, meaning the goroutine reading from the stats channel will exit, but the global stats carrier stays the same. So subsequent tests will use the stopped stats carrier.

That's the sequence of events leading to the motivating test failures. As for whether this is a user-facing issue, we should think about whether they can get in a situation where the global stats carrier is set but stopped. I think my idea above (have StatsCarrier.Add return if the carrier is stopped) is still reasonable. Having background goroutines get stuck forever trying to send to the closed channel could be a resource leak.

nsrip-dd avatar Mar 05 '24 17:03 nsrip-dd

Sorry for the comment spam, just trying to put the pieces together here so we can understand the issue, and in particular whether this is a user-facing problem or just a testing problem.

Thanks for the detailed analysis. I was missing the last point.

I think my idea above (have StatsCarrier.Add return if the carrier is stopped) is still reasonable. Having background goroutines get stuck forever trying to send to the closed channel could be a resource leak.

I agree. I think that to avoid delaying the next release we could merge #2594 and review later your analysis to find a fix for the issue. WDYT @dianashevchenko?

darccio avatar Mar 06 '24 08:03 darccio

@darccio I will go ahead and do that. Are we sure those PR changes are all that's necessary?

dianashevchenko avatar Mar 06 '24 16:03 dianashevchenko

Are we sure those PR changes are all that's necessary?

I'm not sure. Let's review it later.

darccio avatar Mar 06 '24 16:03 darccio

I definitely support removing the StatsCarrier() method - IIRC I meant to remove that anyway :)

mtoffl01 avatar Mar 07 '24 16:03 mtoffl01

@mtoffl01 Is this PR useful to you? Maybe we can close it.

darccio avatar Mar 20 '24 15:03 darccio

@mtoffl01 Is this PR useful to you? Maybe we can close it.

The new implementation for dbstats does not require this statscarrier type anyway, so we can definitely close this PR 👍

mtoffl01 avatar Mar 20 '24 16:03 mtoffl01