dd-trace-go
dd-trace-go copied to clipboard
internal/globalconfig: move mutex un/lock from PushStat to StatsCarrier
What does this PR do?
Rearranges globalconfig
locking operations in PushStat
, that only calls a function that pushes into a channel, to avoid deadlocks with SetStatsCarrier
. If PushStat
locks while adding a stat into StatsCarrier's channel, it locks any further function using the same globalconfig
mutex.
It's been tested locally. It no longer fails.
Motivation
Avoid failing tests due to deadlocks:
goroutine 738 [sync.RWMutex.Lock]:
sync.runtime_SemacquireRWMutex(0x12addca40?, 0x18?, 0x18?)
/Users/dario.castane/sdk/go1.21.5/src/runtime/sema.go:87 +0x28
sync.(*RWMutex).Lock(0x14000481eb8?)
/Users/dario.castane/sdk/go1.21.5/src/sync/rwmutex.go:152 +0xf8
gopkg.in/DataDog/dd-trace-go.v1/internal/globalconfig.SetServiceName({0x1050f2c11, 0xa})
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/internal/globalconfig/globalconfig.go:60 +0x3c
gopkg.in/DataDog/dd-trace-go.v1/contrib/internal/namingschematest.withDDService({0x1050f2c11, 0xa})
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/internal/namingschematest/namingschematest.go:131 +0x80
gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql.TestNamingSchema.func4.NewServiceNameTest.func3.1(0x0?)
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/internal/namingschematest/namingschematest.go:88 +0x4c
testing.tRunner(0x14000c501a0, 0x14001db8ac8)
/Users/dario.castane/sdk/go1.21.5/src/testing/testing.go:1595 +0xe8
created by testing.(*T).Run in goroutine 702
/Users/dario.castane/sdk/go1.21.5/src/testing/testing.go:1648 +0x33c
[...]
goroutine 566 [chan send]:
gopkg.in/DataDog/dd-trace-go.v1/internal.(*StatsCarrier).Add(...)
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/internal/statsd.go:219
gopkg.in/DataDog/dd-trace-go.v1/internal/globalconfig.PushStat({0x10543fea8?, 0x14001a0a7c0?})
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/internal/globalconfig/globalconfig.go:114 +0xe0
gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql.pollDBStats(0x140000af040, {0x14000ae1140, 0x1, 0x1})
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/database/sql/metrics.go:44 +0x194
created by gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql.OpenDB in goroutine 564
/Users/dario.castane/go/src/github.com/DataDog/dd-trace-go/contrib/database/sql/sql.go:215 +0x2dc
Reviewer's Checklist
- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [ ] System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
- [ ] There is a benchmark for any new code, or changes to existing code.
- [ ] If this interacts with the agent in a new way, a system test has been added.
- [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.
For Datadog employees:
- [ ] If this PR touches code that handles credentials of any kind, such as Datadog API keys, I've requested a review from
@DataDog/security-design-and-guidance
. - [x] This PR doesn't touch any of that.
Unsure? Have a question? Request a review!
Benchmarks
Benchmark execution time: 2024-03-05 16:09:42
Comparing candidate commit dc64820c060e3d77fc3def6f99f7676bd260a903 in PR branch dario.castane/fix-database-sql-metrics-deadlock
with baseline commit 47f33037d379214bebcacbd68b8b76bbbf6d5606 in branch main
.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 40 metrics, 1 unstable metrics.
Sorry for the comment spam, just trying to put the pieces together here so we can understand the issue, and in particular whether this is a user-facing problem or just a testing problem.
- We see the
contrib/database/sql
test suite timing out while runningTestNamingSchema
-
TestNamingSchema
in turn dispatches tocontrib/internal/namingschematest/NewServiceNameTest
-
NewServiceNameTest
is blocked inwithDDService
trying to acquire a write lock to set the global service name (link) - The same lock is held while
pollDBStats
attempts to send a value to a stats channel for the global stats client (link) - The channel send will block unless there is a goroutine running
StatsCarrier.run
to receive from the same channel. Since we don't see that function in the crash output, we conclude that it's not running and that the send will block indefinitely. - The global stats carrier is updated in this test in the
contrib/database/sql
package:- https://github.com/DataDog/dd-trace-go/blob/63d7047cfd4d8e3d078dc5460fb385e544c2eb1d/contrib/database/sql/sql_test.go#L281-L291
- Note that the stats carrier will be stopped at the end of that test case, meaning the goroutine reading from the stats channel will exit, but the global stats carrier stays the same. So subsequent tests will use the stopped stats carrier.
That's the sequence of events leading to the motivating test failures. As for whether this is a user-facing issue, we should think about whether they can get in a situation where the global stats carrier is set but stopped. I think my idea above (have StatsCarrier.Add
return if the carrier is stopped) is still reasonable. Having background goroutines get stuck forever trying to send to the closed channel could be a resource leak.
Sorry for the comment spam, just trying to put the pieces together here so we can understand the issue, and in particular whether this is a user-facing problem or just a testing problem.
Thanks for the detailed analysis. I was missing the last point.
I think my idea above (have
StatsCarrier.Add
return if the carrier is stopped) is still reasonable. Having background goroutines get stuck forever trying to send to the closed channel could be a resource leak.
I agree. I think that to avoid delaying the next release we could merge #2594 and review later your analysis to find a fix for the issue. WDYT @dianashevchenko?
@darccio I will go ahead and do that. Are we sure those PR changes are all that's necessary?
Are we sure those PR changes are all that's necessary?
I'm not sure. Let's review it later.
I definitely support removing the StatsCarrier() method - IIRC I meant to remove that anyway :)
@mtoffl01 Is this PR useful to you? Maybe we can close it.
@mtoffl01 Is this PR useful to you? Maybe we can close it.
The new implementation for dbstats does not require this statscarrier type anyway, so we can definitely close this PR 👍