datadog-agent
Stats underflow fixes
What does this PR do?
Reduce/eliminate stats "underflows" -- cases where we get a lower value for some monotonic stats for a connection than the ones we stored in client state.
There seem to be two main causes of underflows:
- eBPF map iteration reporting a connection multiple times for the same key. This is explicitly warned about here: https://github.com/cilium/ebpf/blob/master/map.go#L1336. We have no synchronization in the eBPF layer or in userspace when iterating/modifying the connection stats map, which is likely the cause of these "duplicate" connections. We do not handle this case at all when iterating over the connections; we just replace the existing connection with the "latest" one in our map by key. This can cause the code in `mergeConnections` to behave unexpectedly and produce perceived underflows. A sketch of this failure mode follows the list.
- a race between the closed-connection and stats-update paths in eBPF for a connection. A connection removed by the closed-connection path can be recreated by the stats-update path, since both can run concurrently. When this happens the stats reset, and we cannot distinguish between the old and new stats in userspace.
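To make the failure mode concrete, here is a minimal Go sketch (type and field names are hypothetical, not the agent's actual ones) of how a recreated map entry turns into a perceived underflow when userspace computes a monotonic delta:

```go
package main

import "fmt"

// connStats stands in for the monotonic counters read from the eBPF map
// (names are illustrative, not the agent's actual types).
type connStats struct {
	sentBytes uint64
}

func main() {
	// Last value reported to this client, saved in client state.
	last := connStats{sentBytes: 5000}

	// The stats-update path recreated the entry after the closed-connection
	// path deleted it, so the kernel counter restarted near zero.
	current := connStats{sentBytes: 300}

	// A naive monotonic delta wraps around instead of flagging the reset.
	delta := current.sentBytes - last.sentBytes
	fmt.Println(delta) // enormous value from uint64 wraparound, not 300
}
```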
To fix underflows, this PR:
- introduces a new field, `cookie`, in the connection stats object in eBPF. This is a randomly generated u32, assigned every time we attempt to create a connection stats object in eBPF. It ensures that a recreated connection stats object carries a different cookie than before, so userspace can tell whether the stats for a connection were reset (see the sketch after this list).
- drops use of NAT info in connection keys. NAT info is not part of the key in eBPF, so we cannot guarantee in userspace that the NAT info attached to a connection is correct (i.e. belongs to that connection): there is a race between when the connection gets created and when we look up/receive its NAT info in userspace. We also set the NAT info to `nil` for a connection if we determine that two connections with the same key in the same connection payload have different NAT info.
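A rough sketch of how the cookie check could work in the userspace merge path, assuming hypothetical type and field names. Because the cookie is regenerated on every creation attempt, a matching cookie means the current counters are a continuation of the values already in client state:

```go
package main

import "fmt"

// connStats is a hypothetical userspace mirror of the eBPF value, with the
// new cookie field added by this PR (field names are illustrative).
type connStats struct {
	cookie    uint32 // random u32 assigned when the eBPF object is created
	sentBytes uint64 // monotonic counter
}

// statsDelta computes the bytes sent since the last report. If the cookie
// changed, the kernel recreated the stats object, so the current counters
// are a fresh baseline rather than a continuation of the old ones.
func statsDelta(last, current connStats) uint64 {
	if last.cookie != current.cookie {
		return current.sentBytes // stats were reset in eBPF
	}
	return current.sentBytes - last.sentBytes
}

func main() {
	last := connStats{cookie: 0xdeadbeef, sentBytes: 5000}
	recreated := connStats{cookie: 0x1337c0de, sentBytes: 300}
	fmt.Println(statsDelta(last, recreated)) // 300, not an underflow
}
```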
In testing on the load testing cluster as well as staging, there is a marked decrease in traffic spikiness as well as in byte counts. On some clusters there is also an increase in N/A traffic, since we drop NAT info more aggressively now.
Remaining work
Something similar also needs to be done for TCP stats, which are stored separately in eBPF.
Motivation
Additional Notes
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Reviewer's Checklist
- [x] If known, an appropriate milestone has been selected; otherwise the `Triage` milestone is set.
- [ ] Use the `major_change` label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer facing change use a release note.
- [x] A release note has been added or the `changelog/no-changelog` label has been applied.
- [ ] Changed code has automated tests for its functionality.
- [x] Adequate QA/testing plan information is provided if the `qa/skip-qa` label is not applied.
- [x] At least one `team/..` label has been applied, indicating the team(s) that should QA this change.
- [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
- [ ] If applicable, the `need-change/operator` and `need-change/helm` labels have been applied.
- [ ] If applicable, the config template has been updated.
@hmahmood looks like `ByteKey` is dead code now
It seems like some of this could be solved by handling a closed connection later in the lifecycle, when it isn't possible for additional data to be sent/received. Have you looked into the destroy callbacks, like `tcp_v4_destroy_sock` or `inet_csk_destroy_sock`?
That is a good idea ... I made the changes.
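For illustration, a minimal sketch of attaching a probe at socket destruction with the cilium/ebpf library, as suggested above. The object file and program name are hypothetical, not the agent's actual ones:

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// "probe.o" and the program name below are hypothetical; the real
	// agent loads its probes through its own build pipeline.
	coll, err := ebpf.LoadCollection("probe.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach at socket destruction, after which no further data can be
	// sent or received on the connection.
	kp, err := link.Kprobe("tcp_v4_destroy_sock",
		coll.Programs["kprobe__tcp_v4_destroy_sock"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()
}
```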