
Stats underflow fixes

Open hmahmood opened this issue 2 years ago • 3 comments

What does this PR do?

Reduce/eliminate stats "underflows" -- cases where we report a lower value for a monotonic stat on a connection than the value previously stored in client state.

There seem to be two main causes of underflows:

  1. ebpf map iteration reporting a connection multiple times for the same key. This is explicitly warned about here: https://github.com/cilium/ebpf/blob/master/map.go#L1336. We have no synchronization, in either the ebpf layer or userspace, around iterating/modifying the connection stats map, which is likely what produces these "duplicate" connections. We do not handle this case at all when iterating over the connections; we just replace the existing connection in our map with the "latest" one for that key. This can cause the code in mergeConnections to behave unexpectedly and produce perceived underflows (see the sketch after this list).
  2. a race between the closed-connection path and the stats-update path in ebpf for a connection. A connection removed by the closed-connection path can be recreated by the stats-update path, since the two can run concurrently. When this happens the stats reset, and we cannot distinguish the old stats from the new ones in userspace.
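To illustrate cause 1, here is a minimal sketch (hypothetical connKey/connStats types and function names, not the PR's actual code) of how iterating the connection stats map with cilium/ebpf can yield the same key more than once under concurrent modification, and how collapsing entries by key silently keeps only the last snapshot seen:

```go
package connstats

import (
	"log"

	"github.com/cilium/ebpf"
)

// connKey and connStats stand in for the real connection tuple and stats
// value types; the field layouts here are illustrative only.
type connKey struct {
	SrcIP, DstIP     uint32
	SrcPort, DstPort uint16
}

type connStats struct {
	SentBytes, RecvBytes uint64
}

// dumpConnections walks the eBPF connection stats map. Because the map can
// be modified concurrently by the kernel-side programs, the iterator may
// yield the same key more than once; overwriting by key keeps only the
// latest snapshot, which can confuse a later merge step if the duplicate
// snapshots disagree.
func dumpConnections(m *ebpf.Map) map[connKey]connStats {
	conns := make(map[connKey]connStats)

	var (
		key   connKey
		stats connStats
	)
	it := m.Iterate()
	for it.Next(&key, &stats) {
		conns[key] = stats
	}
	if err := it.Err(); err != nil {
		log.Printf("error iterating connection stats map: %v", err)
	}
	return conns
}
```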

To fix underflows, this PR:

  1. introduces a new field, cookie, in the connection stats object in ebpf. This is a randomly generated u32, regenerated every time we attempt to create a connection stats object in ebpf. It guarantees that a recreated connection stats object carries a different cookie than before, so userspace can tell whether the stats for a connection were reset (see the sketch after this list).
  2. drops the use of NAT info in connection keys. NAT info is not part of the key in ebpf, so we cannot guarantee in userspace that the NAT info attached to a connection is correct (i.e. actually belongs to that connection); there is a race between when the connection is created and when we look up/receive its NAT info in userspace. We also set NAT info to nil for a connection if we find two connections with the same key in the same connection payload that have different NAT info.
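As a rough illustration of how the cookie would be used on the userspace side (hypothetical types and function names, not the PR's actual code), a cookie mismatch marks the kernel-side entry as recreated, so the new counters are treated as a fresh baseline instead of being diffed against the old object:

```go
package connstats

// statsWithCookie is an illustrative stand-in for the per-connection stats
// snapshot kept in client state, carrying the randomly generated cookie.
type statsWithCookie struct {
	Cookie    uint32
	SentBytes uint64
}

// sentBytesDelta returns the increase in sent bytes since the previous
// snapshot, treating a cookie change as a stats reset rather than an
// underflow.
func sentBytesDelta(prev, cur statsWithCookie) uint64 {
	if prev.Cookie != cur.Cookie {
		// The connection stats object was recreated in ebpf: report the
		// new counters as-is instead of computing a negative difference
		// against the old object.
		return cur.SentBytes
	}
	if cur.SentBytes < prev.SentBytes {
		// Defensive guard: without it a duplicate or stale snapshot
		// would wrap around when subtracted.
		return 0
	}
	return cur.SentBytes - prev.SentBytes
}
```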

In testing on the load-testing cluster as well as staging, there is a marked decrease in traffic spikiness and in byte counts. On some clusters there is also an increase in N/A traffic, since we now drop NAT info more aggressively.

Remaining work

Something similar also needs to be done for TCP stats, which are stored separately in ebpf.

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Reviewer's Checklist

  • [x] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change either has a major impact on the code base, impacts multiple teams, or changes important, well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • [x] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [x] Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • [x] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the config template has been updated.

hmahmood avatar Jul 06 '22 18:07 hmahmood

@hmahmood looks like ByteKey is dead code now

p-lambert avatar Jul 15 '22 15:07 p-lambert

It seems like some of this could be solved by handling a closed connection later in the lifecycle when it isn't possible for additional data to be sent/recv. Have you looked into the destroy callbacks, like tcp_v4_destroy_sock or inet_csk_destroy_sock?

brycekahle avatar Jul 27 '22 22:07 brycekahle

It seems like some of this could be solved by handling a closed connection later in the lifecycle when it isn't possible for additional data to be sent/recv. Have you looked into the destroy callbacks, like tcp_v4_destroy_sock or inet_csk_destroy_sock?

That is a good idea ... I made the changes.

hmahmood avatar Jul 28 '22 20:07 hmahmood
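For context on the destroy-callback suggestion above, a minimal sketch (hypothetical, not the code from this PR) of attaching an already-loaded eBPF program to tcp_v4_destroy_sock with cilium/ebpf might look like this:

```go
package connstats

import (
	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachDestroyHook hooks tcp_v4_destroy_sock, which runs late enough in
// the socket lifecycle that no further data can be sent or received, so the
// close path no longer races with later stats updates. Loading the program
// itself is omitted; tcpDestroyProg is assumed to be already loaded.
func attachDestroyHook(tcpDestroyProg *ebpf.Program) (link.Link, error) {
	return link.Kprobe("tcp_v4_destroy_sock", tcpDestroyProg, nil)
}
```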