[system-probe] report client-side TCP failed connections

Open Yumasi opened this issue 3 years ago • 0 comments

What does this PR do?

This PR adds detection and reporting of client-side failed TCP connections. Failures are aggregated per connection tuple, with a counter of the number of failed attempts. This feature depends on a payload change (https://github.com/DataDog/agent-payload/pull/172), and pipelines for this PR will fail until that change is merged.

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

  • Build system-probe & start it.
  • In another shell, generate a failing connection. One way to do this is by trying to connect to a closed port:
nc localhost 10000

The connection should promptly fail.

Another case, which takes longer to test, is making a connection time out. This can be done with an iptables rule that drops the outgoing packets:

sudo iptables -A OUTPUT -p tcp -d 127.0.0.1 --dport 10000 -j DROP
nc localhost 10000

After around 2 minutes, nc should fail, and you can proceed to the next step.

  • Poll system-probe for connections and check that the failed connection appears in the response:
sudo curl -s --unix-socket /opt/datadog-agent/run/sysprobe.sock http://unix/network_tracer/connections|jq .failedConns

The response should look like this:

[
  {
    "pid": 110443,
    "laddr": {
      "ip": "127.0.0.1",
      "port": 57308,
      "containerId": "",
      "hostId": "0",
      "hostName": ""
    },
    "raddr": {
      "ip": "127.0.0.1",
      "port": 10000,
      "containerId": "",
      "hostId": "0",
      "hostName": ""
    },
    "family": "v4",
    "type": "tcp",
    "direction": "outgoing",
    "netNS": 4026531840,
    "failureCount": "1"
  }
]
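
To make the check above less manual, here is a minimal Python sketch that scans a saved connections response for failed connections to a given remote port. It is not part of this PR. The helper name `failed_conns_to_port` is hypothetical, and the sketch assumes the full response is a JSON object with a `failedConns` array shaped like the sample above, which may be absent when nothing has failed.

```python
import json


def failed_conns_to_port(response_text, remote_port):
    """Return (pid, local_port, failure_count) for each failed connection to remote_port.

    Hypothetical helper: assumes the response is a JSON object whose
    "failedConns" entries look like the sample output above.
    """
    payload = json.loads(response_text)
    matches = []
    # Assumption: "failedConns" may be missing or null when nothing failed.
    for conn in payload.get("failedConns") or []:
        if conn.get("raddr", {}).get("port") != remote_port:
            continue
        # failureCount is a string in the sample output, so convert it to int.
        matches.append(
            (conn["pid"], conn["laddr"]["port"], int(conn["failureCount"]))
        )
    return matches


# Trimmed copy of the sample response shown above.
sample = json.dumps(
    {
        "failedConns": [
            {
                "pid": 110443,
                "laddr": {"ip": "127.0.0.1", "port": 57308},
                "raddr": {"ip": "127.0.0.1", "port": 10000},
                "failureCount": "1",
            }
        ]
    }
)

print(failed_conns_to_port(sample, 10000))  # prints [(110443, 57308, 1)]
```

To use it against a live system-probe, save the output of the curl command above (without the jq filter) to a file and pass the file contents as `response_text`.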

Reviewer's Checklist

  • [ ] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change has a major impact on the code base, impacts multiple teams, or changes important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • [ ] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [ ] Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • [ ] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label, indicating the lowest Kubernetes version compatible with this feature, has been applied.
  • [ ] If applicable, the config template has been updated.

Yumasi, Sep 15 '22 12:09