[system-probe] Add process monitoring and USM tagging

Open hmahmood opened this issue 2 years ago • 1 comments

What does this PR do?

  • Adds USM tags for connection processes. Only the DD_ENV, DD_SERVICE, and DD_VERSION environment variables are added, as env:, service:, and version: tags respectively.
  • runtime_security_config.event_monitoring.enabled has been replaced with two new configs, event_monitoring_config.network_process.enabled and event_monitoring_config.process.enabled. Both are false by default, and enabling either one turns on the runtime security module.
  • Adds a new config, event_monitoring_config.network_process.max_tracked_processes, set to 1024 by default. This is the size of the process LRU cache described below.
  • Process data from the runtime security module is stored in a new LRU cache. Only processes that have the USM environment variables (see above) or a container ID are stored.
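
To make the tagging and caching rules above concrete, here is a minimal Python sketch. It is not the Agent's implementation (the Agent is written in Go), and the names usm_tags, ProcessCache, add, and get are hypothetical, chosen only for illustration. It assumes the cache is keyed by PID and evicts the least recently used entry once max_tracked_processes is exceeded.

```python
from collections import OrderedDict

# Mapping described in the PR: only these three variables become tags.
USM_ENV_TO_TAG = {"DD_ENV": "env", "DD_SERVICE": "service", "DD_VERSION": "version"}


def usm_tags(envs):
    """Return USM tags for the recognised variables; all others are ignored."""
    return [f"{tag}:{envs[var]}" for var, tag in USM_ENV_TO_TAG.items() if var in envs]


class ProcessCache:
    """Hypothetical bounded LRU cache of process data, keyed by PID (assumption)."""

    def __init__(self, max_tracked_processes=1024):
        self._max = max_tracked_processes
        self._entries = OrderedDict()

    def add(self, pid, envs, container_id=""):
        """Store a process only if it has USM tags or a container ID."""
        tags = usm_tags(envs)
        if not tags and not container_id:
            return False  # nothing worth tracking for this process
        self._entries[pid] = {"tags": tags, "container_id": container_id}
        self._entries.move_to_end(pid)  # mark as most recently used
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return True

    def get(self, pid):
        """Look up a process and refresh its position in the LRU order."""
        entry = self._entries.get(pid)
        if entry is not None:
            self._entries.move_to_end(pid)
        return entry
```

With this sketch, a process started with only FOO=bar and no container ID is never stored, while one started with DD_ENV set is stored and carries a single env: tag.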

Motivation

Additional Notes

~33% increase in CPU, mostly coming from the runtime security module.

[Figure: system-probe avg k8s CPU Usage by namespace (autosmoothed)]

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

  • enable network tracer process event monitoring by setting the following in the system-probe config:
event_monitoring_config:
  network_process:
    enabled: true
  • run system-probe with the above config
  • in a separate console/terminal on the same machine, run
DD_ENV=env DD_SERVICE=service DD_VERSION=version FOO=bar curl https://www.google.com
  • query the system-probe for connections with curl --unix-socket /opt/datadog-agent/run/sysprobe.sock http://unix/connections (the unix socket path may differ in your environment; check the system-probe log file for the path). In the returned JSON, the tags field should be set to:
  "tags": [
    "env:env",
    "version:version",
    "service:service"
  ],

On the connection entry for google.com, you should see the following (the integers appear to be indices into the tags list above):

      "tags": [
        0,
        1,
        2
      ],
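
When checking the output by hand, it can help to expand the integer tags on each connection back into strings. The sketch below assumes that the integers are indices into the top-level tags list and that the connection entries live under a key named conns; both are assumptions about the JSON layout, not something this PR states, and resolve_connection_tags is a hypothetical helper.

```python
import json


def resolve_connection_tags(payload):
    """Expand each connection's tag indices into tag strings.

    Assumes payload["tags"] is the shared tag list and payload["conns"]
    (hypothetical key) holds the connection entries.
    """
    tag_table = payload.get("tags", [])
    return [
        [tag_table[i] for i in conn.get("tags", [])]
        for conn in payload.get("conns", [])
    ]


# Sample data shaped like the QA output above (illustrative only).
sample = json.loads(
    '{"tags": ["env:env", "version:version", "service:service"],'
    ' "conns": [{"tags": [0, 1, 2]}, {"tags": []}]}'
)
resolved = resolve_connection_tags(sample)
```

To use it against a live system-probe, save the curl response to a file and pass the result of json.load on that file to the function.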

Reviewer's Checklist

  • [x] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change either has a major impact on the code base, impacts multiple teams, or changes important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • [x] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [x] Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • [x] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the config template has been updated.

hmahmood avatar Jun 03 '22 23:06 hmahmood

For visibility, noting that this change requires kernel version 4.10 because CWS uses LRU maps.

brycekahle avatar Jul 21 '22 15:07 brycekahle