datadog-agent icon indicating copy to clipboard operation
datadog-agent copied to clipboard

Limit number of kernel header download attempts

Open ISauve opened this issue 2 years ago • 1 comments

What does this PR do?

Limits the number of times kernel header downloading can be attempted to a maximum of 2 times.

Motivation

Every time an eBPF asset is runtime compiled, kernel headers need to be fetched. Currently, in the event that local headers can't be found and kernel header downloading fails, all subsequent attempts to fetch kernel headers will re-attempt the download. If all eBPF runtime compilable assets are enabled, this means we can attempt to download kernel headers up to 7 times. Since kernel header downloading can be a resource intensive process, we should limit the number of times which kernel header downloading can be attempted.

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

  1. Enable all runtime compilable assets and debug logging:
system_probe_config:
  log_level: debug
  enable_runtime_compiler: true
  enable_oom_kill: true
  enable_tcp_queue_length: true
network_config:
  enabled: true
service_monitoring_config:
  enabled: true
runtime_security_config:
  enabled: true
  runtime_compilation:
    enabled: true
    compiled_constants_enabled: true
  1. a) If you have previously performed runtime compilation, remove any previously compiled assets:
sudo rm -rf /var/tmp/datadog-agent/system-probe/build
  1. Run the system-probe, and verify in the logs that headers were fetched during runtime compilation of the first asset, but not for subsequent assets:
| starting runtime compilation of tracer.c
| beginning search for kernel headers
| found valid kernel headers at [some location]
| successfully compiled runtime version of tracer.c
...
| starting runtime compilation of [conntrack.c, http.c, tcp-queue-length.c, oom-kill.c, constant_fetcher.c, runtime-security.c]
| kernel headers requested: returning result of previous search
| successfully compiled runtime version of [conntrack.c, http.c, tcp-queue-length.c, oom-kill.c, constant_fetcher.c, runtime-security.c]
  1. In a separate window, query the connections endpoint and verify that kernelHeaderFetchResult indicates the correct result:
sudo curl -s --unix-socket /opt/datadog-agent/run/sysprobe.sock http://unix/network_tracer/connections?client_id=1 | jq .kernelHeaderFetchResult
  1. Stop the system-probe. Remove the compiled assets:
sudo rm -rf /var/tmp/datadog-agent/system-probe/build
  1. Force kernel header downloading to occur by adding the following to your configuration:
system_probe_config:
  kernel_header_dirs: /dev/null
  1. Turn off your wi-fi
  2. Run the system-probe and verify in the logs that kernel header downloading was attempted and retried exactly once during compilation of the first asset, but not for subsequent assets:
| starting runtime compilation of tracer.c
| beginning search for kernel headers
| Downloading kernel headers for [host details]
| unable to download kernel headers: failed to download kernel headers: [some error]. Waiting 5 seconds and retrying kernel header download.
| Downloading kernel headers for [host details]
| Unable to find kernel headers: [some error]
| error compiling network tracer, falling back to pre-compiled: unable to find kernel headers
... 
| starting runtime compilation of [conntrack.c, http.c, tcp-queue-length.c, oom-kill.c, constant_fetcher.c, runtime-security.c]
| kernel headers requested: returning result of previous search
| unable to compile [asset]: unable to find kernel headers
  1. In a separate window, query the connections endpoint again and verify that kernelHeaderFetchResult and .compilationTelemetryByAsset indicate the correct results:
sudo curl -s --unix-socket /opt/datadog-agent/run/sysprobe.sock http://unix/network_tracer/connections?client_id=1 | jq .kernelHeaderFetchResult,.compilationTelemetryByAsset

Reviewer's Checklist

  • [x] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change either has a major impact on the code base, is impacting multiple teams or is changing important well-established internals of the Agent. This label will be use during QA to make sure each team pay extra attention to the changed behavior. For any customer facing change use a releasenote.
  • [x] A release note has been added or the changelog/no-changelog label has been applied.
  • [x] Changed code has automated tests for its functionality.
  • [x] Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • [x] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label, indicating the lowest Kubernetes version compatible with this feature.
  • [ ] If applicable, the config template has been updated.

ISauve avatar Sep 16 '22 19:09 ISauve

There seem to be other changes here other than the retry. Can we split this up into the retry changes and the other ones?

hmahmood avatar Sep 19 '22 15:09 hmahmood

@hmahmood This branch was based on the isabelle.sauve/refactor-RC branch, and I think the changes from that branch accidentally got pulled into this PR. Now that #13527 is merged & this branch is rebased on main, it contains only the changes related to limiting the number of times kernel header downloading is attempted.

ISauve avatar Oct 17 '22 19:10 ISauve