
First cut of agent telemetry

Open iglendd opened this issue 1 year ago • 8 comments

What does this PR do?

Adds the ability to send and query cross-org telemetry for a few internal Agent metrics. Specifically:

  • Sends a few internal and built-in checks., logs., transactions., and dogstatsd. metrics, as well as a small part of the Agent status
  • Disabled by default.
  • Default schedule: the first payload is sent 30 seconds after Agent start and every 15 minutes after that
  • Sent to the org's instrumentation-telemetry-intake endpoint

Motivation

The Agent currently lacks its own observability for insight, optimization, and troubleshooting.

Additional Notes

  • Agent check telemetry (telemetry.check) is activated when agent_telemetry.enabled is set.
  • We need to make sure that Agent memory usage does not grow and RSS remains stable (e.g., in cases where a large number of metrics is collected and then discarded by the Gather() method), as pictured in the sketch below.
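
The memory concern can be pictured with a minimal Go sketch built on the upstream Prometheus client; the Agent wraps its own telemetry package, so the helper name and the allowlist below are illustrative assumptions rather than code from this PR:

```go
package agenttelemetry

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// gatherAllowed is a hypothetical helper, not code from this PR: it
// gathers all registered metric families but retains only an
// allowlisted subset.
func gatherAllowed(g prometheus.Gatherer, allowed map[string]bool) ([]*dto.MetricFamily, error) {
	families, err := g.Gather() // returns every registered metric family
	if err != nil {
		return nil, err
	}
	kept := make([]*dto.MetricFamily, 0, len(allowed))
	for _, mf := range families {
		if allowed[mf.GetName()] {
			kept = append(kept, mf)
		}
	}
	// Families not copied into kept are unreferenced past this point,
	// so they can be garbage-collected and RSS stays stable between runs.
	return kept, nil
}
```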

Possible Drawbacks / Trade-offs

Additional cost for Datadog to process and store the extra telemetry emitted by many Agents.

Describe how to test/QA your changes

  • Set agent_telemetry.enabled: true (see the config sketch after this list)
  • Restart the Agent
  • Wait a minute
  • Paste source:agent-telemetry into the search bar; you should see a handful of messages with Agent status and Agent metric content, each containing a JSON payload.
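
For the first step, here is a minimal datadog.yaml sketch; only the agent_telemetry.enabled setting is confirmed by this PR, and mapping the dotted key to a nested YAML block is an assumption based on how Agent settings are usually written:

```yaml
# datadog.yaml (sketch; only agent_telemetry.enabled is confirmed by
# this PR, and it is off by default)
agent_telemetry:
  enabled: true
# With defaults, the first payload goes out ~30 seconds after Agent
# start and every 15 minutes after that.
```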

Reviewer's Checklist

  • [ ] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change either has a major impact on the code base, impacts multiple teams, or changes important, well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • [ ] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [ ] Adequate QA/testing plan information is provided, except if the qa/skip-qa label is applied along with either the qa/done or qa/no-code-change label.
  • [ ] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label has been applied, indicating the lowest Kubernetes version compatible with this feature.
  • [ ] If applicable, the config template has been updated.

iglendd avatar Feb 12 '24 13:02 iglendd

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 2f712163-16c4-43eb-89a1-4816a6136597
Baseline: 2f1f3a8ca2b53439343d7fb27a1fed623841f8c8
Comparison: 0f04a9aaf96cd105e968865b7b1b48561a04e9c2
Total CPUs: 7

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI |
| --- | --- | --- | --- | --- |
| ➖ | file_to_blackhole | % cpu utilization | -0.36 | [-6.89, +6.17] |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI |
| --- | --- | --- | --- | --- |
| ➖ | file_tree | memory utilization | +2.53 | [+2.43, +2.63] |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +2.36 | [+0.92, +3.79] |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.40 | [+1.35, +1.45] |
| ➖ | process_agent_real_time_mode | memory utilization | +0.47 | [+0.42, +0.52] |
| ➖ | process_agent_standard_check | memory utilization | +0.31 | [+0.25, +0.36] |
| ➖ | process_agent_standard_check_with_stats | memory utilization | +0.26 | [+0.21, +0.31] |
| ➖ | idle | memory utilization | +0.25 | [+0.20, +0.30] |
| ➖ | trace_agent_json | ingress throughput | +0.01 | [-0.03, +0.04] |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | trace_agent_msgpack | ingress throughput | -0.02 | [-0.03, -0.01] |
| ➖ | otel_to_otel_logs | ingress throughput | -0.17 | [-0.77, +0.44] |
| ➖ | file_to_blackhole | % cpu utilization | -0.36 | [-6.89, +6.17] |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true (the sketch after this list restates the rule in code):

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
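
Restated as a hedged Go sketch (the types and names are illustrative assumptions, not the detector's actual code), the three criteria combine as follows:

```go
package regressionsketch

// Experiment mirrors the fields used by the three criteria above.
type Experiment struct {
	DeltaMeanPct  float64 // estimated Δ mean %
	CILow, CIHigh float64 // bounds of the 90.00% "Δ mean % CI"
	Erratic       bool    // settings contain erratic: true
}

// IsRegression is true only when all three criteria hold.
func IsRegression(e Experiment, tolerancePct float64) bool {
	bigEnough := e.DeltaMeanPct >= tolerancePct || e.DeltaMeanPct <= -tolerancePct // criterion 1
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0                                  // criterion 2
	return bigEnough && ciExcludesZero && !e.Erratic                               // criterion 3
}
```

For instance, file_tree above (Δ mean % = +2.53, CI [+2.43, +2.63]) has a confidence interval that excludes zero but fails criterion 1 because |+2.53| < 5.00, so it is not flagged.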

pr-commenter[bot] avatar Feb 13 '24 01:02 pr-commenter[bot]

@buraizu, regarding your comment

First,

Thanks for the PR, I've left some minor updates to the release note

I cannot find it.

Second, we are thinking of changing it completely. Please hold off until further notice.

iglendd avatar Feb 15 '24 14:02 iglendd

@GustavoCaso and @vickenty please re-review when you have a chance. I am not planning to merge before the Privacy review is completed, but I need to run tests on staging. I would ideally like to rebase soon, but I am afraid it will disrupt the reviews and the positions of their comments.

iglendd avatar Mar 06 '24 02:03 iglendd

@GustavoCaso I have addressed most of your review comments (please check them out). @vickenty, can you finish your review, please? I also have a question: the PR is falling behind. Do you prefer that I rebase or merge main in (there is one conflict to resolve too)? Thank you.

iglendd avatar Mar 08 '24 14:03 iglendd

I also have a question: the PR is falling behind. Do you prefer that I rebase or merge main in (there is one conflict to resolve too)? Thank you.

I think merging works a bit better than rebasing once review has started, so if possible and practical, please merge.

vickenty avatar Mar 08 '24 14:03 vickenty

I also have a question: the PR is falling behind. Do you prefer that I rebase or merge main in (there is one conflict to resolve too)? Thank you.

I think merging works a bit better than rebasing once review has started, so if possible and practical, please merge.

Of course, I will merge. Thank you.

iglendd avatar Mar 08 '24 14:03 iglendd

@iglendd I would like to draw some attention to my previous comment https://github.com/DataDog/datadog-agent/pull/22771#discussion_r1514260175

I just realized one thing about allowing arbitrary JQ functionality to run against the status JSON output: once customers start to use that functionality, we won't be able to modify the status JSON output without fear of breaking their extra custom telemetry metrics.

I really want to surface that to see if we could consider my suggestion of shipping the first experimental version with support for extra

I'm more than happy to chat about this via Slack, Zoom or whatever works best for you
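
To make the coupling concrete, here is a hypothetical extra-metric configuration running a jq expression against the status JSON; every key, name, and path below is invented for illustration and is not a real Agent option:

```yaml
# Hypothetical config, not a real Agent setting. The jq expression
# hard-codes the status JSON's shape, so renaming runnerStats (or any
# field under it) in a later release would silently break this metric.
agent_telemetry:
  extra_metrics:
    - name: running_checks
      jq: '.runnerStats.Workers.Count'
```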

GustavoCaso avatar Mar 08 '24 16:03 GustavoCaso

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=32545955 --os-family=ubuntu

pr-commenter[bot] avatar Mar 20 '24 02:03 pr-commenter[bot]

Regression Detector

Regression Detector Results

Run ID: cc14be03-6c0d-445b-9b89-58b659a221be
Baseline: 6b8a81560496b9745bebd423039529d46ae9116d
Comparison: 7545f9be2be320279af9758bf29e673e608db231

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

| perf | experiment | goal | Δ mean % | Δ mean % CI |
| --- | --- | --- | --- | --- |
| ➖ | file_to_blackhole | % cpu utilization | +1.07 | [-4.89, +7.04] |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI |
| --- | --- | --- | --- | --- |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.47 | [+1.38, +1.56] |
| ➖ | file_to_blackhole | % cpu utilization | +1.07 | [-4.89, +7.04] |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.71 | [-2.25, +3.67] |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | +0.51 | [-4.42, +5.44] |
| ➖ | basic_py_check | % cpu utilization | +0.35 | [-2.21, +2.91] |
| ➖ | process_agent_standard_check | memory utilization | +0.18 | [+0.13, +0.23] |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.04 | [+0.01, +0.07] |
| ➖ | process_agent_real_time_mode | memory utilization | +0.04 | [-0.01, +0.08] |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.02 | [-0.19, +0.22] |
| ➖ | trace_agent_json | ingress throughput | +0.01 | [-0.01, +0.04] |
| ➖ | trace_agent_msgpack | ingress throughput | -0.01 | [-0.02, -0.00] |
| ➖ | idle | memory utilization | -0.23 | [-0.27, -0.18] |
| ➖ | process_agent_standard_check_with_stats | memory utilization | -0.33 | [-0.37, -0.29] |
| ➖ | otel_to_otel_logs | ingress throughput | -0.46 | [-0.91, -0.01] |
| ➖ | file_tree | memory utilization | -0.69 | [-0.81, -0.57] |

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

pr-commenter[bot] avatar Mar 20 '24 02:03 pr-commenter[bot]

@iglendd I would like to draw some attention to my previous comment #22771 (comment)

I just realized one thing about allowing arbitrary JQ functionality to run against the status JSON output: once customers start to use that functionality, we won't be able to modify the status JSON output without fear of breaking their extra custom telemetry metrics.

I really want to surface that to see if we could consider my suggestion of shipping the first experimental version with support for extra

I'm more than happy to chat about this via Slack, Zoom or whatever works best for you

Chatted OOTB

iglendd avatar Mar 20 '24 22:03 iglendd

Knowing that the feature is experimental and subject to change, I would like us to reconsider, in a future iteration, the current pattern built on the status component, and instead have each component register its own telemetry metrics, ensuring decoupling between components. Today the telemetry component has a strong dependency on the status component: adding new information for agent telemetry means adding it to a component's status provider. If we keep that in mind and in future iterations are able to remove that dependency, I think we would be in a better position to evolve the telemetry component in isolation.

Definitely. I am also planning to start using our DefaultForwarder to replace a portion of the sender (cc @hush-hush). A rough sketch of the decoupling idea appears at the end of this comment.

One more thing that would be interesting to add to the PR description is a more detailed explanation of the QA steps, e.g. "Paste in a search bar source:agent-telemetry": which search bar? The more detailed the information on how to do the QA, the better.

I have added a few more items to the steps; I hope it is a bit better.
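
A rough Go sketch of the decoupled direction discussed above; the interface and names are hypothetical and do not reflect the Agent's actual component API:

```go
package telemetrysketch

// Provider is a hypothetical per-component telemetry contract.
type Provider interface {
	Name() string
	Payload() (map[string]interface{}, error)
}

// Registry collects payloads directly from registered components,
// with no dependency on the status component.
type Registry struct {
	providers []Provider
}

func (r *Registry) Register(p Provider) {
	r.providers = append(r.providers, p)
}

// Collect gathers every component's payload, skipping ones that fail.
func (r *Registry) Collect() map[string]map[string]interface{} {
	out := make(map[string]map[string]interface{}, len(r.providers))
	for _, p := range r.providers {
		if payload, err := p.Payload(); err == nil {
			out[p.Name()] = payload
		}
	}
	return out
}
```

With something like this, adding new telemetry for a component would mean implementing Provider on that component rather than extending its status provider.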

iglendd avatar Mar 25 '24 13:03 iglendd

/merge

iglendd avatar Apr 20 '24 00:04 iglendd

🚂 MergeQueue

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or approvals are given. Note: if you pushed new commits since the last approval, you may need additional approval. You can remove it from the waiting list with the /remove command.

Use /merge -c to cancel this operation!

dd-devflow[bot] avatar Apr 20 '24 00:04 dd-devflow[bot]

/merge

iglendd avatar Apr 20 '24 00:04 iglendd

🚂 MergeQueue

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or approvals are given. Note: if you pushed new commits since the last approval, you may need additional approval. You can remove it from the waiting list with the /remove command.

Use /merge -c to cancel this operation!

dd-devflow[bot] avatar Apr 20 '24 00:04 dd-devflow[bot]

⚠️ MergeQueue

This merge request was unqueued

If you need support, contact us on Slack #devflow!

dd-devflow[bot] avatar Apr 20 '24 04:04 dd-devflow[bot]