First cut of agent telemetry
What does this PR do?
Adds the ability to send and query cross-org telemetry for a few internal Agent metrics. Specifically:
- Sends a few internal and built-in `checks.`, `logs.`, `transactions.`, and `dogstatsd.` metrics, as well as a small part of the Agent status.
- Disabled by default.
- By default, the first payload is sent 30 seconds after Agent start, and subsequent payloads every 15 minutes after that (sketched below).
- Sent to the org's instrumentation-telemetry-intake.
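For illustration, here is a minimal Go sketch of the default schedule described above (`runTelemetrySchedule` and the channel wiring are illustrative names, not the actual agent code): one payload after an initial 30-second delay, then one per 15-minute tick.

```go
// Sketch only; the real agent ties this into its component lifecycle.
package main

import (
	"fmt"
	"time"
)

func runTelemetrySchedule(send func(), stop <-chan struct{}) {
	initial := time.NewTimer(30 * time.Second)
	defer initial.Stop()
	select {
	case <-initial.C:
		send() // first payload ~30 seconds after agent start
	case <-stop:
		return
	}

	ticker := time.NewTicker(15 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			send() // one payload every 15 minutes thereafter
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runTelemetrySchedule(func() { fmt.Println("send agent telemetry payload") }, stop)
	time.Sleep(31 * time.Second) // demo only
	close(stop)
}
```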
Motivation
The Agent currently lacks its own observability for insight, optimization, and troubleshooting.
Additional Notes
- Agent check telemetry (`telemetry.check`) is activated when `agent_telemetry.enabled` is set.
- We need to make sure that Agent memory usage does not grow and RSS remains stable (e.g., in cases where a large number of metrics is collected by the Gather() method and then thrown away); see the sketch below.
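A rough sketch of the pattern behind that concern, assuming a Prometheus-style Gather(); `gatherFiltered` and the prefix allow-list are illustrative, not this PR's actual code. Copying the kept families into a fresh slice lets everything that was gathered and discarded become garbage-collectable immediately, which is what keeps RSS stable.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// gatherFiltered gathers all metric families but keeps only those whose
// names match an allow-listed prefix.
func gatherFiltered(allowedPrefixes []string) ([]*dto.MetricFamily, error) {
	families, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		return nil, err
	}
	// Fresh slice: dropped families are not retained by a shared backing array.
	kept := make([]*dto.MetricFamily, 0, len(families))
	for _, mf := range families {
		for _, prefix := range allowedPrefixes {
			if strings.HasPrefix(mf.GetName(), prefix) {
				kept = append(kept, mf)
				break
			}
		}
	}
	return kept, nil
}

func main() {
	kept, err := gatherFiltered([]string{"checks.", "logs.", "transactions.", "dogstatsd."})
	if err != nil {
		panic(err)
	}
	fmt.Printf("kept %d metric families\n", len(kept))
}
```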
Possible Drawbacks / Trade-offs
Additional cost for Datadog to process and store the extra telemetry coming from many Agents.
Describe how to test/QA your changes
- Set `agent_telemetry.enabled: true` (a minimal config fragment is sketched below this list).
- Restart the Agent.
- Wait a minute.
- Paste `source:agent-telemetry` into a search bar; you should see a handful of messages with `Agent status` and `Agent metric` content, which contain a JSON payload.
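For reference, a minimal `datadog.yaml` fragment for the first step might look like this (the nested layout is an assumption; only the `agent_telemetry.enabled` key comes from this PR):

```yaml
# Hypothetical minimal fragment; only agent_telemetry.enabled is from the PR.
agent_telemetry:
  enabled: true
```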
Reviewer's Checklist
- [ ] If known, an appropriate milestone has been selected; otherwise the `Triage` milestone is set.
- [ ] Use the `major_change` label if your change either has a major impact on the code base, impacts multiple teams, or changes important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a release note.
- [ ] A release note has been added or the `changelog/no-changelog` label has been applied.
- [ ] Changed code has automated tests for its functionality.
- [ ] Adequate QA/testing plan information is provided, except if the `qa/skip-qa` label (together with the required `qa/done` or `qa/no-code-change` label) is applied.
- [ ] At least one `team/..` label has been applied, indicating the team(s) that should QA this change.
- [ ] If applicable, the docs team has been notified or an issue has been opened on the documentation repo.
- [ ] If applicable, the `need-change/operator` and `need-change/helm` labels have been applied.
- [ ] If applicable, the `k8s/<min-version>` label has been applied, indicating the lowest Kubernetes version compatible with this feature.
- [ ] If applicable, the config template has been updated.
Bloop Bleep... Dogbot Here
Regression Detector Results
Run ID: 2f712163-16c4-43eb-89a1-4816a6136597
Baseline: 2f1f3a8ca2b53439343d7fb27a1fed623841f8c8
Comparison: 0f04a9aaf96cd105e968865b7b1b48561a04e9c2
Total CPUs: 7
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
No significant changes in experiment optimization goals
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
Experiments ignored for regressions
Regressions in experiments with settings containing `erratic: true` are ignored.
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ➖ | file_to_blackhole | % cpu utilization | -0.36 | [-6.89, +6.17] |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ➖ | file_tree | memory utilization | +2.53 | [+2.43, +2.63] |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +2.36 | [+0.92, +3.79] |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.40 | [+1.35, +1.45] |
| ➖ | process_agent_real_time_mode | memory utilization | +0.47 | [+0.42, +0.52] |
| ➖ | process_agent_standard_check | memory utilization | +0.31 | [+0.25, +0.36] |
| ➖ | process_agent_standard_check_with_stats | memory utilization | +0.26 | [+0.21, +0.31] |
| ➖ | idle | memory utilization | +0.25 | [+0.20, +0.30] |
| ➖ | trace_agent_json | ingress throughput | +0.01 | [-0.03, +0.04] |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.00, +0.00] |
| ➖ | trace_agent_msgpack | ingress throughput | -0.02 | [-0.03, -0.01] |
| ➖ | otel_to_otel_logs | ingress throughput | -0.17 | [-0.77, +0.44] |
| ➖ | file_to_blackhole | % cpu utilization | -0.36 | [-6.89, +6.17] |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
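Expressed as code, the decision rule amounts to something like the following sketch (names are mine, not the Regression Detector's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// isRegression flags a change only when the effect size clears the tolerance,
// the 90% confidence interval excludes zero, and the rig is not erratic.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	bigEnough := math.Abs(deltaMeanPct) >= 5.0 // effect size tolerance
	ciExcludesZero := ciLow > 0 || ciHigh < 0  // CI does not contain zero
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// file_to_blackhole from the table above: small effect, CI straddles zero.
	fmt.Println(isRegression(-0.36, -6.89, 6.17, false)) // false
}
```

By this rule, the `file_to_blackhole` row above is not a regression: its effect size (-0.36%) is well under the 5.00% tolerance and its confidence interval straddles zero.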
@buraizu, regarding your comment:
> Thanks for the PR, I've left some minor updates to the release note

First, I cannot find it. Second, we are thinking of changing it completely. Please hold off until further notice.
@GustavoCaso and @vickenty, please re-review when you have a chance. I am not planning to merge before the Privacy review is completed, but I need to run tests on staging. I would ideally like to rebase soon, but I am afraid it will disrupt the reviews and their comment positions.
@GustavoCaso, I have addressed most of your review comments (please check them out). @vickenty, can you finish your review, please? I also have a question: the PR is falling behind. Do you prefer that I rebase or merge main in (there is one conflict to resolve, too)? Thank you.
> I also have a question: the PR is falling behind. Do you prefer that I rebase or merge main in (there is one conflict to resolve, too)?
I think merging works a bit better than rebasing once review has started, so if possible and practical, please merge.
> I think merging works a bit better than rebasing once review has started, so if possible and practical, please merge.
Of course, I will merge; thank you.
@iglendd I would like to draw some attention to my previous comment https://github.com/DataDog/datadog-agent/pull/22771#discussion_r1514260175
I just realized one thing about allowing arbitrary JQ functionality to run against the status JSON output: once customers start to use that functionality, we won't be able to modify the status JSON output without fear of breaking any telemetry `extra` custom metrics.
I really want to surface that to see if we could consider my suggestion of shipping the first experimental version with support for `extra`.
I'm more than happy to chat about this via Slack, Zoom, or whatever works best for you.
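To make the coupling concrete, here is a hedged sketch assuming the `extra` feature evaluates user-supplied jq-style expressions against the status JSON (gojq and the `runnerStats` field names are used purely for illustration, not taken from this PR):

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/itchyny/gojq"
)

func main() {
	// A shape the status JSON might have today (illustrative field names).
	status := []byte(`{"runnerStats": {"Workers": {"Count": 4}}}`)
	var doc map[string]any
	if err := json.Unmarshal(status, &doc); err != nil {
		panic(err)
	}

	// A customer-defined query pinned to today's status JSON shape.
	query, err := gojq.Parse(".runnerStats.Workers.Count")
	if err != nil {
		panic(err)
	}
	iter := query.Run(doc)
	v, _ := iter.Next()
	fmt.Println(v) // 4 today; nil (silently broken) if the field is ever renamed
}
```

This is the hazard: renaming or restructuring any status field silently breaks every saved customer query against it.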
Test changes on VM
Use this command from test-infra-definitions to manually test this PR changes on a VM:
`inv create-vm --pipeline-id=32545955 --os-family=ubuntu`
Regression Detector
Regression Detector Results
Run ID: cc14be03-6c0d-445b-9b89-58b659a221be
Baseline: 6b8a81560496b9745bebd423039529d46ae9116d
Comparison: 7545f9be2be320279af9758bf29e673e608db231
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
No significant changes in experiment optimization goals
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
Experiments ignored for regressions
Regressions in experiments with settings containing `erratic: true` are ignored.
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ➖ | file_to_blackhole | % cpu utilization | +1.07 | [-4.89, +7.04] |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.47 | [+1.38, +1.56] |
| ➖ | file_to_blackhole | % cpu utilization | +1.07 | [-4.89, +7.04] |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.71 | [-2.25, +3.67] |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | +0.51 | [-4.42, +5.44] |
| ➖ | basic_py_check | % cpu utilization | +0.35 | [-2.21, +2.91] |
| ➖ | process_agent_standard_check | memory utilization | +0.18 | [+0.13, +0.23] |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.04 | [+0.01, +0.07] |
| ➖ | process_agent_real_time_mode | memory utilization | +0.04 | [-0.01, +0.08] |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.02 | [-0.19, +0.22] |
| ➖ | trace_agent_json | ingress throughput | +0.01 | [-0.01, +0.04] |
| ➖ | trace_agent_msgpack | ingress throughput | -0.01 | [-0.02, -0.00] |
| ➖ | idle | memory utilization | -0.23 | [-0.27, -0.18] |
| ➖ | process_agent_standard_check_with_stats | memory utilization | -0.33 | [-0.37, -0.29] |
| ➖ | otel_to_otel_logs | ingress throughput | -0.46 | [-0.91, -0.01] |
| ➖ | file_tree | memory utilization | -0.69 | [-0.81, -0.57] |
> @iglendd I would like to draw some attention to my previous comment #22771 (comment)
> I just realized one thing about allowing arbitrary JQ functionality to run against the status JSON output. Once customers start to use that functionality we won't be able to modify the status JSON output without fear of breaking any telemetry `extra` custom metrics. I really want to surface that to see if we could consider my suggestion of shipping the first experimental version with support for `extra`. I'm more than happy to chat about this via Slack, Zoom or whatever works best for you.
Chatted OOTB
Knowing that the feature is experimental and subject to change, I would like us to reconsider in future iterations the pattern of sourcing telemetry from the status component, in favor of each component registering its own telemetry metrics, ensuring decoupling between components. Today the telemetry component has a strong dependency on the status component: adding new information to agent telemetry means adding it to a component's status provider. If we keep that in mind and in future iterations are able to remove that dependency, I think we would be in a better position to evolve the telemetry component in isolation.
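A rough sketch of the decoupled direction suggested here, with every name hypothetical: each component registers its own telemetry provider, so the telemetry component no longer needs to reach into status providers.

```go
package telemetry

// Provider is implemented by each component that wants to report telemetry.
type Provider interface {
	// Name identifies the owning component, e.g. "dogstatsd" or "logs".
	Name() string
	// Payload returns the component's own telemetry snapshot.
	Payload() (map[string]any, error)
}

// Registry collects providers registered by individual components.
type Registry struct {
	providers []Provider
}

func (r *Registry) Register(p Provider) {
	r.providers = append(r.providers, p)
}

// Collect gathers telemetry from every registered component, keeping the
// telemetry component independent of any single component's status output.
func (r *Registry) Collect() map[string]map[string]any {
	out := make(map[string]map[string]any, len(r.providers))
	for _, p := range r.providers {
		payload, err := p.Payload()
		if err != nil {
			continue // skip components that fail to report
		}
		out[p.Name()] = payload
	}
	return out
}
```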
Definitely. I am also planning to start using our DefaultForwarder to replace a portion of the sender (cc @hush-hush).
One more thing that would be interesting to add to the PR description is a more detailed explanation of the QA steps, e.g.:
> Paste `source:agent-telemetry` in a search bar

Which search bar? The more detailed the information on how to do the QA, the better.
I have added a few more items to the steps; I hope it is a bit better.
/merge
:steam_locomotive: MergeQueue
This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with the /remove command.
Use /merge -c to cancel this operation!
/merge
:steam_locomotive: MergeQueue
This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with the /remove command.
Use /merge -c to cancel this operation!
:warning: MergeQueue
This merge request was unqueued
If you need support, contact us on Slack #devflow!