datadog-agent
(pkg/trace) Change the telemetry proxy to respond immediately and proxy the request in the background
What does this PR do?
Instead of proxying requests and waiting for the backend response, we write a success response to the client immediately and spawn a goroutine that forwards the request in the background.
To still apply backpressure and avoid buffering too many requests when throughput exceeds what the backend accepts, this puts a limit of up to 100 concurrent forwarded telemetry requests, and responds with 429 once we're past it.
Motivation
We currently have a telemetry proxy endpoint in the trace agent. It forwards requests and waits for the backend response, then relays that response to the client. The response itself has very limited value, because the EVP intake also handles messages asynchronously.
But the backend can stall requests for up to 15s in the most extreme cases, which blocks client libraries sending telemetry due to the synchronous nature of the proxy.
This is fine for long-lived applications using APM libraries, since they can maintain a background thread that sends telemetry, but lately a lot of projects have needed fire-and-forget telemetry.
In particular, there are two cases where buffering requests and returning immediately will be very helpful:
- The SSI injector. It sends a telemetry metric on every successful injection and blocks the injector process for up to 1s. This timeout is both too large, since adding 1s to a new process's startup is very bad UX, and too short, since the EVP endpoint can have long latency spikes multiple times per day, during which we have telemetry blips.
- Crashtracking. The crashtracker can send telemetry on SIGSEGV but waits for the telemetry to be received before exiting.
Additional Notes
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Benchmarks
Benchmark execution time: 2024-06-26 18:09:01
Comparing candidate commit 663270a304acd1ae285e5cea74d96e38e15fd31d in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 6bf1f4d570fb75d0b1e54b5cb9cb033cf2743213 in branch main.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 2 metrics, 1 unstable metrics.
Regression Detector
Regression Detector Results
Run ID: 3288ebcd-9840-4970-aba4-38620dd7ca41 Metrics dashboard Target profiles
Baseline: 5461e795815382954f1f9d33c8542c3534bc7ea2 Comparison: 2228cb1f3e5988bf1f0716b771e3eb75b68b1a37
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
No significant changes in experiment optimization goals
Confidence level: 90.00% Effect size tolerance: |Δ mean %| ≥ 5.00%
There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | links |
|---|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +2.07 | [-11.06, +15.19] | Logs |
| ➖ | file_tree | memory utilization | +0.50 | [+0.46, +0.54] | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.01] | Logs |
| ➖ | idle | memory utilization | -0.32 | [-0.36, -0.28] | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.43 | [-1.31, +0.46] | Logs |
| ➖ | otel_to_otel_logs | ingress throughput | -0.44 | [-1.25, +0.37] | Logs |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | -0.61 | [-5.46, +4.24] | Logs |
| ➖ | basic_py_check | % cpu utilization | -1.75 | [-4.49, +1.00] | Logs |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
Benchmarks
Benchmark execution time: 2024-06-27 18:17:03
Comparing candidate commit 51e40d677bd8fe63e9d5e9da0c4cdf71c3adba57 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit ac6df4bac214e2ff6a242c6315418363f34abfe5 in branch main.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 2 metrics, 1 unstable metrics.
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM:
inv create-vm --pipeline-id=38438025 --os-family=ubuntu
Note: This applies to commit 2228cb1f
Benchmarks
Benchmark execution time: 2024-06-28 13:36:20
Comparing candidate commit 517943e1a4848c0c1ea779bf4e109d5b420d5d92 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 7647aca5b87645fc72924a37e4cb94d43614eb2f in branch main.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 2 metrics, 1 unstable metrics.
Benchmarks
Benchmark execution time: 2024-06-28 16:53:17
Comparing candidate commit c81d9647903cb1dd24dc580cd0b9920c1a2b1034 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit aa64500157741181fbc4d8cc9c4f28079966b776 in branch main.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 2 metrics, 1 unstable metrics.
Benchmarks
Benchmark execution time: 2024-07-02 22:36:11
Comparing candidate commit 44ec7e72a7b3eb9b9e79c9bf58d99ec4085822a4 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 69c62341339ccee319dd6592aca46af248b38dfc in branch main.
Found 0 performance improvements and 1 performance regressions! Performance is the same for 1 metrics, 1 unstable metrics.
scenario:BenchmarkAgentTraceProcessing-24
- 🟥
allocated_mem[+1.675MB; +1.901MB] or [+69.797%; +79.193%]
Benchmarks
Benchmark execution time: 2024-07-03 11:36:43
Comparing candidate commit 21fcbaf307496e31c0295956c40a24c21502d90d in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit d99badb1437dfd81789e4887cef3fd0fa0d3640e in branch main.
Found 0 performance improvements and 1 performance regressions! Performance is the same for 1 metrics, 1 unstable metrics.
scenario:BenchmarkAgentTraceProcessing-24
- 🟥
allocated_mem[+1.676MB; +1.860MB] or [+69.274%; +76.900%]
I added a goroutine pool to forward the requests. This limits the number of inflight requests as well as the number of inflight bytes.
Benchmarks
Benchmark execution time: 2024-07-03 14:28:25
Comparing candidate commit fadf080d4c6554ca5f4d1c61db0cf74cf3538ed0 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 00e95ee7723e2434ac953ecafcd57b8c016d7dbe in branch main.
Found 0 performance improvements and 1 performance regressions! Performance is the same for 1 metrics, 1 unstable metrics.
scenario:BenchmarkAgentTraceProcessing-24
- 🟥
allocated_mem[+1.602MB; +1.832MB] or [+66.130%; +75.651%]
Benchmarks
Benchmark execution time: 2024-07-03 22:28:41
Comparing candidate commit b91a918ae543e3ccff3f191e63d6c6a2f583b6ee in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 39ef7135b74badf0abd71139b2b2432e4f69ff4b in branch main.
Found 0 performance improvements and 1 performance regressions! Performance is the same for 1 metrics, 1 unstable metrics.
scenario:BenchmarkAgentTraceProcessing-24
- 🟥
allocated_mem[+1.614MB; +1.763MB] or [+66.880%; +73.077%]
Benchmarks
Benchmark execution time: 2024-07-03 22:32:54
Comparing candidate commit c1438ee94477df398471fa4b9fe4144a69cba1e9 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 39ef7135b74badf0abd71139b2b2432e4f69ff4b in branch main.
Found 0 performance improvements and 1 performance regressions! Performance is the same for 1 metrics, 1 unstable metrics.
scenario:BenchmarkAgentTraceProcessing-24
- 🟥
allocated_mem[+1.595MB; +1.806MB] or [+65.343%; +73.980%]
Benchmarks
Benchmark execution time: 2024-07-05 12:57:16
Comparing candidate commit 2228cb1f3e5988bf1f0716b771e3eb75b68b1a37 in PR branch paullgdc/telemetry/make_telemetry_proxy_async with baseline commit 5461e795815382954f1f9d33c8542c3534bc7ea2 in branch main.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.
/merge
:steam_locomotive: MergeQueue: waiting for PR to be ready
This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.
Use /merge -c to cancel this operation!
:steam_locomotive: MergeQueue: pull request added to the queue
The median merge time in main is 24m.
Use /merge -c to cancel this operation!