Add ARC support
Description
This PR introduces an Adaptive Request Concurrency (ARC) controller to the exporterhelper.
When enabled via the new sending_queue.arc.enabled flag, this controller dynamically manages the number of concurrent export requests, effectively overriding the static num_consumers setting. It adjusts the concurrency limit based on observed RTT (Round-Trip Time) and backpressure signals (e.g., HTTP 429/503, gRPC ResourceExhausted/Unavailable).
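As an illustration of the backpressure signals named above, the following is a minimal sketch of how such responses might be classified. The function names and the locally mirrored gRPC code constants are assumptions made to keep the example stdlib-only; they are not the helpers in internal/experr/back_pressure.go.

```go
package main

import (
	"fmt"
	"net/http"
)

// Numeric values of the standard gRPC status codes, mirrored here
// (hypothetical constant names) so the sketch needs no gRPC dependency.
const (
	grpcResourceExhausted uint32 = 8
	grpcUnavailable       uint32 = 14
)

// isHTTPBackpressure reports whether an HTTP status is one of the
// backpressure signals listed in the description (429 or 503).
func isHTTPBackpressure(status int) bool {
	return status == http.StatusTooManyRequests ||
		status == http.StatusServiceUnavailable
}

// isGRPCBackpressure reports whether a gRPC status code is
// ResourceExhausted or Unavailable.
func isGRPCBackpressure(code uint32) bool {
	return code == grpcResourceExhausted || code == grpcUnavailable
}

func main() {
	fmt.Println(isHTTPBackpressure(429), isHTTPBackpressure(500))
	fmt.Println(isGRPCBackpressure(14), isGRPCBackpressure(13))
}
```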
The controller follows an AIMD (Additive Increase, Multiplicative Decrease) pattern to find the optimal concurrency limit, maximizing throughput during healthy operation and automatically backing off upon detecting export failures or RTT spikes.
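To make the AIMD pattern concrete, here is a minimal sketch of a limit adjuster. The type name, the step of 1, and the decrease factor are illustrative assumptions and do not reflect the actual parameters or structure of the controller in internal/arc; the congested flag stands in for whatever combination of export failures and RTT spikes the real controller evaluates.

```go
package main

import "fmt"

// aimdLimit is a hypothetical, simplified AIMD state holder.
type aimdLimit struct {
	limit    int     // current concurrency limit
	min, max int     // bounds the limit is clamped to
	decrease float64 // multiplicative factor applied on congestion (0 < decrease < 1)
}

// adjust applies one AIMD step and returns the new limit.
// On congestion the limit is multiplied by the decrease factor;
// otherwise it grows by one, up to max.
func (a *aimdLimit) adjust(congested bool) int {
	if congested {
		next := int(float64(a.limit) * a.decrease)
		if next < a.min {
			next = a.min
		}
		a.limit = next
	} else if a.limit < a.max {
		a.limit++
	}
	return a.limit
}

func main() {
	a := &aimdLimit{limit: 10, min: 1, max: 12, decrease: 0.5}
	fmt.Println(a.adjust(false)) // healthy period: additive increase
	fmt.Println(a.adjust(true))  // congestion: multiplicative decrease
}
```

The asymmetry is the point of the design: the limit climbs slowly while exports are healthy and drops quickly when the backend signals trouble.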
This feature is disabled by default and introduces no behavior change unless explicitly enabled. It also adds a new set of otelcol_exporter_arc_* metrics (detailed in the documentation) for observing its behavior.
Link to tracking issue
Fixes #14080
Testing
- Added comprehensive unit tests for the core ARC logic in internal/arc/controller_test.go, covering additive increase and multiplicative decrease (TestAdjustIncreaseAndDecrease) and the cold-start backoff heuristic (TestEarlyBackoffOnColdStart).
- Added specific unit tests for the new shrinkSem (a custom shrinkable semaphore) to validate its concurrency, prioritization, and shutdown safety.
- Added a critical test (TestController_Shutdown_UnblocksWaiters) to ensure that any goroutines blocked on Acquire are correctly unblocked with a shutdown error, preventing collector hangs.
- Added a new integration test in internal/queue_sender_test.go (TestQueueSender_ArcAcquireWaitMetric) that validates the end-to-end flow. It confirms that when the limit is reached, new requests block on Acquire and the exporter_arc_acquire_wait_ms metric records the wait time.
- Added unit tests for the new internal/experr/back_pressure.go utility to verify its detection logic.
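For readers unfamiliar with the idea, the sketch below shows one way a shrinkable semaphore with shutdown-safe waiters could be built. It is an assumption-laden illustration, not the shrinkSem from this PR: the type and method names are hypothetical, it uses a sync.Cond rather than whatever mechanism the PR uses, and it omits the prioritization the tests above mention.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errShutdown is a hypothetical sentinel returned to waiters on shutdown.
var errShutdown = errors.New("semaphore shut down")

// shrinkableSem is a simplified semaphore whose limit can change at runtime.
type shrinkableSem struct {
	mu     sync.Mutex
	cond   *sync.Cond
	limit  int
	inUse  int
	closed bool
}

func newShrinkableSem(limit int) *shrinkableSem {
	s := &shrinkableSem{limit: limit}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Acquire blocks until a permit is free or the semaphore is shut down.
func (s *shrinkableSem) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for !s.closed && s.inUse >= s.limit {
		s.cond.Wait()
	}
	if s.closed {
		return errShutdown
	}
	s.inUse++
	return nil
}

// Release returns a permit and wakes waiters so they can re-check the limit.
func (s *shrinkableSem) Release() {
	s.mu.Lock()
	if s.inUse > 0 {
		s.inUse--
	}
	s.mu.Unlock()
	s.cond.Broadcast()
}

// SetLimit grows or shrinks the limit. Shrinking never revokes permits
// already held; it only delays new acquisitions until inUse drops below it.
func (s *shrinkableSem) SetLimit(n int) {
	s.mu.Lock()
	s.limit = n
	s.mu.Unlock()
	s.cond.Broadcast()
}

// Shutdown unblocks every waiter with errShutdown.
func (s *shrinkableSem) Shutdown() {
	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()
	s.cond.Broadcast()
}

func main() {
	s := newShrinkableSem(1)
	_ = s.Acquire() // takes the only permit
	done := make(chan error, 1)
	go func() { done <- s.Acquire() }() // blocks, or sees shutdown
	s.Shutdown()
	fmt.Println(<-done)
}
```

In this sketch, any Acquire call that is blocked or that arrives after Shutdown returns the sentinel error, which is the property the shutdown test above is described as checking.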
Documentation
- Updated exporterhelper/README.md to include the new sending_queue.arc block with all its configuration options.
- Updated exporterhelper/metadata.yaml to define all new otelcol_exporter_arc_* metrics, which are in turn reflected in the generated documentation.md.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: raghu999 / name: Raghu999 (5d47d747692e8c965041c275aead73543bc96cad, 9d9674fede1a6a0f0a854aae895b0153c312477a)
# HELP otelcol_exporter_arc_acquire_wait_ms_milliseconds Time a worker waited to acquire an ARC permit. [Alpha]
# TYPE otelcol_exporter_arc_acquire_wait_ms_milliseconds histogram
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="0"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="25"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="50"} 115
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="75"} 115
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="100"} 115
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="250"} 117
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="500"} 125
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="750"} 133
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="1000"} 137
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="2500"} 156
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5000"} 210
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="7500"} 221
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10000"} 221
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="+Inf"} 221
otelcol_exporter_arc_acquire_wait_ms_milliseconds_sum{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 301819
otelcol_exporter_arc_acquire_wait_ms_milliseconds_count{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 221
# HELP otelcol_exporter_arc_limit Current ARC dynamic concurrency limit. [Alpha]
# TYPE otelcol_exporter_arc_limit gauge
otelcol_exporter_arc_limit{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 10
# HELP otelcol_exporter_arc_limit_changes_total Number of times ARC changed its concurrency limit. [Alpha]
# TYPE otelcol_exporter_arc_limit_changes_total counter
otelcol_exporter_arc_limit_changes_total{data_type="traces",direction="up",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 9
# HELP otelcol_exporter_arc_permits_in_use Number of permits currently acquired. [Alpha]
# TYPE otelcol_exporter_arc_permits_in_use gauge
otelcol_exporter_arc_permits_in_use{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 10
# HELP otelcol_exporter_arc_rtt_ms_milliseconds Request round-trip-time measured by ARC (from permit acquire to release). [Alpha]
# TYPE otelcol_exporter_arc_rtt_ms_milliseconds histogram
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="0"} 0
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="25"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="50"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="75"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="100"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="250"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="500"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="750"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="1000"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="2500"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5000"} 109
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="7500"} 205
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10000"} 206
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="+Inf"} 211
otelcol_exporter_arc_rtt_ms_milliseconds_sum{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 1.045341e+06
otelcol_exporter_arc_rtt_ms_milliseconds_count{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 211
go tool -modfile /Users/rchall201/work/observability/unified-ingest/platform/opentelemetry-collector/internal/tools/go.mod gotestsum --packages="./..." -- -timeout 240s -race
✓ internal/hosttest (cached)
∅ internal/oteltest
✓ internal/experr (cached)
✓ internal/metadatatest (cached)
✓ internal/metadata (cached)
∅ internal/requesttest
✓ internal/queue (cached)
✓ internal/request (cached)
✓ internal/sender (cached)
✓ internal/sendertest (cached)
✓ internal/queuebatch (cached)
∅ internal/storagetest
✓ internal/sizer (cached)
✓ . (1.415s)
✓ internal/arc (1.918s)
✓ internal (2.517s)
DONE 412 tests in 2.570s
Added the test results and generated metrics
This was brought up at KubeCon. I think we need to discuss this at a SIG meeting.
CodSpeed Performance Report
Merging #14144 will improve performances by 35.05%
Comparing raghu999:arc-feature (b541b7a) with main (c44a402)
:warning: Unknown Walltime execution environment detected
Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.
For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.
Summary
⚡ 1 improvement
✅ 115 untouched
Benchmarks breakdown
| | Benchmark | BASE | HEAD | Change |
|---|---|---|---|---|
| ⚡ | BenchmarkSplittingBasedOnItemCountHugeLogs | 46.7 ms | 34.6 ms | +35.05% |