opentelemetry-collector
exporterqueue.asyncQueue won't start if the persistent queue is full, recovered dispatched requests exist, and the sending queue is configured in blocking mode
Component(s)
exporter/exporterqueue
What happened?
Describe the bug At start-up, an exporter queue configured with persistence and blocking behavior will hang forever when it has to recover unfinished dispatch requests but does not have enough remaining queue capacity.
Steps to reproduce Configure an exporter with the following sending_queue settings:
- persistent storage (ex: file_storage)
- blocking: true
You may need to enable retry_on_failure as well, depending on the throughput.
Start the collector with a configuration where your exporter cannot export (e.g. a non-answering remote endpoint), let the persistent queue fill up, then shut the collector down while an export request is in flight.
Start the collector again.
What did you expect to see? The exporterqueue should start, eventually recovering everything in the queue as well as any unfinished request that was being dispatched at shutdown.
What did you see instead? A completely deadlocked pipeline: exporterqueue waits for the persistent queue to free up storage capacity so it can re-enqueue the unfinished requests, while the queue consumers that would free that capacity have not yet been started.
exporterqueue.(*cond).Wait (cond.go:46) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).putInternal (persistent_queue.go:248) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).retrieveAndEnqueueNotDispatchedReqs (persistent_queue.go:444) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).initClient (persistent_queue.go:139) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).Start (persistent_queue.go:119) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).Start (persistent_queue.go:114) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*asyncQueue).Start (async_queue.go:32) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*asyncQueue).Start (async_queue.go:31) go.opentelemetry.io/collector/exporter/exporterqueue
<autogenerated>:2
internal.(*QueueSender).Start (queue_sender.go:159) go.opentelemetry.io/collector/exporter/exporterhelper/internal
internal.(*BaseExporter).Start (base_exporter.go:137) go.opentelemetry.io/collector/exporter/exporterhelper/internal
<autogenerated>:2
<autogenerated>:2
graph.(*Graph).StartAll (graph.go:419) go.opentelemetry.io/collector/service/internal/graph
service.(*Service).Start (service.go:273) go.opentelemetry.io/collector/service
otelcol.(*Collector).setupConfigurationComponents (collector.go:239) go.opentelemetry.io/collector/otelcol
otelcol.(*Collector).Run (collector.go:293) go.opentelemetry.io/collector/otelcol
otelcol.NewCommand.func1 (command.go:39) go.opentelemetry.io/collector/otelcol
cobra.(*Command).execute (command.go:985) github.com/spf13/cobra
cobra.(*Command).ExecuteC (command.go:1117) github.com/spf13/cobra
cobra.(*Command).Execute (command.go:1041) github.com/spf13/cobra
main.runInteractive (main.go:55) main
main.run (main_others.go:10) main
main.main (main.go:48) main
runtime.main (proc.go:272) runtime
runtime.goexit (asm_amd64.s:1700) runtime
- Async Stack Trace
<autogenerated>:2
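For illustration, here is a minimal, self-contained Go sketch of the deadlock pattern (hypothetical code, not the collector's actual implementation): Start() re-enqueues a recovered request with a blocking put that waits on a capacity condition, but the consumer that would free capacity is only launched after Start() returns, so the wait never ends.

```go
// Hypothetical reproduction of the deadlock pattern, not the collector's code.
package main

import (
	"fmt"
	"sync"
	"time"
)

type blockingQueue struct {
	mu       sync.Mutex
	hasSpace *sync.Cond
	items    []string
	capacity int
}

func newBlockingQueue(capacity int) *blockingQueue {
	q := &blockingQueue{capacity: capacity}
	q.hasSpace = sync.NewCond(&q.mu)
	return q
}

// put blocks until there is room in the queue (the "blocking: true" behavior).
func (q *blockingQueue) put(item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) >= q.capacity {
		q.hasSpace.Wait() // only a consumer would ever signal free capacity
	}
	q.items = append(q.items, item)
}

// Start mimics the recovery step: it re-enqueues previously dispatched
// requests before any consumer exists. If the restored queue is already
// full, put() blocks here forever.
func (q *blockingQueue) Start(recovered []string) {
	for _, r := range recovered {
		q.put(r)
	}
}

func main() {
	q := newBlockingQueue(2)
	// Simulate a persisted queue that came back full from disk.
	q.items = []string{"req-1", "req-2"}

	done := make(chan struct{})
	go func() {
		q.Start([]string{"recovered-req"}) // never returns
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("started")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: Start is still waiting for capacity, consumers never started")
	}
}
```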
Collector version
v1.26.0/v0.120.0
Environment information
Environment
OS: Ubuntu 20.04.1
Compiler (if manually compiled): go1.23.6
OpenTelemetry Collector configuration
extensions:
  file_storage/exporter:
    directory: ./data
receivers:
  hostmetrics:
    collection_interval: 100ms
    scrapers:
      cpu:
exporters:
  otlphttp:
    endpoint: http://localhost:9999 # nothing listening here
    sending_queue:
      enabled: true
      blocking: true
      num_consumers: 1
      queue_size: 5
      storage: file_storage/exporter
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0 # this helps end up with a full queue at shutdown
service:
  extensions: [file_storage/exporter]
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlphttp]
  telemetry:
    logs:
      level: DEBUG
Log output
2025-03-04T11:25:42.415+0100 info [email protected]/service.go:193 Setting up own telemetry...
2025-03-04T11:25:42.415+0100 debug builders/builders.go:24 Stable component. {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.415+0100 debug builders/builders.go:24 Beta component. May change in the future. {"otelcol.component.id": "hostmetrics", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.415+0100 debug builders/extension.go:48 Beta component. May change in the future. {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:25:42.436+0100 info [email protected]/service.go:258 Starting otelcol-contrib... {"Version": "0.120.1", "NumCPU": 6}
2025-03-04T11:25:42.436+0100 info extensions/extensions.go:40 Starting extensions...
2025-03-04T11:25:42.436+0100 info extensions/extensions.go:44 Extension is starting... {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:25:42.436+0100 info extensions/extensions.go:61 Extension started. {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:25:42.454+0100 info exporterqueue/persistent_queue.go:157 Initializing new persistent queue {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.454+0100 debug exporterqueue/persistent_queue.go:396 Checking if there are items left for dispatch by consumers {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.454+0100 debug exporterqueue/persistent_queue.go:407 No items left for dispatch by consumers {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.454+0100 info [email protected]/service.go:281 Everything is ready. Begin running and processing data.
2025-03-04T11:25:43.456+0100 debug [email protected]/otlp.go:177 Preparing to make HTTP request {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "url": "http://localhost:9999/v1/metrics"}
^C2025-03-04T11:26:08.902+0100 info [email protected]/collector.go:339 Received signal from OS {"signal": "interrupt"}
2025-03-04T11:26:08.902+0100 info [email protected]/service.go:323 Starting shutdown...
2025-03-04T11:26:13.457+0100 info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "error": "failed to make an HTTP request: Post \"http://localhost:9999/v1/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "2.648668752s"}
2025-03-04T11:26:16.106+0100 debug [email protected]/otlp.go:177 Preparing to make HTTP request {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "url": "http://localhost:9999/v1/metrics"}
Killed
Start the collector again:
2025-03-04T11:26:35.374+0100 info [email protected]/service.go:193 Setting up own telemetry...
2025-03-04T11:26:35.374+0100 debug builders/builders.go:24 Stable component. {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:26:35.374+0100 debug builders/builders.go:24 Beta component. May change in the future. {"otelcol.component.id": "hostmetrics", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics"}
2025-03-04T11:26:35.374+0100 debug builders/extension.go:48 Beta component. May change in the future. {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:26:35.399+0100 info [email protected]/service.go:258 Starting otelcol-contrib... {"Version": "0.120.1", "NumCPU": 6}
2025-03-04T11:26:35.399+0100 info extensions/extensions.go:40 Starting extensions...
2025-03-04T11:26:35.399+0100 info extensions/extensions.go:44 Extension is starting... {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:26:35.399+0100 info extensions/extensions.go:61 Extension started. {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:26:35.400+0100 debug exporterqueue/persistent_queue.go:396 Checking if there are items left for dispatch by consumers {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:26:35.400+0100 info exporterqueue/persistent_queue.go:411 Fetching items left for dispatch by consumers {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "numberOfItems": 1}
Now the collector is stuck waiting for a free-storage-capacity signal, but the consumers have never been started.
Additional context
Since https://github.com/open-telemetry/opentelemetry-collector/pull/13330, putInternal has been reworked: it no longer manages the queue metadata and no longer performs the capacity check.
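For illustration, a minimal hypothetical sketch (not the code from that PR) of why dropping the capacity wait from the recovery path avoids the startup deadlock: recovered requests are appended even if the persisted queue is already at its configured size, so Start() always returns and the consumers can drain the overshoot once they are running.

```go
// Hypothetical sketch of recovery without a capacity wait, not the collector's code.
package main

import (
	"fmt"
	"sync"
)

type queue struct {
	mu       sync.Mutex
	items    []string
	capacity int
}

// restoreDispatched re-enqueues recovered in-flight requests without waiting
// on a capacity condition: the queue may temporarily exceed its capacity,
// but startup never blocks.
func (q *queue) restoreDispatched(recovered []string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, recovered...)
}

func main() {
	q := &queue{items: []string{"req-1", "req-2"}, capacity: 2} // already full
	q.restoreDispatched([]string{"recovered-req"})
	fmt.Printf("queue length after recovery: %d (capacity %d)\n", len(q.items), q.capacity)
}
```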
I've tried the latest release, v0.137.0, and a full persisted queue no longer blocks startup. I suppose we can close this issue once @TrogloGeek verifies that it has been fixed.