
exporterqueue.asyncQueue won't start if the persistent queue is full, recovered dispatched requests exist, and the sending queue is configured in blocking mode

Open TrogloGeek opened this issue 8 months ago • 1 comment

Component(s)

exporter/exporterqueue

What happened?

Describe the bug: At start-up, an exporter queue configured with persistence and blocking behavior will hang forever when it has to recover unfinished dispatch requests but does not have enough remaining queue capacity.

Steps to reproduce: Configure an exporter with the following sending_queue settings (the full configuration is included below):

  • persistent storage (ex: file_storage)
  • blocking: true

You may need to enable retry_on_failure as well, depending on the throughput.

Start the collector with a configuration where your exporter cannot export (e.g. a non-answering remote), let the persistent queue fill up, and then shut down the collector while an export attempt is still in flight.

Start the collector again.

What did you expect to see? exporterqueue should start, eventually recovering everything in the queue as well as any unfinished request that was being dispatched at shutdown.

What did you see instead? A completely deadlocked pipeline: exporterqueue is waiting for the persistent queue to have available storage capacity so it can recover the unfinished requests, while the queue consumers have not yet been started.

exporterqueue.(*cond).Wait (cond.go:46) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).putInternal (persistent_queue.go:248) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).retrieveAndEnqueueNotDispatchedReqs (persistent_queue.go:444) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).initClient (persistent_queue.go:139) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).Start (persistent_queue.go:119) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*persistentQueue).Start (persistent_queue.go:114) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*asyncQueue).Start (async_queue.go:32) go.opentelemetry.io/collector/exporter/exporterqueue
exporterqueue.(*asyncQueue).Start (async_queue.go:31) go.opentelemetry.io/collector/exporter/exporterqueue
<autogenerated>:2
internal.(*QueueSender).Start (queue_sender.go:159) go.opentelemetry.io/collector/exporter/exporterhelper/internal
internal.(*BaseExporter).Start (base_exporter.go:137) go.opentelemetry.io/collector/exporter/exporterhelper/internal
<autogenerated>:2
<autogenerated>:2
graph.(*Graph).StartAll (graph.go:419) go.opentelemetry.io/collector/service/internal/graph
service.(*Service).Start (service.go:273) go.opentelemetry.io/collector/service
otelcol.(*Collector).setupConfigurationComponents (collector.go:239) go.opentelemetry.io/collector/otelcol
otelcol.(*Collector).Run (collector.go:293) go.opentelemetry.io/collector/otelcol
otelcol.NewCommand.func1 (command.go:39) go.opentelemetry.io/collector/otelcol
cobra.(*Command).execute (command.go:985) github.com/spf13/cobra
cobra.(*Command).ExecuteC (command.go:1117) github.com/spf13/cobra
cobra.(*Command).Execute (command.go:1041) github.com/spf13/cobra
main.runInteractive (main.go:55) main
main.run (main_others.go:10) main
main.main (main.go:48) main
runtime.main (proc.go:272) runtime
runtime.goexit (asm_amd64.s:1700) runtime
 - Async Stack Trace
<autogenerated>:2

Collector version

v1.26.0/v0.120.0

Environment information

Environment

OS: Ubuntu 20.04.1
Compiler (if manually compiled): go1.23.6

OpenTelemetry Collector configuration

extensions:
  file_storage/exporter:
    directory: ./data

receivers:
  hostmetrics:
    collection_interval: 100ms
    scrapers:
      cpu:

exporters:
  otlphttp:
    endpoint: http://localhost:9999 # nothing listening here
    sending_queue:
      enabled: true
      blocking: true
      num_consumers: 1
      queue_size: 5
      storage: file_storage/exporter
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0 # this helps ensure the queue is full at shutdown

service:
  extensions: [file_storage/exporter]
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlphttp]
  telemetry:
    logs:
      level: DEBUG

Log output

2025-03-04T11:25:42.415+0100    info    [email protected]/service.go:193 Setting up own telemetry...
2025-03-04T11:25:42.415+0100    debug   builders/builders.go:24 Stable component.       {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.415+0100    debug   builders/builders.go:24 Beta component. May change in the future.       {"otelcol.component.id": "hostmetrics", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.415+0100    debug   builders/extension.go:48        Beta component. May change in the future.       {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:25:42.436+0100    info    [email protected]/service.go:258 Starting otelcol-contrib...     {"Version": "0.120.1", "NumCPU": 6}
2025-03-04T11:25:42.436+0100    info    extensions/extensions.go:40     Starting extensions...
2025-03-04T11:25:42.436+0100    info    extensions/extensions.go:44     Extension is starting...        {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:25:42.436+0100    info    extensions/extensions.go:61     Extension started.      {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:25:42.454+0100    info    exporterqueue/persistent_queue.go:157   Initializing new persistent queue       {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.454+0100    debug   exporterqueue/persistent_queue.go:396   Checking if there are items left for dispatch by consumers      {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.454+0100    debug   exporterqueue/persistent_queue.go:407   No items left for dispatch by consumers {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:25:42.454+0100    info    [email protected]/service.go:281 Everything is ready. Begin running and processing data.
2025-03-04T11:25:43.456+0100    debug   [email protected]/otlp.go:177   Preparing to make HTTP request  {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "url": "http://localhost:9999/v1/metrics"}
^C2025-03-04T11:26:08.902+0100  info    [email protected]/collector.go:339       Received signal from OS {"signal": "interrupt"}
2025-03-04T11:26:08.902+0100    info    [email protected]/service.go:323 Starting shutdown...
2025-03-04T11:26:13.457+0100    info    internal/retry_sender.go:126    Exporting failed. Will retry the request after interval.        {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "error": "failed to make an HTTP request: Post \"http://localhost:9999/v1/metrics\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "2.648668752s"}
2025-03-04T11:26:16.106+0100    debug   [email protected]/otlp.go:177   Preparing to make HTTP request  {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "url": "http://localhost:9999/v1/metrics"}
Killed

Start the collector again:

2025-03-04T11:26:35.374+0100    info    [email protected]/service.go:193 Setting up own telemetry...
2025-03-04T11:26:35.374+0100    debug   builders/builders.go:24 Stable component.       {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:26:35.374+0100    debug   builders/builders.go:24 Beta component. May change in the future.       {"otelcol.component.id": "hostmetrics", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics"}
2025-03-04T11:26:35.374+0100    debug   builders/extension.go:48        Beta component. May change in the future.       {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:26:35.399+0100    info    [email protected]/service.go:258 Starting otelcol-contrib...     {"Version": "0.120.1", "NumCPU": 6}
2025-03-04T11:26:35.399+0100    info    extensions/extensions.go:40     Starting extensions...
2025-03-04T11:26:35.399+0100    info    extensions/extensions.go:44     Extension is starting...        {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:26:35.399+0100    info    extensions/extensions.go:61     Extension started.      {"otelcol.component.id": "file_storage/exporter", "otelcol.component.kind": "Extension"}
2025-03-04T11:26:35.400+0100    debug   exporterqueue/persistent_queue.go:396   Checking if there are items left for dispatch by consumers      {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics"}
2025-03-04T11:26:35.400+0100    info    exporterqueue/persistent_queue.go:411   Fetching items left for dispatch by consumers   {"otelcol.component.id": "otlphttp", "otelcol.component.kind": "Exporter", "otelcol.signal": "metrics", "numberOfItems": 1}

Now the collector is stuck waiting for a free-storage-capacity signal, but the consumers have not been started.
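The failure reduces to a startup-ordering problem: the blocking put waits on a condition variable that only the consumers ever signal, but the consumers are launched only after Start() returns. The following minimal Go sketch illustrates the pattern; boundedQueue, put, and take are illustrative names invented here, not the collector's actual types or code.

package main

import "sync"

type boundedQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	items    []int
	capacity int
}

func newBoundedQueue(capacity int) *boundedQueue {
	q := &boundedQueue{capacity: capacity}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// put blocks, like a sending_queue with blocking: true, until a consumer
// frees up capacity via take().
func (q *boundedQueue) put(item int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) >= q.capacity {
		q.cond.Wait() // only ever signaled from take(), i.e. by a consumer
	}
	q.items = append(q.items, item)
}

// take removes one item and signals a blocked producer.
func (q *boundedQueue) take() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	item := q.items[0]
	q.items = q.items[1:]
	q.cond.Signal()
	return item
}

func main() {
	// The persisted queue was restored already at its configured capacity.
	q := newBoundedQueue(1)
	q.items = []int{1}

	// "Start()": re-enqueue the request that was being dispatched at
	// shutdown. This blocks forever, because nothing can signal the cond yet.
	q.put(2)

	// Consumers would only be launched here, after Start() had returned,
	// so this code is never reached.
	go func() {
		for {
			_ = q.take()
		}
	}()
}

Running the sketch never reaches the consumer goroutine; the Go runtime will typically abort with its deadlock detector, which mirrors the hang shown in the stack trace above.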

Additional context

TrogloGeek · Mar 03 '25 12:03

Since https://github.com/open-telemetry/opentelemetry-collector/pull/13330, putInternal has been updated: it no longer manages the queue metadata and no longer performs the capacity check.
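Read as a rough sketch continuing the hypothetical example above (this is not the actual diff from that PR), the practical effect is that the recovery path no longer goes through the blocking capacity wait, so a full restored queue can still re-admit an in-flight request before the consumers start:

// Hypothetical sketch only (not the actual change in the PR): a
// recovery-time enqueue that skips the blocking capacity wait.
func (q *boundedQueue) putRecovered(item int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	// No q.cond.Wait() here: recovered items are admitted even if the
	// queue is at (or temporarily above) its configured capacity.
	q.items = append(q.items, item)
}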

I've tried it with the latest release, v0.137.0, and a full persisted queue no longer blocks startup. I suppose we can close this issue if @TrogloGeek verifies that it has been fixed.

namco1992 · Oct 15 '25 07:10