Throughput debugging improvements: queue health status

Open yaauie opened this issue 4 years ago • 0 comments

When troubleshooting throughput issues, it is often helpful to determine whether the bottleneck is the input's ability to produce or the pipeline workers' collective ability to consume, because the location of the bottleneck determines our tuning options for optimizing performance:

  • when the pipeline's workers cannot keep up with the input(s), total throughput can often be improved by increasing the number of pipeline workers and/or the number of events per batch (at the cost of additional memory consumption).
  • when the inputs cannot produce events as fast as the pipeline workers process them, performance improvement strategies depend greatly on the "upstream" technologies and specific input plugins being used.

Currently, differentiating these two situations depends greatly on which queueing system is in use for the pipeline:

  • when the pipeline is using the (default) in-memory queue, inputs will spend time blocked waiting to push into the queue when the workers are unable to keep up (node stats API: ${pipeline}.events.queue_push_duration_in_millis)
  • inputs pushing to the persistent queue will not block until that queue is completely full, because the PQ can absorb some amount of back-pressure (node stats API: ${pipeline}.queue.data.events + ${pipeline}.events.queue_push_duration_in_millis)
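As a rough illustration of how the two signals above might be derived today, the sketch below compares two successive readings of a pipeline's stats. The `Sample` type, its field layout, and the helper names are hypothetical; only `queue_push_duration_in_millis` and the queued event count are taken from the metrics named above, and fetching the readings from the node stats API is left out. It assumes the push-duration counter is cumulative, so if several input threads are blocked at once the ratio may exceed 1.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of a pipeline's stats (hypothetical container type)."""
    taken_at_millis: int                 # wall-clock time of the reading
    queue_push_duration_in_millis: int   # cumulative time inputs spent pushing
    queued_events: int = 0               # PQ backlog; 0 for the memory queue


def blocked_ratio(prev: Sample, curr: Sample) -> float:
    """Share of the sampling interval that inputs spent pushing to the queue."""
    elapsed = curr.taken_at_millis - prev.taken_at_millis
    if elapsed <= 0:
        raise ValueError("samples must be taken in increasing time order")
    pushed = (curr.queue_push_duration_in_millis
              - prev.queue_push_duration_in_millis)
    # A counter reset (e.g. pipeline reload) would make the delta negative;
    # treat that as "no blocking observed" rather than a negative ratio.
    return max(pushed, 0) / elapsed


def backlog_growth(prev: Sample, curr: Sample) -> int:
    """Change in queued events between readings (meaningful for the PQ)."""
    return curr.queued_events - prev.queued_events
```

With the memory queue only `blocked_ratio` is informative; with the PQ, `backlog_growth` has to be watched as well, because the push duration stays low until the queue is full.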

Consider creating a new unified metric indicating the queue's health (perhaps ${pipeline}.queue.status) that allows each queue type to provide its own implementation-specific interpretation:

  • green: pipeline workers are generally able to keep up with the inputs
    • memory: no significant time is spent by inputs being blocked pushing to the queue
    • pq: unacked event count is low relative to recent throughput, and/or the most recent unacked item in the queue is within a certain timeframe
  • yellow: the inputs are consistently experiencing some back-pressure or have recently been producing work faster than the outputs can consume it
    • memory: significant time is spent by the inputs blocked pushing to the queue
    • pq: unacked event count is consistently growing and has not recently seen significant drops, or the inputs are experiencing back-pressure
  • red: the inputs are spending the majority of their time being blocked pushing to the queue.
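To make the proposal more concrete, here is a minimal sketch of how such a status could be computed. It is not an existing Logstash API: the function names, the 5% "significant" threshold, and the rule that three or more strictly increasing backlog readings count as "consistently growing" are all assumptions chosen for illustration. The queue type strings follow the values of the `queue.type` setting (`memory`, `persisted`).

```python
from typing import Sequence

SIGNIFICANT_BLOCKED = 0.05  # hypothetical cut-off for "significant time"
MAJORITY_BLOCKED = 0.5      # "the majority of their time" blocked


def queue_status(queue_type: str,
                 blocked_ratio: float,
                 backlog_samples: Sequence[int] = ()) -> str:
    """Map queue observations to green / yellow / red.

    blocked_ratio: fraction of recent time inputs spent blocked on push.
    backlog_samples: recent unacked event counts, oldest first (PQ only).
    """
    if blocked_ratio > MAJORITY_BLOCKED:
        return "red"
    if blocked_ratio >= SIGNIFICANT_BLOCKED:
        return "yellow"
    # The PQ absorbs back-pressure, so a growing backlog is the early signal.
    if queue_type == "persisted" and _consistently_growing(backlog_samples):
        return "yellow"
    return "green"


def _consistently_growing(samples: Sequence[int]) -> bool:
    """Simplification: every reading is larger than the one before it."""
    if len(samples) < 3:
        return False
    return all(later > earlier
               for earlier, later in zip(samples, samples[1:]))
```

A real implementation would likely smooth these inputs over a time window, and the PQ variant could also consider the age of unacked items as suggested above.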

yaauie • May 27 '20 16:05