[Meta] DLQ age retention mechanism introduction and FIFO fixes

Open andsel opened this issue 2 years ago • 0 comments

Overview

This issue describes the introduction of an age retention policy in the DLQ, plus a change in how the DLQ behaves when the queue is full. The age retention policy, once set, drops all messages older than a certain amount of time, regardless of whether the queue is full or not. In the current implementation of the DLQ, when the queue full condition is reached, the queue drops new events and logs an error. This behavior should be configurable to act like a FIFO when age retention is requested.

DLQ size reduction strategy

The DLQ is composed of segment files, and an event can't span multiple segments. An effective way to reduce the size, or to respect an age-based policy, is to free space by simply removing segment files. While the writer is not impacted by this, the reader side could see some discontinuity in the flow of events. Suppose a reader is accessing a segment while the queue full condition is met. The writer, or some other entity in the case of age retention, kicks in and removes some segment files. The reader part of the DLQ must cope with this condition by moving the read pointer forward to the next available tail segment. In terms of concurrency between reader and writer there are no big concerns, because the reader can only access sealed segments, that is, segments the writer has finished writing.
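A minimal sketch of how a reader might pick the segment to continue from after some segment files have been removed. This is an illustration, not the Logstash implementation: the function name next_available_segment is hypothetical, and it assumes sealed segments are identified by monotonically increasing integer ids.

```python
from typing import Optional, Sequence


def next_available_segment(current_id: int, existing_ids: Sequence[int]) -> Optional[int]:
    """Return the id of the segment the reader should continue from.

    Assumption: segment ids grow over time, so a larger id means a newer segment.
    If the segment being read still exists, keep reading it. Otherwise jump to
    the oldest surviving segment that is newer than the one that was removed.
    Returns None when no such segment exists yet.
    """
    if current_id in existing_ids:
        return current_id
    newer = [seg_id for seg_id in existing_ids if seg_id > current_id]
    return min(newer) if newer else None
```

With this shape the reader skips over the gap left by deleted segments instead of failing on a missing file; events in the removed segments are simply never delivered to it.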

Enforcement of FIFO behavior

The current behavior of the DLQ, when it reaches dead_letter_queue.max_bytes, is to drop new events. This trades newer events for older ones, implicitly assigning more value to older events. The FIFO behavior should be explicitly enabled by the user; once enabled, older events are dropped when the queue full condition is reached.
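The difference between the two behaviors can be illustrated with a toy in-memory model. Everything here is an assumption made for the example: the class ToyDLQ, the events_per_segment parameter, and the choice to model a segment as a list of event sizes. It only shows the intended trade-off (drop the incoming event versus drop the oldest sealed segment), not how Logstash stores data.

```python
from collections import deque
from typing import Deque, List


class ToyDLQ:
    """Toy queue: each segment is a list of event sizes in bytes (hypothetical model)."""

    def __init__(self, max_bytes: int, storage_policy: str = "drop_newer",
                 events_per_segment: int = 2) -> None:
        if storage_policy not in ("drop_newer", "fifo"):
            raise ValueError(f"unknown storage policy: {storage_policy!r}")
        self.max_bytes = max_bytes
        self.storage_policy = storage_policy
        self.events_per_segment = events_per_segment
        # Oldest segment on the left; the rightmost one is still being written.
        self.segments: Deque[List[int]] = deque([[]])
        self.dropped_events = 0

    def size(self) -> int:
        return sum(sum(segment) for segment in self.segments)

    def write(self, event_size: int) -> bool:
        """Store an event; return False if the incoming event was dropped."""
        while self.size() + event_size > self.max_bytes:
            # drop_newer, or nothing sealed left to remove: reject the new event.
            if self.storage_policy == "drop_newer" or len(self.segments) == 1:
                self.dropped_events += 1
                return False
            # fifo: free space by removing the oldest segment as a whole.
            removed = self.segments.popleft()
            self.dropped_events += len(removed)
        if len(self.segments[-1]) >= self.events_per_segment:
            self.segments.append([])  # seal the current segment, open a new one
        self.segments[-1].append(event_size)
        return True
```

Note that in the fifo case space is reclaimed one whole segment at a time, so a single write can discard several older events.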

Age retention policy

The age retention policy can be enabled by configuring an age period in the option dead_letter_queue.retain.age, which can be expressed in various time units: days, hours, minutes, seconds. The policy is checked every time an event is inserted into the DLQ. The check verifies whether the oldest segment file contains only events older than the retention period; if so, the segment file is deleted. The policy is also checked once a day. This is necessary because a queue might not receive events for many days, and the policy has to be enforced even when no events are flowing in.
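The check described above could look roughly like the following sketch. The age syntax accepted by parse_age (a number followed by d, h, m or s, such as "2d") is an assumption, since this issue does not specify the format; the function names and the representation of a segment as a list of event timestamps are also hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Assumed unit suffixes; the issue only lists days, hours, minutes, seconds.
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_age(text: str) -> timedelta:
    """Parse an age such as '2d' or '30m' (assumed syntax) into a timedelta."""
    match = re.fullmatch(r"(\d+)\s*([dhms])", text.strip())
    if match is None:
        raise ValueError(f"unsupported age value: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def segment_expired(event_timestamps: List[datetime], retain_age: timedelta,
                    now: datetime) -> bool:
    """True only if every event in the segment is older than the retention period."""
    if not event_timestamps:
        return False
    return max(event_timestamps) < now - retain_age


def drop_oldest_if_expired(segments: List[List[datetime]], retain_age: timedelta,
                           now: datetime) -> bool:
    """Delete the oldest segment (index 0) if it is fully expired."""
    if segments and segment_expired(segments[0], retain_age, now):
        del segments[0]
        return True
    return False
```

In this sketch drop_oldest_if_expired would be called both on every insert and from a once-a-day timer, so that an idle queue is still cleaned up.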

Metrics

Some metrics have to be added to make evident how the DLQ behaves, and to inform the user how much data has been deleted by applying the age retention policy.

  • Storage policy (FIFO, drop newer events)
  • Retention Policy configuration
  • Max Size of the DLQ
  • Current Size of the DLQ
  • Retention Policy (Metrics on the number of events being dropped)

Limitations

Because segment files can be removed, the reader part of the DLQ has to cope with this logic. This means the reader of a DLQ must use a Logstash version that handles this feature, so ideally the producer and consumer sides should run the same version of Logstash.

Configuration settings

  • dead_letter_queue.storage_policy: defaults to drop_newer, the current behavior. To enable FIFO use fifo.
  • dead_letter_queue.retain.age: if populated, contains the age for which events are kept in the DLQ.

Implementation plan

  • [x] introduce a flag to enable the FIFO behavior and drop the tail segment file when the queue full condition is reached. Also adapt the reader part to cope with this new behavior. #13923
  • [x] implement the age retention policy #14255
  • [x] expose error counters and error string, as requested in #14010 PR #14058
  • [x] Add documentation

andsel • Mar 18 '22 11:03

Released with 8.4.0

andsel • Aug 17 '22 14:08