
Fluent Bit retry strategy unnecessarily loses logs with Loki after outage

Open jtackaberry opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.

Consider the scenario when Fluent Bit is configured to buffer to storage and uses Loki as an output.

First, it's necessary to understand a key limitation of Loki. This blog post explains how Loki handles out-of-order delivery, and the relevant bit is in the Out-of-order window section:

In the opening I mentioned that the window for out-of-order data is:

max_chunk_age / 2

max_chunk_age can be found in the ingester config. The default here is two hours. I strongly discourage increasing this value to create a larger out-of-order window.

This means the effective window for out-of-order delivery of a properly configured Loki deployment (including Grafana Cloud) is 1 hour.
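For concreteness, the settings behind that window live in the Loki configuration; a minimal sketch showing the default values mentioned above:

```yaml
# Loki configuration (defaults shown) -- the out-of-order window is max_chunk_age / 2.
ingester:
  max_chunk_age: 2h        # default 2h, so the effective out-of-order window is 1h
limits_config:
  unordered_writes: true   # default since Loki 2.4; this is what allows any out-of-order window at all
```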

Fluent Bit's storage buffering should, in principle, allow it to tolerate Loki outages longer than 1 hour, but this doesn't work in practice.

After Loki becomes available again, Fluent Bit resumes sending recent data and begins delivering older buffered data concurrently as part of the scheduler's retries. As soon as Loki receives recent/current data, the out-of-order window is shifted forward. Only chunks containing data within an hour of the most recent delivered timestamp will be accepted. Everything else will be rejected.
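For reference, a minimal sketch of the kind of Fluent Bit setup being described here, with filesystem buffering and the Loki output (the paths, host, and size limits are placeholders):

```ini
[SERVICE]
    flush                      1
    # filesystem buffering so chunks survive a Loki outage
    storage.path               /var/log/flb-storage/
    storage.sync               normal
    storage.backlog.mem_limit  50M

[INPUT]
    name          tail
    # placeholder path
    path          /var/log/app/*.log
    storage.type  filesystem

[OUTPUT]
    name                      loki
    match                     *
    # placeholder endpoint
    host                      loki.example.net
    port                      3100
    labels                    job=fluent-bit
    # cap the on-disk backlog kept for this output
    storage.total_limit_size  20G
    # keep retrying buffered chunks after the outage ends
    retry_limit               no_limits
```

With this configuration everything buffered during the outage does get retried, but in whatever order the scheduler picks, which is exactly what runs into the out-of-order window described above.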

Describe the solution you'd like

Ideally, Fluent Bit would be configurable such that:

  • In the face of an output (Loki) failure, it will continue to accept input and buffer to storage (as today)
  • When the output begins functioning again (Loki outage has ended), continue to buffer recent data to storage but do not send it to Loki. Instead, retry the oldest set of chunks within some configurable window of time. (In Loki's case, that window is max 1 hour, but I'd probably configure it to 30 minutes for a safety margin.)
  • Only start sending current incoming data once all chunks older than the last-seen timestamp minus the configurable window have been delivered, or have been abandoned after reaching their retry limit.
  • Ideally, once the Loki outage has ended, Fluent Bit replays older logs as fast as possible (while respecting HTTP 429) to catch up to the window where it can resume sending current logs. This probably requires additional configuration and scheduler logic to distinguish between an output that is still having issues and one that has fully recovered and is ingesting at full rate. In the latter case I would want Fluent Bit to replay much more quickly than normal retries, which use exponential backoff (the backoff knobs that exist today are sketched after this list).
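For reference, the scheduler knobs that exist today only shape the retry backoff itself; none of them express a notion of ordered replay. A minimal sketch of what is currently tunable (values are arbitrary examples, not recommendations):

```ini
[SERVICE]
    # Existing retry/backoff tuning only; this does not provide the
    # ordered-replay behaviour requested above. Values are examples.
    scheduler.base  3
    scheduler.cap   30

[OUTPUT]
    name         loki
    match        *
    # retry failed chunks indefinitely (or set an integer limit)
    retry_limit  no_limits
```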

AFAICT this requires some critical changes to Fluent Bit's behavior:

  • There are conditions under which some data (recent data in particular) is deliberately not yet delivered to Loki. Here the condition is that there are pending chunks older than the configurable window.
  • Not all pending chunks are eligible for retry at any given moment. Only the chunks that contain data between the oldest undelivered timestamp and the configurable window would be eligible.
  • New scheduler and/or output behavior to more aggressively retry old chunks after the output is healthy again.

(I admit I'm not entirely clear on how much of this is up to the scheduler and how much is up to the output. Apologies if I've made some incorrect assumptions.)

Describe alternatives you've considered

I'm not aware of any alternatives that don't involve swapping Fluent Bit for e.g. Grafana Agent.

Additional context

Outages in our Loki platform or misconfigurations that cause Loki to reject logs from Fluent Bit have caused us to lose log data. The aim here is to avoid this data loss by holding off on recent data until Fluent Bit has a chance to catch up to realtime in a more linear fashion (linear in the sense that a configurably large sliding window progresses forward).

jtackaberry avatar Aug 28 '24 18:08 jtackaberry

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Dec 15 '24 02:12 github-actions[bot]

Not stale

jtackaberry avatar Dec 15 '24 02:12 jtackaberry

We're experiencing the same (or a similar) issue, where Loki loses a lot of logs (((

ixti avatar Jan 22 '25 05:01 ixti

This would be great!

With the current behaviour, I can't see how we could recover from a Loki outage of more than ~60 minutes. While evaluating fluent-bit and Loki, we often had something break over the weekend: fluent-bit would fill up its buffers, but Loki would then discard most of the old logs.

The (unmaintained) Fluent-Bit Loki Community Plugin supports using dque to maintain order of events.

outofrange avatar Mar 03 '25 13:03 outofrange

I just noticed that Loki 3.4.0, released about a month ago, now supports Time sharding ingestion, although it's not enabled by default. Time sharding adds a sharding label to all incoming logs older than 40m, grouping them into blocks no larger than max_chunk_age / 2 and thus avoiding out-of-order issues. Sounds like a very good idea to me, but I definitely want to test whether my ingester can keep up with all the new (temporary) streams after longer outages without running into OOM.
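In case it helps anyone else testing this: my understanding from the 3.4 config reference is that time sharding is enabled per-tenant under limits_config, roughly like the sketch below. Treat the exact key names as an assumption and double-check them against the current docs:

```yaml
# Loki >= 3.4 -- enable time-sharded ingestion of old logs (key names per my
# reading of the 3.4 config reference; verify before relying on them).
limits_config:
  shard_streams:
    enabled: true                 # stream sharding must be on
    time_sharding_enabled: true   # shard logs older than the ignore-recent window (40m by default)
```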

Maybe it's worth adding something to the out_loki docs about how fluent-bit orders (or doesn't order) events, and recommending that time sharding be enabled?

@jtackaberry do you think this would solve the issue for you as well?

outofrange avatar Mar 05 '25 12:03 outofrange

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jun 04 '25 02:06 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Jun 09 '25 02:06 github-actions[bot]