
Fluent Bit S3 output reports dropped chunks in output metrics when no chunks are dropped

Open · ashish-kumar-glean opened this issue 4 months ago · 0 comments

Bug Report

Describe the bug

This is something of a continuation of https://github.com/fluent/fluent-bit/issues/6141. The issue I linked points out that Fluent Bit's S3 output metrics are unreliable and can show false positives (reporting success even when there are failures).

I'm creating this issue to point out that the metrics can show false negatives as well (reporting failure when things are fine).

Exact scenario: fluentbit_output_dropped_records_total and fluentbit_output_errors_total report dropped records and errors even when everything is fine and no logs are being dropped.

Symptoms: The exported Fluent Bit metrics show this:

[Image: metrics graph showing non-zero dropped records and errors]

and there are these logs in the Fluent Bit logs around the same time:

[2025/06/05 08:17:48] [error] [http_client] broken connection to s3.us-east-1.amazonaws.com:443 ?
[2025/06/05 08:17:48] [ info] [output:s3:s3.0] UploadPart http status=0
[2025/06/05 08:17:48] [error] [output:s3:s3.0] UploadPart: Could not parse response
[2025/06/05 08:17:48] [error] [output:s3:s3.0] UploadPart request failed
[2025/06/05 08:18:04] [ info] [output:s3:s3.0] UploadPart http status=200

There aren't any logs from the S3 output saying "cannot retry chunk" (the presence of such logs would indicate actual log loss).

The metrics still get updated because:

My understanding is that these metrics are driven by the FLB_OK, FLB_RETRY, or FLB_ERROR value returned from the output's flush method.
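
For context, here is roughly how the engine side turns that return value into the exported counters. This is a simplified sketch of my reading of the engine code, not the literal implementation; the counter field names are the ones declared in flb_output.h:

/* Simplified sketch (not the literal engine code) of per-flush-result
 * accounting. Key point: FLB_ERROR increments BOTH
 * fluentbit_output_errors_total and fluentbit_output_dropped_records_total,
 * and there is no return code that means "failed, but safely buffered". */
static void account_flush_result(struct flb_output_instance *ins, int ret,
                                 uint64_t records, uint64_t ts)
{
    char *name = (char *) flb_output_name(ins);

    switch (ret) {
    case FLB_OK:
        cmt_counter_add(ins->cmt_proc_records, ts, records,
                        1, (char *[]) {name});
        break;
    case FLB_RETRY:
        /* the engine reschedules the chunk; the retries/retried_records
         * counters move instead */
        cmt_counter_add(ins->cmt_retried_records, ts, records,
                        1, (char *[]) {name});
        break;
    case FLB_ERROR:
        cmt_counter_inc(ins->cmt_errors, ts, 1, (char *[]) {name});
        cmt_counter_add(ins->cmt_dropped_records, ts, records,
                        1, (char *[]) {name});
        break;
    }
}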

This is the piece of code that returns FLB_ERROR in the S3 output: https://github.com/fluent/fluent-bit/blob/a8f3f507613883f8224af26a3d5d14f6b18b684f/plugins/out_s3/s3.c#L2223

s3_upload_queue(config, ctx);
if (ctx->upload_queue_success == FLB_FALSE) {
    ctx->upload_queue_success = FLB_TRUE;
    FLB_OUTPUT_RETURN(FLB_ERROR);
}

Note that, if you look at the code carefully, by the time this runs the chunk has already been buffered to a file, so it should be safely on disk. Looking at s3_upload_queue, it contains this code: https://github.com/fluent/fluent-bit/blob/a8f3f507613883f8224af26a3d5d14f6b18b684f/plugins/out_s3/s3.c#L1719

ret = send_upload_request(ctx, NULL, upload_contents->upload_file,
                          upload_contents->m_upload_file,
                          upload_contents->tag, upload_contents->tag_len);
if (ret < 0) {
    goto exit;
}
else if (ret == FLB_OK) {
    remove_from_queue(upload_contents);
    ctx->retry_time = 0;
    ctx->upload_queue_success = FLB_TRUE;
}
else {
    s3_store_file_lock(upload_contents->upload_file);
    ctx->upload_queue_success = FLB_FALSE;
}
So when the upload request fails, s3_upload_queue either returns early (ret < 0) or sets upload_queue_success to FLB_FALSE, depending on the error type, and this leads to cb_s3_flush returning FLB_ERROR. That FLB_ERROR sent back from the flush causes the metrics to be updated, but the chunk is already stored on disk and will be retried later, so no logs will actually be dropped.

I guess we don't want to return FLB_RETRY here because we don't want the engine to retry: the chunk has already been successfully stored, and an engine-level retry of this chunk would lead to duplicates. So we return FLB_ERROR instead, but that in turn causes the dropped-records metric to be incremented even though no logs are dropped.
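
One possible direction (a sketch of an idea, not a tested patch): since the chunk is already persisted in the plugin's local store, the flush could acknowledge the chunk to the engine and surface the failed attempt through a warning (or a plugin-owned metric), so dropped_records only moves when data is truly lost:

/* Sketch only, untested: the chunk is safe in the local buffer, so do not
 * report it as dropped. Trade-off: the engine then counts these records as
 * processed on the first flush, and the later in-plugin re-upload is
 * invisible to engine metrics, so a proper fix probably needs a dedicated
 * "buffered, pending retry" signal instead. */
s3_upload_queue(config, ctx);
if (ctx->upload_queue_success == FLB_FALSE) {
    ctx->upload_queue_success = FLB_TRUE;
    flb_plg_warn(ctx->ins,
                 "upload failed; chunk kept in local buffer for retry");
    FLB_OUTPUT_RETURN(FLB_OK);   /* was FLB_ERROR */
}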

Expected behavior

Metrics should be updated correctly: errors and dropped records should only be reported when data is actually lost.

Your Environment

Fluent Bit 3.2

ashish-kumar-glean · Jun 05 '25 15:06