
docker_events: when retry_limits is -1, retries fail to wait for retry_interval

zarqman opened this issue 1 year ago

Bug Report

Describe the bug

The docker_events input fails to wait for retry_interval between retries when retry_limits is -1. Retries are attempted as fast as fluent-bit can go.

To Reproduce

fluent-bit.conf

[SERVICE]
    log_level           info
    flush               1
    daemon              off

[INPUT]
    name                docker_events
    tag                 dockerd
    unix_path           /run/docker.sock
    reconnect.retry_limits      -1
    reconnect.retry_interval    1

[OUTPUT]
    name    stdout
    match   *

Logs:
Aug 29 22:17:27 test1 fluent-bit[88337]: [2024/08/29 22:17:26] [ info] [input:docker_events:docker_events.0] EOF detected. Re-initialize
Aug 29 22:17:27 test1 fluent-bit[88337]: [2024/08/29 22:17:26] [ info] [input:docker_events:docker_events.0] EOF detected. Re-initialize

There are 100,000+ of these per second (fluent-bit is very fast 😄). These logs came from systemd (via journalctl -fu fluent-bit.service).

Steps to reproduce the problem:

With both fluent-bit and docker already running, restart docker: systemctl restart docker.
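For example (both commands appear elsewhere in this report; run them in separate shells):

systemctl restart docker             # triggers the EOF / re-initialize path
journalctl -fu fluent-bit.service    # watch the flood of reconnect attempts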

Expected behavior

Retries should be spaced out according to reconnect.retry_interval.

It appears the code presently bypasses create_reconnect_event() when retry_limits <= 0. My initial guess is that the reconnect event should not be bypassed at all: it may always be needed, and with retry_limits of -1 it should simply never give up retrying, as sketched below.
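To illustrate, here is a rough sketch of the control flow I suspect and the change I'm proposing (hypothetical pseudocode, not the actual plugin source; create_reconnect_event() is the only name taken from the code, everything else is a placeholder):

/* Suspected current behavior on EOF: the timed reconnect event is only
 * created when retry_limits is positive, so retry_limits = -1 falls
 * through to immediate, untimed reconnect attempts. */
if (ctx->retry_limits > 0) {
    create_reconnect_event(ctx);    /* honors retry_interval */
}
else {
    reconnect_immediately(ctx);     /* placeholder for the tight loop */
}

/* Proposed behavior: always schedule the timed reconnect event, and
 * treat retry_limits <= 0 as "never stop retrying" inside it. */
create_reconnect_event(ctx);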

Your Environment

Fluent Bit v3.1.6, installed from the official .deb package and running under the default systemd service definition. Docker 27.2.0, also from its official .deb package and running under systemd.

  • Version used: 3.1.6
  • Configuration: see above
  • Environment name and version: n/a
  • Server type and version: VM
  • Operating System and version: Debian 12
  • Filters and plugins: see config

Additional context

My understanding is that reconnect.retry_limits -1 is a valid way to say 'retry an unlimited number of times'. My goal is to retry at a measured pace, for as long as necessary to reconnect. I never want docker_events to just give up.

I am unsure whether the rapid loop is partly related to systemd's handling of /run/docker.sock. However, setting retry_limits to a positive integer results in a very different log:

Aug 29 22:23:56 test1 fluent-bit[88651]: [2024/08/29 22:23:56] [ info] [input:docker_events:docker_events.0] EOF detected. Re-initialize
Aug 29 22:24:02 test1 fluent-bit[88651]: [2024/08/29 22:24:02] [error] [/tmp/fluent-bit/plugins/in_docker_events/docker_events.c:69 errno=104] Connection reset by peer
Aug 29 22:24:02 test1 fluent-bit[88651]: [2024/08/29 22:24:02] [ info] [input:docker_events:docker_events.0] Reconnect successful
Aug 29 22:24:02 test1 fluent-bit[88651]: [2024/08/29 22:24:02] [ info] [input:docker_events:docker_events.0] EOF detected. Re-initialize
Aug 29 22:24:02 test1 fluent-bit[88651]: [2024/08/29 22:24:02] [error] [/tmp/fluent-bit/plugins/in_docker_events/docker_events.c:57 errno=111] Connection refused
Aug 29 22:24:02 test1 fluent-bit[88651]: [2024/08/29 22:24:02] [error] [input:docker_events:docker_events.0] failed to re-initialize socket
Aug 29 22:24:02 test1 fluent-bit[88651]: [2024/08/29 22:24:02] [ info] [input:docker_events:docker_events.0] create reconnect event. interval=1 second
Aug 29 22:24:03 test1 fluent-bit[88651]: [2024/08/29 22:24:03] [ info] [input:docker_events:docker_events.0] Retry(1/5)
Aug 29 22:24:03 test1 fluent-bit[88651]: [2024/08/29 22:24:03] [error] [/tmp/fluent-bit/plugins/in_docker_events/docker_events.c:57 errno=111] Connection refused
Aug 29 22:24:03 test1 fluent-bit[88651]: [2024/08/29 22:24:03] [error] [input:docker_events:docker_events.0] failed to re-initialize socket
Aug 29 22:24:03 test1 fluent-bit[88651]: [2024/08/29 22:24:03] [ info] [input:docker_events:docker_events.0] Failed. Waiting for next retry..
[..continues for specified number of retries..]

zarqman avatar Aug 29 '24 23:08 zarqman

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Dec 15 '24 02:12 github-actions[bot]

bump

zarqman avatar Dec 15 '24 02:12 zarqman

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Mar 25 '25 02:03 github-actions[bot]

bump

zarqman avatar Mar 25 '25 05:03 zarqman

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jun 24 '25 02:06 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Jun 29 '25 02:06 github-actions[bot]