io: add connection backoff

Open kabakaev opened this issue 4 years ago • 3 comments

In normal operation, fluent-bit reuses TCP connections, hence new messages are flushed without sending a TCP SYN. But if an output connection cannot be established, then each flb_io_net_write() call will trigger connection setup and will send a series of TCP SYN packets (one per thread?).

The actual issue is described in #3103.

We observed this issue when hundreds of fluent-bit agents tried to send logs via the forward output to a set of receiving fluent-bit instances, all of which were down due to a configuration error. The receiving fluent-bit instances were hosted behind an OpenStack load balancer, a Linux stateful firewall, and a Traefik ingress controller.

Apart from high load, the flood of SYN packets may exhaust the connection tracking table, impacting the whole network infrastructure.

Fixes #3103.

This PR is inspired by the gRPC backoff implementation.

Backoff is disabled by default.

If enabled, backoff limits the number of TCP SYN packets sent during an output destination outage (raw data):

[image: chart of the TCP SYN packet rate]

The chart shows the rate of TCP SYN packets. The data was collected with tcpdump as described in the How to test section below.
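
To illustrate why backoff lowers the SYN rate, here is a minimal sketch of a capped exponential backoff schedule. It is not the code from this PR: the doubling factor and the helper names next_backoff and backoff_schedule are assumptions made for illustration, and jitter is left out. The values 1 and 60 mirror net.backoff_init and net.backoff_max from the test run below.

```shell
#!/bin/sh
# Illustrative sketch only. The doubling factor is an assumption and the
# helper names are hypothetical; neither is taken from the PR.

# next_backoff CURRENT MAX: print the next delay in seconds, capped at MAX.
next_backoff() {
  nb_next=$(( $1 * 2 ))
  if [ "$nb_next" -gt "$2" ]; then
    nb_next=$2
  fi
  echo "$nb_next"
}

# backoff_schedule INIT MAX ATTEMPTS: print the delay used before each retry.
backoff_schedule() {
  bs_delay=$1
  bs_left=$3
  while [ "$bs_left" -gt 0 ]; do
    echo "$bs_delay"
    bs_delay=$(next_backoff "$bs_delay" "$2")
    bs_left=$(( bs_left - 1 ))
  done
}

# Delays for eight consecutive failed attempts with init=1 and max=60.
backoff_schedule 1 60 8
```

Under these assumptions the delay grows from 1 second to the 60 second cap after six failures, so a long outage produces roughly one connection attempt per minute per upstream instead of one per flush.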

Testing

An example backoff configuration is given below.
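
The same settings can probably also be expressed in a classic-mode configuration file. The sketch below writes one with a hypothetical helper, write_backoff_conf. The file name backoff.conf is an assumption, and the net.backoff_init and net.backoff_max keys exist only with this PR applied. The values are taken from the command line used in the How to test section.

```shell
#!/bin/sh
# Sketch only: net.backoff_init and net.backoff_max are the options proposed
# in this PR. The helper name and the file name are hypothetical.

# write_backoff_conf PATH: write a minimal dummy -> forward pipeline to PATH.
write_backoff_conf() {
  cat > "$1" <<'EOF'
[INPUT]
    Name  dummy
    Rate  1000000

[OUTPUT]
    Name              forward
    Match             *
    Host              127.0.0.1
    Port              24224
    Retry_Limit       1
    net.backoff_init  1
    net.backoff_max   60
EOF
}

write_backoff_conf backoff.conf
# Then start the agent with: bin/fluent-bit -c backoff.conf
```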

Valgrind output is uploaded to my gist.

Documentation

Documentation for this feature is submitted as docs PR491.

How to test

Compile this version:

cd build
cmake -DFLB_RELEASE=On \
          -DFLB_TRACE=On \
          -DFLB_JEMALLOC=On \
          -DFLB_TLS=On \
          -DFLB_SHARED_LIB=Off \
          -DFLB_EXAMPLES=Off \
          -DFLB_HTTP_SERVER=On \
          -DFLB_IN_SYSTEMD=On ../ \
&& make -j$(getconf _NPROCESSORS_ONLN)

Collect SYN packets without backoff

Simulate connection timeout and run tcpdump on a separate console:

iptables -I INPUT -p tcp --dport 24224 -j DROP
timeout 6m tcpdump -lnn -i any 'dst host 127.0.0.1 and dst port 24224 and tcp[tcpflags] == tcp-syn' | tee run5m_backoff0.tcpdump
# 1802 packets captured
iptables -D INPUT -p tcp --dport 24224 -j DROP

and start fluent-bit without backoff settings:

timeout -s SIGKILL 5m \
  bin/fluent-bit -vv \
    -i dummy -p 'rate=1000000' \
    -o forward://127.0.0.1:24224 -p 'retry_limit=1' \
  2>&1 | tee run5m_backoff0.log
# Fluent Bit v1.8.0
# ...
# [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
# Killed

Collect SYN packets with initial backoff of 1 second

Simulate connection timeout and run tcpdump on a separate console:

iptables -I INPUT -p tcp --dport 24224 -j DROP
timeout 6m tcpdump -lnn -i any 'dst host 127.0.0.1 and dst port 24224 and tcp[tcpflags] == tcp-syn' | tee run5m_backoff1.tcpdump
# 360 packets captured
iptables -D INPUT -p tcp --dport 24224 -j DROP

and start fluent-bit with backoff settings:

timeout -s SIGKILL 5m \
  bin/fluent-bit -vv \
    -i dummy -p 'rate=1000000' \
    -o forward://127.0.0.1:24224 -p 'retry_limit=1' -p 'net.backoff_init=1' -p 'net.backoff_max=60' \
  2>&1 | tee run5m_backoff1.log
# Fluent Bit v1.8.0
# ...
# [2021/03/08 18:27:31] [debug] [upstream] skipping connection to 127.0.0.1:24224 because of connection backoff for another 28 seconds
# [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
# Killed
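
The two capture files can be reduced to a per-minute SYN count, which is one way to reproduce the kind of data plotted in the chart. This is a sketch, not the script used for the PR. The helper name syn_per_minute is hypothetical, and it assumes tcpdump's default text output, where every packet line starts with an HH:MM:SS.ffffff timestamp.

```shell
#!/bin/sh
# Sketch only: syn_per_minute is a hypothetical helper, not part of the PR.

# syn_per_minute FILE: print "HH:MM count" for each minute in a tcpdump text capture.
syn_per_minute() {
  awk '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ { n[substr($1, 1, 5)]++ }
       END { for (m in n) print m, n[m] }' "$1" | sort
}

# Summarize whichever capture files from the runs above are present.
for f in run5m_backoff0.tcpdump run5m_backoff1.tcpdump; do
  if [ -f "$f" ]; then
    echo "== $f"
    syn_per_minute "$f"
  fi
done
```

Only lines that begin with a timestamp are counted, so any other text in the file is ignored.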

Fluent Bit is licensed under Apache 2.0. By submitting this pull request, I understand that this code will be released under the terms of that license.

Alexander Kabakaev, Daimler TSS GmbH

kabakaev, Mar 09 '21 08:03

Minor changes are requested.

@edsiper, thanks for the quick review! The suggested changes are implemented. PTAL

kabakaev, Apr 07 '21 12:04

@kabakaev can you pls fix the conflicts so we can do the final review/merge?

edsiper, Dec 12 '21 23:12

Hi @kabakaev, can you please review the requested changes? Thanks!

lecaros, Feb 11 '22 14:02

@kabakaev would you mind resolving the conflicts here?

eschabell, Oct 23 '25 13:10