fluentd
Out forward stuck establishing connection
Describe the bug
I encountered a critical issue with Fluentd where it gets stuck in an infinite loop while establishing connections between out_forward and in_forward servers. Our infrastructure has unstable connections, which may cause this bug to occur more frequently. This bug has led to Fluentd ceasing to send data (and heartbeats) even after the underlying network infrastructure came back online. Interestingly, killing the socket resolved the issue and caused the thread to continue sending data.

The simplest way I can describe the bug: fluentd suddenly stops sending any data to the remote server, with no errors in the logs, and if I check inside the container I can see an open socket in the ESTABLISHED state to the remote server.

After some digging in the code, I think I may have found the reason for this behaviour.
On every heartbeat/write, a new socket is created and `connect(dest_addr)` is called, performing the TLS handshake. After that, `establish_connection` is called to run the hello/ping/pong handshake protocol.

It seems that if the connection between fluentd and the destination server drops after the TLS handshake but before (or during) `establish_connection`, fluentd doesn't detect the dead socket or time it out. As far as I can tell, it gets stuck in this loop: the socket is still considered up, but it will never deliver any data.
```ruby
# in lib/fluent/plugin/out_forward.rb
def establish_connection(sock, ri)
  while ri.state != :established
    begin
      # TODO: On Ruby 2.2 or earlier, read_nonblock doesn't work expectedly.
      # We need rewrite around here using new socket/server plugin helper.
      buf = sock.read_nonblock(@sender.read_length)
      if buf.empty?
        sleep @sender.read_interval
        next
      end
      Fluent::MessagePackFactory.msgpack_unpacker.feed_each(buf) do |data|
        if @handshake.invoke(sock, ri, data) == :established
          @log.debug "connection established", host: @host, port: @port
        end
      end
    # (rescue/error handling omitted)
    end
  end
end
```
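To illustrate the failure mode I'm describing: `read_nonblock` on a socket whose peer has silently gone away doesn't return an error or time out on its own; it just raises `IO::WaitReadable` on every call, so a loop like the one above can spin forever. A minimal sketch (using a local `UNIXSocket.pair` to stand in for the real TLS connection):

```ruby
require "socket"

# A connected socket pair; the "remote" end never sends anything,
# mimicking a peer that dropped off the network after the TCP/TLS
# handshake completed.
local, remote = UNIXSocket.pair

attempts = 0
3.times do
  begin
    local.read_nonblock(32)
  rescue IO::WaitReadable
    # No data and no error: the socket still looks healthy, so a
    # handshake loop without its own deadline would just sleep and retry.
    attempts += 1
  end
end

puts attempts # => 3: every read said "try again"; none detected a dead peer
```

From the kernel's point of view the connection is still ESTABLISHED, which matches what I see inside the container.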
It's important to note that in our infrastructure the out_forward and in_forward servers don't connect directly to each other; there are a few components in between them. So if the overall connection drops, the socket doesn't necessarily drop with it, and we have to rely on options like timeouts.
It would seem that the socket options `SO_RCVTIMEO`/`SO_SNDTIMEO` should cause an exception to be raised, breaking the loop, but as far as I can tell those socket options are problematic in Ruby:
https://stackoverflow.com/questions/9853516/set-socket-timeout-in-ruby-via-so-rcvtimeo-socket-option
https://www.ruby-forum.com/t/how-to-set-socket-option-so-rcvtimeo/77758/8
https://stackoverflow.com/questions/57895907/ruby-2-5-set-timeout-on-tcp-sockets-read
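For what it's worth, the option itself can be set and read back fine; the problem reported in the threads above is that Ruby's own read methods use non-blocking I/O internally and so may never honour it. A sketch just showing the round-trip (the `"l_2"` pack format assumes a `struct timeval` of two native longs, which holds on common Linux/macOS platforms):

```ruby
require "socket"

sock, _peer = UNIXSocket.pair

# Encode a 5-second timeout as a struct timeval (seconds, microseconds).
timeval = [5, 0].pack("l_2")
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeval)

# The kernel accepts and stores the option...
secs, _usecs = sock.getsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO)
                   .data.unpack("l_2")
puts secs # => 5

# ...but, per the threads above, Ruby's read/read_nonblock may still not
# raise after 5 seconds, because the interpreter reads non-blockingly
# under the hood and the kernel timeout never gets a chance to fire.
```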
To Reproduce
I'm not sure how to consistently reproduce the issue; it seems to depend on when the connection is dropped and where `establish_connection` currently is.
Expected behavior
I expect the socket to time out if it fails to establish a connection after some time, perhaps using `IO.select`.
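One possible shape for such a timeout (a sketch only, not fluentd's actual code; `read_with_timeout` and the deadline values are hypothetical names/numbers for illustration): wrap the non-blocking read in `IO.select` with a deadline, and treat an expired deadline as a dead connection instead of retrying forever.

```ruby
require "socket"

# Hypothetical helper: read from `sock`, but give up if no data arrives
# within `timeout` seconds instead of looping indefinitely.
def read_with_timeout(sock, length, timeout)
  begin
    sock.read_nonblock(length)
  rescue IO::WaitReadable
    # IO.select returns nil when the timeout expires with nothing readable.
    ready = IO.select([sock], nil, nil, timeout)
    raise Errno::ETIMEDOUT, "no handshake data within #{timeout}s" if ready.nil?
    retry
  end
end

local, remote = UNIXSocket.pair

remote.write("pong")
puts read_with_timeout(local, 16, 1) # => "pong"

begin
  read_with_timeout(local, 16, 0.1)  # peer silent: raises instead of spinning
rescue Errno::ETIMEDOUT => e
  puts "timed out: #{e.message}"
end
```

Applied to `establish_connection`, this would bound how long the handshake loop can wait on a socket that is ESTABLISHED but never going to deliver data.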
Your Environment
- Fluentd version: 1.16.2
Your Configuration
```
<match **>
  @type forward
  transport tls
  tls_allow_self_signed_cert true
  tls_cert_path /opt/bitnami/fluentd/certs/tls/fluentd.crt
  require_ack_response true
  <server>
    host ****
    port 24224
  </server>
  <buffer>
    @type file
    path *****
    total_limit_size 4096MB
    chunk_limit_size 16MB
    flush_mode interval
    flush_interval 5s
    flush_thread_count 8
    retry_type periodic
    retry_wait 10s
    retry_forever true
    queued_chunks_limit_size 8
  </buffer>
</match>
```
Your Error Log
The logs don't contain too much information; fluentd mostly just stops printing any logs.
For the time being I cannot supply them because of PII; I'll sanitise and attach them once I can.
I have also turned on trace-level logging on my instances to capture later occurrences of the bug.
Additional context
No response