
Out forward stuck establishing connection

Open waffleisreal1339 opened this issue 4 months ago • 8 comments

Describe the bug

I encountered a critical issue with Fluentd where it gets stuck in an infinite loop while establishing connections between out_forward and in_forward servers. Our infrastructure has unstable connections, which likely makes this bug occur more frequently for us. The bug leads to Fluentd ceasing to send data (and heartbeats) even after the underlying network infrastructure is back online. Interestingly, killing the socket resolved the issue and caused the thread to resume sending data.

The simplest way I can describe the bug: Fluentd suddenly stops sending any data to the remote server, there are no errors in the logs, and if I check inside the container I can see an open socket to the remote server in the "ESTABLISHED" state.

After some digging in the code I think I might have found the reason for this behaviour. On every heartbeat/write, a new socket is created, connect(dest_addr) is called, and the TLS handshake is performed. After that, establish_connection is called to perform the HELO/PING/PONG handshake. It seems that if the connection between Fluentd and the destination server drops after the TLS handshake, by the time we reach establish_connection Fluentd doesn't detect the failure or time the socket out, and as far as I can tell it gets stuck in this loop, where the socket is still considered up but will never deliver any data.

    # in lib/fluent/plugin/out_forward.rb
    def establish_connection(sock, ri)
      while ri.state != :established
        begin
          # TODO: On Ruby 2.2 or earlier, read_nonblock doesn't work expectedly.
          # We need rewrite around here using new socket/server plugin helper.
          buf = sock.read_nonblock(@sender.read_length)
          if buf.empty?
            sleep @sender.read_interval
            next
          end
          Fluent::MessagePackFactory.msgpack_unpacker.feed_each(buf) do |data|
            if @handshake.invoke(sock, ri, data) == :established
              @log.debug "connection established", host: @host, port: @port
            end
          end
        rescue IO::WaitReadable
          # EWOULDBLOCK/EAGAIN are extended by IO::WaitReadable: no data yet,
          # so the loop just sleeps and retries; nothing here imposes a deadline.
          sleep @sender.read_interval unless ri.state == :established
        # (rescue clauses for SystemCallError / EOFError omitted)
        end
      end
    end

It's important to note that in our infrastructure the out_forward and in_forward servers don't connect directly to each other; there are a few components in between, so if the overall connection drops it doesn't necessarily mean the socket will drop. That's why we have to rely on options like timeouts.

It would seem that the SO_RCVTIMEO/SO_SNDTIMEO socket options should cause an exception to be raised, breaking the loop, but as far as I can tell those options are problematic in Ruby. From the first thread below: "Based on my testing, and Jesse Storimer's excellent ebook on 'Working with TCP Sockets' (in Ruby), the timeout socket options do not work in Ruby 1.9 (and, I presume 2.0 and 2.1)."

- https://stackoverflow.com/questions/9853516/set-socket-timeout-in-ruby-via-so-rcvtimeo-socket-option
- https://www.ruby-forum.com/t/how-to-set-socket-option-so-rcvtimeo/77758/8
- https://stackoverflow.com/questions/57895907/ruby-2-5-set-timeout-on-tcp-sockets-read
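For illustration, here is a minimal local sketch of that behaviour (nothing from our setup; the throwaway silent server just stands in for the dropped peer). MRI services reads itself and retries on EAGAIN, so the kernel timeout set below never surfaces, consistent with the threads above:

    require "socket"

    # Stand-in for a peer that completes the TCP handshake but never replies.
    server = TCPServer.new("127.0.0.1", 0)
    Thread.new { server.accept; sleep }

    sock = TCPSocket.new("127.0.0.1", server.addr[1])

    # Ask the kernel for a 2-second receive timeout (struct timeval).
    sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, [2, 0].pack("l_2"))

    # Despite the option, this blocks indefinitely: MRI swallows the EAGAIN,
    # waits for the fd to become readable, and retries the read.
    sock.readpartial(1024)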

To Reproduce

I'm not sure how to reproduce the issue consistently; it seems to depend on when the connection drops and where establish_connection is at that moment.
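That said, the stuck state itself can be simulated locally without any network flakiness, by pointing the same read-and-sleep pattern at a peer that accepts the connection but never answers the handshake. A minimal sketch over plain TCP (sizes and intervals are made up):

    require "socket"

    # A peer that accepts the connection but never sends the handshake reply,
    # like a proxy whose upstream went away mid-handshake.
    server = TCPServer.new("127.0.0.1", 0)
    Thread.new { server.accept; sleep }

    sock = TCPSocket.new("127.0.0.1", server.addr[1])

    # Same shape as the establish_connection loop quoted above: no data ever
    # arrives, read_nonblock keeps raising IO::WaitReadable, and the loop
    # sleeps and retries forever while the socket stays ESTABLISHED.
    loop do
      begin
        sock.read_nonblock(512) # would be fed to the msgpack unpacker
        break
      rescue IO::WaitReadable
        sleep 0.5
      end
    end

This loop spins forever while ss/netstat keeps reporting the socket as ESTABLISHED, which is the same picture we see in the container.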

Expected behavior

I expect the socket to time out if it fails to establish the connection within some bounded time, perhaps by using IO.select.
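As a sketch of the shape such a fix could take (not the actual plugin code; HANDSHAKE_TIMEOUT is a made-up knob, and the silent peer from the sketch above stands in for the remote): IO.select returns nil when nothing becomes readable within the given interval, so the loop can enforce a deadline instead of sleeping unconditionally.

    require "socket"

    HANDSHAKE_TIMEOUT = 5 # seconds; hypothetical value for illustration

    # Silent peer again: accepts the connection, never replies.
    server = TCPServer.new("127.0.0.1", 0)
    Thread.new { server.accept; sleep }
    sock = TCPSocket.new("127.0.0.1", server.addr[1])

    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + HANDSHAKE_TIMEOUT
    buf = nil
    until buf
      remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
      raise "handshake timed out after #{HANDSHAKE_TIMEOUT}s" if remaining <= 0
      # Block until readable, or until the remaining budget is spent (nil).
      next unless IO.select([sock], nil, nil, remaining)
      begin
        buf = sock.read_nonblock(512) # would be fed to the msgpack unpacker
      rescue IO::WaitReadable
        # A TLS socket can select as readable yet still need more bytes;
        # loop back and select again with the reduced budget.
      end
    end

Run against the silent peer, this raises after roughly five seconds instead of spinning forever; the real fix would presumably surface that as a connection failure so the normal retry machinery kicks in.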

Your Environment

- Fluentd version: 1.16.2

Your Configuration

    <match **>
      @type forward
      transport tls
      tls_allow_self_signed_cert true
      tls_cert_path /opt/bitnami/fluentd/certs/tls/fluentd.crt
      require_ack_response true
      <server>
        host ****
        port 24224
      </server>
      <buffer>
        @type file
        path *****
        total_limit_size 4096MB
        chunk_limit_size 16MB
        flush_mode interval
        flush_interval 5s
        flush_thread_count 8
        retry_type periodic
        retry_wait 10s
        retry_forever true
        queued_chunks_limit_size 8
      </buffer>
    </match>

Your Error Log

The logs don't contain much information; Fluentd mostly just stops printing any logs at all.
For the time being I can't supply them because they contain PII; I'll sanitise and attach them once I can.
I've also turned on trace-level logging on my instances to capture future occurrences of the bug.

Additional context

No response

waffleisreal1339 · Feb 06 '24 15:02