                        Forward could not accept new connection/tls unexpected EOF errors
Bug Report
Describe the bug
Fluent-bit produces a large number of TLS/connection errors in its logs when TLS is enabled on the forward input plugin.
The use case is one instance of fluent-bit running inside EC2 outputting logs to a receiver fluent-bit instance running inside a kube cluster to securely forward messages into graylog.
Observations:
- Turning TLS verification off - still errors
- Running in debug mode I get further error messages, e.g. [debug] [downstream] connection #51 failed and [debug] [socket] could not validate socket status for #52 (don't worry)
- Turning TLS off resolves the errors (predictably)
- Most messages are still making it to the receiver/server fluent-bit as far as I can see; I haven't yet identified whether messages are being lost or not
To Reproduce
Example log messages:
[2023/05/17 14:21:40] [error] [input:forward:forward.0] could not accept new connection
[2023/05/17 14:21:40] [error] [tls] error: unexpected EOF
[2023/05/17 14:21:40] [error] [input:forward:forward.0] could not accept new connection
[2023/05/17 14:21:41] [error] [tls] error: unexpected EOF
[2023/05/17 14:21:41] [error] [input:forward:forward.0] could not accept new connection
[2023/05/17 14:21:41] [error] [tls] error: unexpected EOF
Occasionally:
[2023/05/17 14:21:42] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=0] Success
[2023/05/17 14:21:42] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
- Steps to reproduce the problem:
- Run receiving fluent-bit in kube
- Pod certificate issued by cert-manager vault issuer
- Second external fluent-bit (e.g. in EC2) sending messages to the receiving fluent-bit with the vault ca_chain in config
Expected behavior
Fluent-bit doesn't spew TLS errors.
Your Environment
- Version used: 2.1 (also 2.0.6)
- Configuration: Fluent-bit helm chart running in kube. fluent-bit.conf (server/receiver side, filters removed for simplicity):
[SERVICE]
    Daemon off
    Flush 10
    Log_Level debug
    parsers_file custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On
    storage.path /var/logs/flb-logs
    storage.sync full
[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
    tls on
    tls.debug 4
    tls.verify on
    tls.crt_file /etc/tls/fluent-bit-ingress-tls/tls.crt
    tls.key_file /etc/tls/fluent-bit-ingress-tls/tls.key
    storage.type filesystem
[OUTPUT]
    Name                    gelf
    Match                   *
    Host                    ~URL omitted~
    Port                    12212
    Mode                    tls
    tls                     On
    tls.verify              Off
    tls.ca_file             /fluent-bit/etc/ca.crt
    tls.vhost               ~URL omitted~
    Gelf_Short_Message_Key  message
    Gelf_Host_Key           container_name
    storage.total_limit_size 256MB
fluent-bit.conf (client/sender side - filters removed for simplicity):
[SERVICE]
    parsers_file /fluent-bit/etc/parsers.conf
[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
[OUTPUT]
    Name stdout
    Format json_lines
    Match OUTPUT
[OUTPUT]
    Name forward
    Match OUTPUT
    Host ~URL omitted~
    Port 24224
    tls on
    tls.verify on
    tls.ca_file /etc/fluent-bit/ca.crt
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.23 (EKS)
- Server type and version:
- Operating System and version: Fedora CoreOS EC2 instance
- Filters and plugins: forward input/output, gelf output
Additional context
Did I understand you correctly that even though you are getting those errors data is flowing? If that's the case then I wonder if that's due to time slice shifting. Could you try these two things individually and then combined?
- Add threaded on to the input plugin (forward)
- Add workers 1 to the output plugin (gelf)
That should greatly alleviate the pressure on the main thread and could give us some valuable insight.
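For clarity, a minimal sketch of those two suggestions applied to the server-side config above (only the relevant keys are shown; everything else stays as posted):
[INPUT]
    name      forward
    listen    0.0.0.0
    port      24224
    threaded  on
    # remaining keys (tls, storage, ...) unchanged
[OUTPUT]
    Name      gelf
    Match     *
    workers   1
    # remaining keys unchanged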
Hi Leonardo,
Thanks for the prompt reply; apologies for the delay, I had a long weekend off!
So, I tried:
- Both threaded on and workers 1 set in the input and output plugins respectively
- Just threaded on set in the input plugin
- Just workers 1 set in the output plugin
I only tried this on the server/receiver side; I'm still experiencing the same errors.
In that case the only thing that comes to mind is using kubeshark to capture the traffic, which would let us know whether those connection attempts are being aborted by the remote host due to a delayed handshake attempt, or what exactly is going on.
If you decide to capture the traffic you can share those pcaps with me in private on Slack. I'll look at them, give you some feedback, and try to come up with the next step.
Was there any resolution on this? I'm seeing the same thing for fluent-bit running in a Nomad environment. We're presently running version 2.1.4. I turned off all outputs to minimize the configuration.
Here is the INPUT configuration:
[INPUT]
    Name           forward
    Listen         0.0.0.0
    port           24224
    threaded       on
    tls            on
    tls.debug      4
    tls.verify     off
    tls.ca_file    /fluent-bit/etc/ca.cert.pem
    tls.crt_file   /fluent-bit/etc/devl.cert.pem
    tls.key_file   /fluent-bit/etc/devl.key.pem
Here is a sample of the log output:
[2023/06/07 21:44:35] [error] [tls] error: unexpected EOF
[2023/06/07 21:44:35] [debug] [downstream] connection #55 failed
[2023/06/07 21:44:35] [error] [input:forward:forward.0] could not accept new connection
Disregard my issue. I found that I had my local Nomad logger logging to fluent-bit as well, and that does not support TLS. That was the source of my errors. Once I added a non-TLS port for that traffic, the errors cleared up.
I saw the same issue. It seems like fluent-bit's throughput simply drops when TLS is enabled on the forward input. As a result, the forward output opens many new connections for newer chunks because the existing connections are still in use, and the forward input then starts refusing new connections with the could not accept new connection error.
To prevent creating a large number of connections, set net.max_worker_connections to 20 or so on the forward input; the option was introduced in 2.1.6. Note that it might also cause no upstream connections available errors in the forward input.
https://docs.fluentbit.io/manual/administration/networking#max-connections-per-worker
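To make that concrete, a sketch of the receiving forward input with the limit set, following the comment above (the value 20 is illustrative and the certificate paths are placeholders):
[INPUT]
    name                        forward
    listen                      0.0.0.0
    port                        24224
    tls                         on
    tls.crt_file                /path/to/tls.crt
    tls.key_file                /path/to/tls.key
    net.max_worker_connections  20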
Running into similar issues on 2.1.8 with TLS. Using the default Docker container, with fluentd forwarding to fluent-bit.
Config
[SERVICE]
    log_level    debug
[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24002
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   6M
    tls on
    tls.verify off
    tls.crt_file /fluent-bit/etc/self_signed.crt
    tls.key_file /fluent-bit/etc/self_signed.key
# [OUTPUT]
#     Name stdout
#     Match *
[OUTPUT]
    Name        kafka
    Match       *
    Brokers     kafka-1:9091,kafka-2:9092,kafka-3:9093
    Topics      kubernetes-main-ingress
    Timestamp_Format iso8601
[2023/08/06 10:05:51] [debug] [out flush] cb_destroy coro_id=5
[2023/08/06 10:05:51] [debug] [task] destroy task=0x7fe6a2039aa0 (task_id=0)
[2023/08/06 10:05:52] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:53] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:05:55] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:56] [debug] [socket] could not validate socket status for #40 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #44 (don't worry)
[2023/08/06 10:06:02] [debug] [input chunk] update output instances with new chunk size diff=34495, records=28, input=forward.0
[2023/08/06 10:06:02] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:02] [debug] [task] created task=0x7fe6a2039780 id=0 OK
[2023/08/06 10:06:03] [debug] [socket] could not validate socket status for #44 (don't worry)
{"stream"=>"[2023/08/06 10:06:03] [debug] in produce_message
With log_level set to error:
Fluent Bit v2.1.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2023/08/06 10:29:47] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:47] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:48] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:48] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:51] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:51] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Related: https://github.com/fluent/fluent-bit/issues/6690
Any updates on this? This is killing my production environment performance :)
@ict-one-nl could you paste the fluentd config, mainly the match statement?
I asked around, is this what you were asking for?
<label @xxxxx>
  <match kubernetes.**>
    @type tag_normaliser
    @id flow:xxxxx:xxxxx:0
    format ${namespace_name}.${pod_name}.${container_name}
  </match>
  <filter **>
    @type parser
    @id flow:xxxx:xxxxx:1
    key_name message
    remove_key_name_field true
    reserve_data true
    <parse>
      @type json
    </parse>
  </filter>
  <match **>
    @type forward
    @id flow:xxxx:xxxx:output:xxxx:xxxxx-logging
    tls_allow_self_signed_cert true
    tls_insecure_mode true
    transport tls
    <buffer tag,time>
      @type file
      chunk_limit_size 8MB
      path /buffers/flow:xxx:xxx:output:xxx:xxxxxx.*.buffer
      retry_forever true
      timekey 10m
      timekey_wait 1m
    </buffer>
    <server>
      host xxxxxxxx.nl
      port 24002
    </server>
  </match>
</label>
Thanks @ict-one-nl, I'm wondering if the buffer size needs to be larger on the Fluent Bit side to match the 8MB chunk limit you have there. You may want to try lowering that on the Fluentd side as well.
I have tried the larger buffer size:
[SERVICE]
    log_level                       error
[INPUT]
    Name                            forward
    Listen                          0.0.0.0
    Port                            24002
    Buffer_Chunk_Size               8M
    Buffer_Max_Size                 128M
    tls                             on
    tls.verify                      off
    tls.crt_file                    /fluent-bit/etc/self_signed.crt
    tls.key_file                    /fluent-bit/etc/self_signed.key
[OUTPUT]
    Name                            kafka
    Match                           *
    Brokers                         kafka-1:9091,kafka-2:9092,kafka-3:9093
    Topics                          kubernetes-main-ingress
    Timestamp_Format                iso8601
# [OUTPUT]
#     Name                            stdout
#     Match                           *
[2023/08/23 14:52:31] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:34] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:34] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:36] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:36] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:38] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:38] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Same result. Will try the lower chunk size as well
FluentD config:
<label @3131968136fc14962a3d7a781ba6abe4>
   <match kubernetes.**>
     @type tag_normaliser
     @id flow:nginx-ingress:mxxx:0
     format "${namespace_name}.${pod_name}.${container_name}"
   </match>
   <filter **>
     @type parser
     @id flow:nginx-ingress:mxxx:1
     key_name "message"
     remove_key_name_field true
     reserve_data true
     <parse>
       @type "json"
     </parse>
   </filter>
   <match **>
     @type forward
     @id flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging
     tls_allow_self_signed_cert true
     tls_insecure_mode true
     transport tls
     <buffer tag,time>
       @type "file"
       chunk_limit_size 1MB
       path "/buffers/flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging.*.buffer"
       retry_forever true
       timekey 10m
       timekey_wait 1m
     </buffer>
     <server>
       host "xxxx"
       port 24002
     </server>
   </match>
</label>
Fluent-bit config:
[SERVICE]
    log_level                       error
[INPUT]
    Name                            forward
    Listen                          0.0.0.0
    Port                            24002
    Buffer_Chunk_Size               1M
    Buffer_Max_Size                 128M
    tls                             on
    tls.verify                      off
    tls.crt_file                    /fluent-bit/etc/self_signed.crt
    tls.key_file                    /fluent-bit/etc/self_signed.key
[OUTPUT]
    Name                            kafka
    Match                           *
    Brokers                         kafka-1:9091,kafka-2:9092,kafka-3:9093
    Topics                          kubernetes-main-ingress
    Timestamp_Format                iso8601
# [OUTPUT]
#     Name                            stdout
#     Match                           *
[2023/08/23 14:55:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:55:50] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:03] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:03] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:15] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:15] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Does seem to lower the error rate a bit, but no solution.
I observe the same issue with v2.1.8. How is your instance deployed? In my case it's directly on a VM (Ubuntu 20.04).
This is the default fluent-bit container hosted in Docker on RHEL8
In my case, the setup is Fluent-bit_1 (on external k8s; forward output plugin) -> Fluent-bit_2 (on an Azure VM; forward input plugin + Kafka output plugin) -> Kafka. Some (?) logs are flowing, although the Fluent-bit_2 instance shows repetitive errors:
[2023/09/11 09:06:50] [error] [/tmp/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/09/11 09:06:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
with occasional:
[2023/09/12 18:00:14] [error] [tls] error: unexpected EOF
[2023/09/12 18:00:14] [error] [input:forward:forward.1] could not accept new connection
Not much happening here I see... :( If I can provide any more info that could be useful to understand the TLS issue, please advise. Updating to 2.2.0 didn't help.
I'm sorry to say we have moved away from fluent-bit for most use cases, because of this and because solving it is taking quite long.
I'm sorry to say too. I deployed nginx servers as reverse proxies to terminate TLS instead. It has been very stable so far.
Solved in my case by lowering the net.keepalive_idle_timeout (in my case to 30 sec).
I assume that fluent-bit assumed the connections to be alive, while the server side had already discarded them.
Well, 30 is supposed to be the default, isn't it? https://docs.fluentbit.io/manual/administration/networking
My bad... 30 sec was the original timeout; I lowered it to 10 sec.
Anyway, I'm still not sure whether the error and net.keepalive_idle_timeout are related.
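For reference, a sketch of where that setting lives on the sending (forward output) side; the host is a placeholder and the 10-second value follows the comment above:
[OUTPUT]
    Name                        forward
    Match                       *
    Host                        receiver.example.com
    Port                        24224
    tls                         on
    net.keepalive               on
    net.keepalive_idle_timeout  10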
I faced the same issue recently. My fluent-bit pods were running behind a Kubernetes load balancer which was sending health probes. Those health probes were causing the "[error] [tls] error: unexpected EOF" errors. To fix this, I changed externalTrafficPolicy to Local and set healthCheckNodePort, which makes the Kubernetes LB send its health probes to a separate port. Refer to this for the configuration: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
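A minimal sketch of that Service change, assuming a LoadBalancer Service in front of the receiving fluent-bit (the name, port, selector, and node port are placeholders; healthCheckNodePort is auto-allocated if omitted):
apiVersion: v1
kind: Service
metadata:
  name: fluent-bit-forward
spec:
  type: LoadBalancer
  # Route only to nodes that host a fluent-bit pod and move the
  # LB health probes to a dedicated node port instead of 24224.
  externalTrafficPolicy: Local
  healthCheckNodePort: 30224
  selector:
    app: fluent-bit
  ports:
    - name: forward
      port: 24224
      targetPort: 24224
      protocol: TCP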
This is happening to us too. In logs I see:
[2024/01/17 12:39:26] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:39:26] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:52:20] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:52:20] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:54:22] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:54:22] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
We have connections from both fluent-bit and fluentd; however, I am not currently able to say which one this originates from.
Well, I have just noticed that a lot of data is in fact really missing.
If I add require_ack_response true on the fluentd side, the data starts flowing and the error disappears from the fluent-bit logs. However, CPU usage on the fluentd side rises a lot, probably because it does not get the ack response and has to resend the messages that were not accepted (just a guess). That suggests there really is something wrong on fluent-bit's side.
Could someone please look into this? It seems like quite a problem; in our case it affects all metrics from fluentd.
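For reference, a sketch of where that option goes in the fluentd forward output shown earlier (identifiers anonymized as in the original configs):
<match **>
  @type forward
  transport tls
  require_ack_response true
  # remaining buffer/tls options as in the original <match> block
  <server>
    host xxxx
    port 24002
  </server>
</match>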
We had to use the http plugins instead of forward for the fluentd -> fluent-bit communication, and that works. I would recommend the same, because one side or the other is doing something wrong, and given that two separate projects are involved and how long this issue has been open, it doesn't look like it will be solved soon. However, fluent-bit -> fluent-bit works in our case even with the forward plugin. In fluentd it is necessary to use this in the http plugin:
    <format>
      @type json
    </format>
    json_array true
and to prepend the logs with a filter similar to this:
  <filter **>
    @type record_transformer
    <record>
      tag my_server_pretag.${tag}
    </record>
  </filter>
Then on the fluent-bit side you just add to the config:
    tag_key           tag
Don't forget to set the endpoint address with https, which I overlooked at first and which is hard to debug.
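To tie those fragments together, a hedged sketch of the workaround with fluentd's out_http on the sending side and fluent-bit's http input on the receiving side (the endpoint, port, and certificate paths are placeholders; note the https scheme):
fluentd side:
<match **>
  @type http
  endpoint https://xxxx:24002
  json_array true
  <format>
    @type json
  </format>
</match>
fluent-bit side:
[INPUT]
    name          http
    listen        0.0.0.0
    port          24002
    tag_key       tag
    tls           on
    tls.crt_file  /path/to/tls.crt
    tls.key_file  /path/to/tls.key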
Regarding the Connection reset by peer errors above: we came across this error when playing around with the keep_alive settings. We purposefully increased it beyond the load balancer keep-alive, and I believe we got the error once we went past the 60-second LB timeout and it closed the connection.
As a slight update to the OP: we're still trying to clear the source of the error, but managed to clear a whole lot of them using ksniff (kubeshark was a bit too invasive for our taste). We then identified that our Prometheus tags/annotations for the fluent-bit server instance were misconfigured and Prometheus was trying to scrape that endpoint. That cleared a huge chunk of the errors for us but we're still trying to figure out the source of the few remaining entries.
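For anyone wanting to reproduce the capture step, a hedged example of a ksniff invocation against the receiving pod (the pod name, namespace, and port are placeholders, and the flags assume the kubectl-sniff plugin):
kubectl sniff fluent-bit-0 -n logging -f "port 24224" -o forward.pcap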
We've gone ahead and enabled metrics and have been monitoring our setup. We got some new insights:
- Sometimes we get bursts of traffic which lead to spikes in retries, sometimes leading to dropped chunks. This is probably due to the retries failing;
- TLS errors have a strong correlation with these burst windows;
- Adjusting the service scheduler.base and the output net.connect_timeout and Workers lessened the number of retries, and so far we've yet to spot any dropped messages; however, we're still seeing TLS errors in the same timeframe as these retries (a sketch of these settings follows below);
Based on the above it seems there's something amiss when the two fluent-bit instances end up terminating the connection and go for a retry. Any suggestions on what we could do next to try and help address the issue?
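For reference, a rough sketch of the tuning mentioned above, with all values purely illustrative:
[SERVICE]
    # base of the exponential backoff used between retries
    scheduler.base       5
[OUTPUT]
    Name                 gelf
    Match                *
    Workers              2
    net.connect_timeout  20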
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.