                        Forward could not accept new connection/tls unexpected EOF errors
Bug Report
Describe the bug
Fluent-bit produces a large number of TLS/connection errors in its logs when TLS is enabled on the forward input plugin.
The use case is one instance of fluent-bit running inside EC2 outputting logs to a receiver fluent-bit instance running inside a kube cluster to securely forward messages into graylog.
Observations:
- Turning TLS verification off - still errors
- Running in debug mode I get further error messages, e.g. [debug] [downstream] connection #51 failed and [debug] [socket] could not validate socket status for #52 (don't worry)
- Turning TLS off resolves the errors (predictably)
- Most messages are still making it to the receiver/server fluent-bit as far as I can see; I haven't yet identified whether messages are being lost or not
To Reproduce
Example log messages:
[2023/05/17 14:21:40] [error] [input:forward:forward.0] could not accept new connection
[2023/05/17 14:21:40] [error] [tls] error: unexpected EOF
[2023/05/17 14:21:40] [error] [input:forward:forward.0] could not accept new connection
[2023/05/17 14:21:41] [error] [tls] error: unexpected EOF
[2023/05/17 14:21:41] [error] [input:forward:forward.0] could not accept new connection
[2023/05/17 14:21:41] [error] [tls] error: unexpected EOF
Occasionally:
[2023/05/17 14:21:42] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=0] Success
[2023/05/17 14:21:42] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
- Steps to reproduce the problem:
- Run receiving fluent-bit in kube
- Pod certificate issued by cert-manager vault issuer
- Second external fluent-bit (e.g. in EC2) sending messages to the receiving fluent-bit with the vault ca_chain in config
Expected behavior
Fluent-bit doesn't spew TLS errors.
Your Environment
- Version used: 2.1 (also 2.0.6)
- Configuration: Fluent-bit helm chart running in kube. fluent-bit.conf (server/receiver side, filters removed for simplicity):
[SERVICE]
    Daemon off
    Flush 10
    Log_Level debug
    parsers_file custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On
    storage.path /var/logs/flb-logs
    storage.sync full
[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
    tls on
    tls.debug 4
    tls.verify on
    tls.crt_file /etc/tls/fluent-bit-ingress-tls/tls.crt
    tls.key_file /etc/tls/fluent-bit-ingress-tls/tls.key
    storage.type filesystem
[OUTPUT]
    Name                    gelf
    Match                   *
    Host                    ~URL omitted~
    Port                    12212
    Mode                    tls
    tls                     On
    tls.verify              Off
    tls.ca_file             /fluent-bit/etc/ca.crt
    tls.vhost               ~URL omitted~
    Gelf_Short_Message_Key  message
    Gelf_Host_Key           container_name
    storage.total_limit_size 256MB
fluent-bit.conf (client/sender side - filters removed for simplicity):
[SERVICE]
    parsers_file /fluent-bit/etc/parsers.conf
[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
[OUTPUT]
    Name stdout
    Format json_lines
    Match OUTPUT
[OUTPUT]
    Name forward
    Match OUTPUT
    Host ~URL omitted~
    Port 24224
    tls on
    tls.verify on
    tls.ca_file /etc/fluent-bit/ca.crt
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.23 (EKS)
- Server type and version:
- Operating System and version: Fedora CoreOS EC2 instance
- Filters and plugins: forward input/output, gelf output
Additional context
Did I understand you correctly that even though you are getting those errors data is flowing? If that's the case then I wonder if that's due to time slice shifting. Could you try these two things individually and then combined?
- Add threaded on to the input plugin (forward)
- Add workers 1 to the output plugin (gelf)
That should greatly alleviate the pressure on the main thread and could give us some valuable insight.
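For clarity, a minimal sketch of those two suggestions applied to the server-side config above (only the relevant keys are shown; everything else stays as posted):
[INPUT]
    name      forward
    listen    0.0.0.0
    port      24224
    threaded  on
    # remaining keys (tls, storage, ...) unchanged
[OUTPUT]
    Name      gelf
    Match     *
    workers   1
    # remaining keys unchanged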
Hi Leonardo,
Thanks for the prompt reply; apologies for the delay, I had a long weekend off!
So, I tried:
- Both threaded on and workers 1 set in the input and output plugins respectively
- Just threaded on set in the input plugin
- Just workers 1 set in the output plugin
I only tried this on the server/receiver side; I'm still experiencing the same errors.
In that case the only thing that comes to mind is using kubeshark to capture the traffic, which would let us know whether those connection attempts are being aborted by the remote host due to a delayed handshake attempt, or what exactly is going on.
If you decide to capture the traffic you can share those pcaps with me in private on Slack. I'll look at them, give you some feedback, and try to come up with the next step.
Was there any resolution on this? I'm seeing the same thing for fluent-bit running in a Nomad environment. We're presently running version 2.1.4. I turned off all outputs to minimize the configuration.
Here is the INPUT configuration:
[INPUT]
    Name           forward
    Listen         0.0.0.0
    port           24224
    threaded       on
    tls            on
    tls.debug      4
    tls.verify     off
    tls.ca_file    /fluent-bit/etc/ca.cert.pem
    tls.crt_file   /fluent-bit/etc/devl.cert.pem
    tls.key_file   /fluent-bit/etc/devl.key.pem
Here is a sample of the log output:
[2023/06/07 21:44:35] [error] [tls] error: unexpected EOF
[2023/06/07 21:44:35] [debug] [downstream] connection #55 failed
[2023/06/07 21:44:35] [error] [input:forward:forward.0] could not accept new connection
Disregard my issue. I found that I had my local Nomad logger logging to fluent-bit as well, and that does not support TLS. That was the source of my errors. Once I added a non-TLS port for that traffic, the errors cleared up.
I saw the same issue. It seems like fluent-bit's throughput simply drops when TLS is enabled on the forward input. As a result, the forward output opens many new connections for newer chunks because the existing connections are still in use, and the forward input then starts refusing new connections with the could not accept new connection error.
To prevent creating a large number of connections, set net.max_worker_connections to 20 or so on the forward input; the option was introduced in 2.1.6. Note that it might also cause no upstream connections available errors in the forward input.
https://docs.fluentbit.io/manual/administration/networking#max-connections-per-worker
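To make that concrete, a sketch of the receiving forward input with the limit set, following the comment above (the value 20 is illustrative and the certificate paths are placeholders):
[INPUT]
    name                        forward
    listen                      0.0.0.0
    port                        24224
    tls                         on
    tls.crt_file                /path/to/tls.crt
    tls.key_file                /path/to/tls.key
    net.max_worker_connections  20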
Running into similar issues on 2.1.8 with TLS. Using the default Docker container, with fluentd forwarding to fluent-bit.
Config
[SERVICE]
    log_level    debug
[INPUT]
    Name              forward
    Listen            0.0.0.0
    Port              24002
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   6M
    tls on
    tls.verify off
    tls.crt_file /fluent-bit/etc/self_signed.crt
    tls.key_file /fluent-bit/etc/self_signed.key
# [OUTPUT]
#     Name stdout
#     Match *
[OUTPUT]
    Name        kafka
    Match       *
    Brokers     kafka-1:9091,kafka-2:9092,kafka-3:9093
    Topics      kubernetes-main-ingress
    Timestamp_Format iso8601
[2023/08/06 10:05:51] [debug] [out flush] cb_destroy coro_id=5
[2023/08/06 10:05:51] [debug] [task] destroy task=0x7fe6a2039aa0 (task_id=0)
[2023/08/06 10:05:52] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:53] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:05:55] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:56] [debug] [socket] could not validate socket status for #40 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #44 (don't worry)
[2023/08/06 10:06:02] [debug] [input chunk] update output instances with new chunk size diff=34495, records=28, input=forward.0
[2023/08/06 10:06:02] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:02] [debug] [task] created task=0x7fe6a2039780 id=0 OK
[2023/08/06 10:06:03] [debug] [socket] could not validate socket status for #44 (don't worry)
{"stream"=>"[2023/08/06 10:06:03] [debug] in produce_message
With log_level set to error:
Fluent Bit v2.1.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2023/08/06 10:29:47] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:47] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:48] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:48] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:51] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:51] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Related: https://github.com/fluent/fluent-bit/issues/6690
Any updates on this? This is killing my production environment performance :)
@ict-one-nl could you paste the fluentd config, mainly the match statement?
I asked around, is this what you were asking for?
<label @xxxxx>
  <match kubernetes.**>
    @type tag_normaliser
    @id flow:xxxxx:xxxxx:0
    format ${namespace_name}.${pod_name}.${container_name}
  </match>
  <filter **>
    @type parser
    @id flow:xxxx:xxxxx:1
    key_name message
    remove_key_name_field true
    reserve_data true
    <parse>
      @type json
    </parse>
  </filter>
  <match **>
    @type forward
    @id flow:xxxx:xxxx:output:xxxx:xxxxx-logging
    tls_allow_self_signed_cert true
    tls_insecure_mode true
    transport tls
    <buffer tag,time>
      @type file
      chunk_limit_size 8MB
      path /buffers/flow:xxx:xxx:output:xxx:xxxxxx.*.buffer
      retry_forever true
      timekey 10m
      timekey_wait 1m
    </buffer>
    <server>
      host xxxxxxxx.nl
      port 24002
    </server>
  </match>
</label>
Thanks @ict-one-nl, I'm wondering if the buffer size needs to be larger on the Fluent Bit side to match the 8MB chunk limit you have there. You may want to try lowering that on the Fluentd side as well.
I have tried the larger buffer size:
[SERVICE]
    log_level                       error
[INPUT]
    Name                            forward
    Listen                          0.0.0.0
    Port                            24002
    Buffer_Chunk_Size               8M
    Buffer_Max_Size                 128M
    tls                             on
    tls.verify                      off
    tls.crt_file                    /fluent-bit/etc/self_signed.crt
    tls.key_file                    /fluent-bit/etc/self_signed.key
[OUTPUT]
    Name                            kafka
    Match                           *
    Brokers                         kafka-1:9091,kafka-2:9092,kafka-3:9093
    Topics                          kubernetes-main-ingress
    Timestamp_Format                iso8601
# [OUTPUT]
#     Name                            stdout
#     Match                           *
[2023/08/23 14:52:31] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:34] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:34] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:36] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:36] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:38] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:38] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Same result. Will try the lower chunk size as well
FluentD config:
<label @3131968136fc14962a3d7a781ba6abe4>
   <match kubernetes.**>
     @type tag_normaliser
     @id flow:nginx-ingress:mxxx:0
     format "${namespace_name}.${pod_name}.${container_name}"
   </match>
   <filter **>
     @type parser
     @id flow:nginx-ingress:mxxx:1
     key_name "message"
     remove_key_name_field true
     reserve_data true
     <parse>
       @type "json"
     </parse>
   </filter>
   <match **>
     @type forward
     @id flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging
     tls_allow_self_signed_cert true
     tls_insecure_mode true
     transport tls
     <buffer tag,time>
       @type "file"
       chunk_limit_size 1MB
       path "/buffers/flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging.*.buffer"
       retry_forever true
       timekey 10m
       timekey_wait 1m
     </buffer>
     <server>
       host "xxxx"
       port 24002
     </server>
   </match>
</label>
Fluent-bit config:
[SERVICE]
    log_level                       error
[INPUT]
    Name                            forward
    Listen                          0.0.0.0
    Port                            24002
    Buffer_Chunk_Size               1M
    Buffer_Max_Size                 128M
    tls                             on
    tls.verify                      off
    tls.crt_file                    /fluent-bit/etc/self_signed.crt
    tls.key_file                    /fluent-bit/etc/self_signed.key
[OUTPUT]
    Name                            kafka
    Match                           *
    Brokers                         kafka-1:9091,kafka-2:9092,kafka-3:9093
    Topics                          kubernetes-main-ingress
    Timestamp_Format                iso8601
# [OUTPUT]
#     Name                            stdout
#     Match                           *
[2023/08/23 14:55:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:55:50] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:03] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:03] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:15] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:15] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Does seem to lower the error rate a bit, but no solution.
I observe the same issue with v2.1.8. How is your instance deployed? In my case it's directly on a VM (Ubuntu 20.04).
This is the default fluent-bit container hosted in Docker on RHEL8
In my case, the setup is Fluent-bit_1 (on external k8s; forward output plugin) -> Fluent-bit_2 (on an Azure VM; forward input plugin + Kafka output plugin) -> Kafka. Some (?) logs are flowing, although the Fluent-bit_2 instance shows repetitive errors:
[2023/09/11 09:06:50] [error] [/tmp/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/09/11 09:06:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
with occasional:
[2023/09/12 18:00:14] [error] [tls] error: unexpected EOF
[2023/09/12 18:00:14] [error] [input:forward:forward.1] could not accept new connection
Not much happening here I see... :( If I can provide any more info that could be useful to understand the TLS issue, please advise. Updating to 2.2.0 didn't help.
I'm sorry to say we have moved away from fluent-bit for most use cases, because of this and because solving it is taking quite long.
I'm sorry to say too. I deployed nginx servers as reverse proxies to terminate TLS instead. It has been very stable so far.
Solved in my case by lowering the net.keepalive_idle_timeout (in my case to 30 sec).
I assume that fluent-bit assumed the connections to be alive, while the server side had already discarded them.
Well, 30 is supposed to be the default, isn't it? https://docs.fluentbit.io/manual/administration/networking
My bad... 30 sec was the original timeout; I lowered it to 10 sec.
Anyway, I'm still not sure whether the error and net.keepalive_idle_timeout are related.
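For reference, a sketch of where that setting lives on the sending (forward output) side; the host is a placeholder and the 10-second value follows the comment above:
[OUTPUT]
    Name                        forward
    Match                       *
    Host                        receiver.example.com
    Port                        24224
    tls                         on
    net.keepalive               on
    net.keepalive_idle_timeout  10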
I faced the same issue recently. My fluent-bit pods were running behind a Kubernetes load balancer which was sending health probes. Those health probes were causing the "[error] [tls] error: unexpected EOF" errors. To fix this, I changed externalTrafficPolicy to Local and set healthCheckNodePort, which makes the Kubernetes LB send its health probes to a separate port. Refer to this for the configuration: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
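A minimal sketch of that Service change, assuming a LoadBalancer Service in front of the receiving fluent-bit (the name, port, selector, and node port are placeholders; healthCheckNodePort is auto-allocated if omitted):
apiVersion: v1
kind: Service
metadata:
  name: fluent-bit-forward
spec:
  type: LoadBalancer
  # Route only to nodes that host a fluent-bit pod and move the
  # LB health probes to a dedicated node port instead of 24224.
  externalTrafficPolicy: Local
  healthCheckNodePort: 30224
  selector:
    app: fluent-bit
  ports:
    - name: forward
      port: 24224
      targetPort: 24224
      protocol: TCP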
This is happening to us too. In logs I see:
[2024/01/17 12:39:26] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:39:26] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:52:20] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:52:20] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:54:22] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:54:22] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
We have connections from both fluent-bit and fluentd; however, I am not currently able to say which one this originates from.
Well, I have just noticed that a lot of data is in fact really missing.
If I add require_ack_response true on the fluentd side, the data starts flowing and the error disappears from the fluent-bit logs. However, CPU usage on the fluentd side rises a lot, probably because it does not get the ack response and has to resend the messages that were not accepted (just a guess). That suggests there really is something wrong on fluent-bit's side.
Could someone please look into this? It seems like quite a problem; in our case it affects all metrics from fluentd.
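For reference, a sketch of where that option goes in the fluentd forward output shown earlier (identifiers anonymized as in the original configs):
<match **>
  @type forward
  transport tls
  require_ack_response true
  # remaining buffer/tls options as in the original <match> block
  <server>
    host xxxx
    port 24002
  </server>
</match>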
We had to use the http plugins instead of forward for the fluentd -> fluent-bit communication, and that works. I would recommend the same, because one side or the other is doing something wrong, and given that two separate projects are involved and how long this issue has been open, it doesn't look like it will be solved soon. However, fluent-bit -> fluent-bit works in our case even with the forward plugin. In fluentd it is necessary to use this in the http plugin:
    <format>
      @type json
    </format>
    json_array true
and to prepend the logs with a filter similar to this:
  <filter **>
    @type record_transformer
    <record>
      tag my_server_pretag.${tag}
    </record>
  </filter>
Then on the fluent-bit side you just add to the config:
    tag_key           tag
Don't forget to set the endpoint address with https, which I overlooked at first and which is hard to debug.
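To tie those fragments together, a hedged sketch of the workaround with fluentd's out_http on the sending side and fluent-bit's http input on the receiving side (the endpoint, port, and certificate paths are placeholders; note the https scheme):
fluentd side:
<match **>
  @type http
  endpoint https://xxxx:24002
  json_array true
  <format>
    @type json
  </format>
</match>
fluent-bit side:
[INPUT]
    name          http
    listen        0.0.0.0
    port          24002
    tag_key       tag
    tls           on
    tls.crt_file  /path/to/tls.crt
    tls.key_file  /path/to/tls.key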
Regarding the Connection reset by peer errors above: we came across this error when playing around with the keep_alive settings. We purposefully increased it beyond the load balancer keep-alive, and I believe we got the error once we went past the 60-second LB timeout and it closed the connection.
As a slight update to the OP: we're still trying to clear the source of the error, but managed to clear a whole lot of them using ksniff (kubeshark was a bit too invasive for our taste). We then identified that our Prometheus tags/annotations for the fluent-bit server instance were misconfigured and Prometheus was trying to scrape that endpoint. That cleared a huge chunk of the errors for us but we're still trying to figure out the source of the few remaining entries.
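For anyone wanting to reproduce the capture step, a hedged example of a ksniff invocation against the receiving pod (the pod name, namespace, and port are placeholders, and the flags assume the kubectl-sniff plugin):
kubectl sniff fluent-bit-0 -n logging -f "port 24224" -o forward.pcap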
We've gone ahead and enabled metrics and have been monitoring our setup. We got some new insights:
- Sometimes we get bursts of traffic which lead to spikes in retries, sometimes leading to dropped chunks. This is probably due to the retries failing;
- TLS errors have a strong correlation with these burst windows;
- Adjusting the service scheduler.base and the output net.connect_timeout and Workers lessened the number of retries, and so far we've yet to spot any dropped messages; however, we're still seeing TLS errors in the same timeframe as these retries (a sketch of these settings follows below);
Based on the above it seems there's something amiss when the two fluent-bit instances end up terminating the connection and go for a retry. Any suggestions on what we could do next to try and help address the issue?
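For reference, a rough sketch of the tuning mentioned above, with all values purely illustrative:
[SERVICE]
    # base of the exponential backoff used between retries
    scheduler.base       5
[OUTPUT]
    Name                 gelf
    Match                *
    Workers              2
    net.connect_timeout  20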
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.