Fluentbit is stuck, flb chunks are not flushed and not sending data to ElasticSearch
Bug Report
Describe the bug Fluent Bit sends data to Elasticsearch and works fine for some time, then it gets stuck and stops sending data to Elasticsearch. On checking the Fluent Bit pod logs, I briefly see the three issues below (full logs attached below):
- [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
- [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
- [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
- (even though the file '/var/log/kube-apiserver-audit.log' exists inside the pod)
I referred to similar issues, still no luck:
- As per https://github.com/fluent/fluent-bit/issues/3581#issuecomment-1081107592 I also tried the 1.9.1-debug version, still seeing the same results
- https://github.com/fluent/fluent-bit/issues/4336
To Reproduce
- Start Fluent Bit as a DaemonSet on Kubernetes and add node affinity to run only on the master node
- Use the tail plugin to collect API server audit logs (generated on path /var/log/kube-apiserver-audit.log) - please refer to the configs in the Configuration section below
After that, I see the logs below frequently in the Fluent Bit pod:
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351188.130831321.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351188.130831321.flb
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecd7478 191 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb0b48 104 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02588 85 in the next 7 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02498 64 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed022e0 30 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1eccae08 147 in the next 9 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb6c28 28 in the next 9 seconds
....
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scanning path /var/log/kube-apiserver-audit.log
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/kube-apiserver-audit.log, inode 1882
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] inode=1882 events: IN_MODIFY
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648635501.574187623.flb mapped OK
[2022/03/30 10:18:38] [debug] [storage] [cio file] alloc_size from 1114112 to 1245184
[2022/03/30 10:18:38] [debug] [storage] [cio file] synced at: tail.0/1-1648635501.574187623.flb
[2022/03/30 10:18:38] [debug] [input chunk] update output instances with new chunk size diff=128565
[2022/03/30 10:18:38] [debug] [retry] new retry created for task_id=517 attempts=1
[2022/03/30 10:18:38] [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351224.176425521.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351224.176425521.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351264.88207341.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351264.88207341.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351301.408233326.flb mapped OK
On checking further inside the Fluent Bit pod, I see chunks are not cleared and are filling up the disk:
root@fluent-bit-audit-drbxj:/var/log# du -sh flb-storage/tail.0/
26G flb-storage/tail.0/ <------
root@fluent-bit-audit-qbkw8:/var/log# ls -la kube-apiserver-audit*
-rw------- 1 root root 104857010 Mar 31 08:19 kube-apiserver-audit-2022-03-31T08-19-39.831.log
-rw------- 1 root root 95179025 Mar 31 08:46 kube-apiserver-audit.log
Expected behavior
- Keep sending data to Elasticsearch, should not be stuck
- Should be able to flush chunk files (*.flb files) regularly (in my case the directory is /var/log/flb-storage/tail.0/)
Your Environment
- Version used: 1.8.15 and 1.9.1-debug (tested on both)
- Configuration (Fluent Bit configuration):
# Configuration files: server, input, filters and output
# ======================================================
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level debug
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 50M
@INCLUDE input-audit.conf
@INCLUDE output-elasticsearch.conf
input-audit.conf: |
[INPUT]
Name tail
Tag kube-audit.*
Path /var/log/kube-apiserver-audit.log
Parser json
DB /var/log/flb_kube_audit.db
storage.type filesystem
Mem_Buf_Limit 800MB
Skip_Long_Lines On
Refresh_Interval 10
Buffer_Chunk_Size 4096k
Buffer_Max_Size 9000k
output-elasticsearch.conf: |
[OUTPUT]
Name es
Match kube-audit.*
Host ${KUBE_ELASTICSEARCH_HOST}
Port ${KUBE_ELASTICSEARCH_PORT}
Index kube-audit-dev
Buffer_Size False
Generate_ID on
AWS_Auth On
AWS_Region us-east-1
Replace_Dots On
Retry_Limit False
tls On
Trace_Error On
Logstash_Format On
Logstash_Prefix kube-audit-dev
parsers.conf: |
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.4
- Server type and version: Running as a docker image on the Kubernetes master node (using docker image: 1.9.1-debug)
- Operating System and version: Linux
- Filters and plugins: tail, es
Additional context
- I am running a separate copy of Fluent Bit as a DaemonSet (which runs only on the master node) to collect API server audit logs (generated on path /var/log/kube-apiserver-audit.log)
- Initially, it works fine. After some time, chunks pile up (failed to flush), I see retry messages in the log, and Fluent Bit stops sending data to Elasticsearch.
@lecaros @edsiper
I need your input on this. Let me know if I am missing anything or have misconfigured something in the Fluent Bit configuration.
Same issue with fluent-bit:1.9.2.
Hi @joshisumit, do you see the input records metric still going up? (i.e. it continues reading events) With debug log_level enabled, we should see more details on what's happening with your chunks. Can you share the full log or its tail (while it's not sending any more events to es) at least? Thanks
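For reference, since the config above enables the built-in HTTP server on port 2020, a quick sketch for checking this from inside the pod is the monitoring API (the storage endpoint may additionally require storage.metrics on in the [SERVICE] section):
# per-plugin record counters (input/output records, errors, retries)
curl -s http://127.0.0.1:2020/api/v1/metrics
# chunk counters for the filesystem buffer, if storage.metrics is enabled
curl -s http://127.0.0.1:2020/api/v1/storage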
@lecaros I think I had network issues, but now it's working without any issues.
Thanks @dyipon for confirming. Can you @joshisumit confirm you still have this issue?
I am still seeing this issue with forward output and disk buffering with fluentbit 1.9.3.
[2022/05/04 19:15:58] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:02] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:08] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:14] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:20] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 7 seconds
[2022/05/04 19:16:27] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:31] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:37] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:41] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:47] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:51] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 7 seconds
A poor man's workaround might be to delete stuck chunks, e.g. periodically delete everything not modified in the last hour:
find /var/log/flb-storage -type f -name "*.flb" -mmin +60 | xargs rm -f || true
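For example, as an hourly cron entry on the node (a sketch; the path matches the storage.path used above, adjust the age threshold as needed):
# crontab on the node running fluent-bit: purge backlog chunks untouched for 60+ minutes
0 * * * * find /var/log/flb-storage -type f -name "*.flb" -mmin +60 -delete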
@fpytloun do you have a repro for this issue? The messages you posted don't necessarily mean you are having an issue.
Thanks
Unfortunately, I was not able to find out what's causing this. I just know that it's a persistent issue (it repeats even after a restart until the disk buffer is wiped), so I assumed it could be e.g. some garbage inside the buffered messages?
With memory buffering and without retries, the affected event is simply dropped and this issue doesn't occur.
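For comparison, a minimal sketch of the two buffering modes being contrasted here (the input, path and limit are illustrative):
[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    # memory-only buffering: on sustained backpressure events are paused/dropped,
    # but no on-disk backlog accumulates
    storage.type  memory
    Mem_Buf_Limit 50MB
[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    # filesystem buffering: chunks are persisted under storage.path and retried,
    # which is where the stuck *.flb backlog builds up
    storage.type  filesystem
    Mem_Buf_Limit 50MB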
Hi everyone, in order to address this, we need a reproducible scenario. If any of you have repro steps and can answer other questions posted to troubleshoot this, please provide them and we'll gladly assist.
Hi, I got the same issue and resolved the problem by deleting the *.flb files.
To explain my situation in detail, I was using aws-for-fluent-bit version 2.10.0 (fluent-bit 1.6.x) and crashes continued to occur, so I upgraded to 2.25.1 (fluent-bit 1.9.3). I wanted to use the stable version of aws-for-fluent-bit, but on the stable version an error such as "invalid backlog" continued to occur, so I upgraded to the 1.9.x version, which can skip that error.
My fluent-bit situation is like the picture below. The pod with the error has unusually high memory usage (6xx ~ 8xx).
The number of flb files is also unusually high (hundreds of thousands ~ millions).
My guess is that so many flb files could not be flushed because of the crashes on 1.6.x. Therefore, it seems to be a problem that occurs when there are many flb files.
I changed my configuration a little to switch from forward to http output and am still facing this issue when enabling the storage buffer and retries. It seems to happen every time about 10 minutes after starting Fluent Bit. No error in the logs or any other clue. The Fluent Bit healthcheck still thinks it's healthy.
This is a pretty serious issue and I have no idea why everyone is not hitting it. I can provide the oldest flb file (so the one that might have caused the deadlock).
Here's full config that I am running: https://pastebin.com/3JSwBK2k
I found out that my Fluent Bit is sending events with non-Unicode characters (e.g. UTF-16/UTF-32), which are logged on Vector's HTTP sink (Fluent Bit's upstream) as JSON payload parsing errors.
So I was wondering whether these garbled characters might cause the deadlock. I tried to ensure these characters are not present, and so far it seems the issue went away.
Besides handling this on the log source side, I also tried this Lua filter to escape non-ASCII characters:
-- Recursively apply gsub(source, target) to every string value in a table
function deep_replace(obj, source, target)
    if obj then
        if type(obj) == "table" then
            local result = {}
            for k, v in pairs(obj) do
                result[k] = deep_replace(v, source, target)
            end
            return result
        elseif type(obj) == "string" then
            return obj:gsub(source, target)
        else
            return obj
        end
    end
    return obj
end

-- Fluent Bit Lua filter callback: escape bytes outside the ASCII range
function escape_non_ascii(tag, timestamp, record)
    local new_record = deep_replace(record, "([\128-\255])", "\\\\%1")
    return 1, timestamp, new_record
end
[FILTER]
Name lua
Match *
Script /fluent-bit/scripts/filters.lua
Call escape_non_ascii
UPDATE: OK, maybe I was too optimistic. I found one instance hitting backpressure without an obvious reason:
rewrite_kube (emitter)
│
├─ status
│ └─ overlimit : yes
│ ├─ mem size : 56.2M (58912182 bytes)
│ └─ mem limit : 9.5M (10000000 bytes)
│
├─ tasks
│ ├─ total tasks : 41
│ ├─ new : 0
│ ├─ running : 41
│ └─ size : 56.2M (58912182 bytes)
│
└─ chunks
└─ total chunks : 510
├─ up chunks : 42
├─ down chunks: 468
└─ busy chunks: 41
├─ size : 56.2M (58912182 bytes)
└─ size err: 0
UPDATE 2: over a few hours, all instances got deadlocked again. No error in the logs except that the total storage size (1GB) of the http output plugin has been reached:
[error] [input chunk] chunk 7-1661314436.598970034.flb would exceed total limit size in plugin http.0
[error] [input chunk] no available chunk
Storage is full of chunks from the rewrite_tag emitter.
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale
@lecaros
Hi everyone, in order to address this, we need a reproducible scenario. If any of you have repro steps and can answer other questions posted to troubleshoot this, please provide them and we'll gladly assist.
One of the problems I have with this is that it occurs only when Elastic has issues, and we have not been able to figure out how to force the issue. It can run for days or weeks without the problem appearing, and then when Elastic has issues that cause rejections, this occurs. It's difficult to run fluent-bit with debug logging enabled for the length of time necessary to reproduce this. It does appear to occur when Elastic prevents ingestion for some significant period of time (at least a few hours). Is there anything you can tell us about how fluent-bit handles retrying flb chunks? It's interesting that the workaround above of deleting older flb files seems to unblock things. Is there anything from that that would help us gather more targeted debugging?
I'm having a hard time collecting the logs when this occurs with debug on. The issue is, when the error condition occurs, our fluent-bit logs are no longer being sent to Elastic, and the sheer volume of debug logs causes our k8s nodes to rotate the logs more quickly than I can pull them. I'm sure they're in the flb files, but I don't know how to extract them.
The scenario that causes the problem for us is that something causes Elasticsearch to start rejecting requests - we have seen two conditions that cause this:
- excessive load resulting in elastic sending back 429s to requests
- elastic ran out of disk space
When these occur, fluent-bit stops sending logs to Elastic but continues to read from tail into new flb files.
If we restart fluent-bit, it attempts to reprocess the flb files, but if there are multiple nodes, restarting all of the fluent-bit pods will very quickly overwhelm Elastic and it will start to throttle with 429s.
This results in a high retry/failure rate. We have our retry limit set to 10. If the limit is reached, what happens to the flb files? Do they get removed for that transaction? If not, that might be part of our issue. I've discovered that if I rotate each fluent-bit pod ONE at a time, Elastic does not get overloaded and is able to work through the backlog of flb files. But that's a time-consuming process, of course.
I hit the same error, but I use Loki as the backend: https://github.com/fluent/fluent-bit/issues/6670
I have the same issue with fluentd as a backend (using logging-operator)
Hello @awangptc,
If a limit is reached what happens to the flb files?
flb files are no longer retried and are deleted from the storage.
Are you all using the latest image? We've recently (last month) merged some fixes related to irrecoverable chunks handling.
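If unbounded disk growth is the main pain point in the meantime, it may also help to cap the filesystem backlog per output; a hedged sketch based on the es output above (the limit and retry values are illustrative):
[OUTPUT]
    Name                     es
    Match                    kube-audit.*
    # stop enqueueing new chunks for this output once its backlog reaches ~1G
    storage.total_limit_size 1G
    # finite retry count so exhausted chunks are eventually discarded instead of retried forever
    Retry_Limit              10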
@lecaros, I tried 2.0.5 when we first ran into the problem, in the hope that it would work better, and it was incredibly unstable:
#5 0x557ee21c1b16 in flb_malloc() at include/fluent-bit/flb_mem.h:79
#6 0x557ee291afd8 in ares_init_options() at lib/c-ares-1.18.1/src/lib/ares_init.c:114
#7 0x557ee21c66b2 in flb_net_dns_lookup_context_create() at src/flb_network.c:969
#8 0x557ee21c690a in flb_net_getaddrinfo() at src/flb_network.c:1035
#9 0x557ee21c6eef in flb_net_tcp_connect() at src/flb_network.c:1231
#10 0x557ee22758f1 in flb_io_net_connect() at src/flb_io.c:130
#11 0x557ee21f1206 in create_conn() at src/flb_upstream.c:533
#12 0x557ee21f1714 in flb_upstream_conn_get() at src/flb_upstream.c:682
#13 0x557ee247790c in cb_es_flush() at plugins/out_es/es.c:808
#14 0x557ee21b6da0 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:519
#15 0x557ee2be6e46 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
free(): corrupted unsorted chunks
[2022/11/18 18:36:26] [engine] caught signal (SIGSEGV)
#0 0x7f0d0e9a7602 in ???() at ???:0
#1 0x7f0d0ea00767 in ???() at ???:0
#2 0x7f0d0ea07a59 in ???() at ???:0
#3 0x7f0d0ea09121 in ???() at ???:0
#4 0x560ea4114175 in flb_free() at include/fluent-bit/flb_mem.h:120
#5 0x560ea411540a in flb_sds_destroy() at src/flb_sds.c:470
#6 0x560ea43f8b4f in elasticsearch_format() at plugins/out_es/es.c:579
#7 0x560ea43f999e in cb_es_flush() at plugins/out_es/es.c:814
#8 0x560ea4138da0 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:519
#9 0x560ea4b68e46 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Unfortunately, I haven't been able to revisit it. The changelog shows a bunch of improvements in memory-related areas. Do you think issues like these have been addressed? If so, I'll take a closer look at the latest 2.0 builds and see if it makes a difference.
Please try the latest image. Either 2.0.8 or even nightly builds.
Having a similar issue with public.ecr.aws/aws-observability/aws-for-fluent-bit:init-2.31.2 sending logs to AWS OpenSearch.
[2023/02/28 20:41:40] [ warn] [engine] failed to flush chunk '6-1677616889.860927077.flb', retry in 21 seconds: task_id=2, input=forward.1 > output=opensearch.1 (out_id=1)
[2023/02/28 20:41:41] [ warn] [engine] chunk '6-1677616877.850448169.flb' cannot be retried: task_id=1, input=forward.1 > output=opensearch.1
[2023/02/28 20:41:42] [ warn] [engine] failed to flush chunk '6-1677616901.873936214.flb', retry in 10 seconds: task_id=1, input=forward.1 > output=opensearch.1 (out_id=1)
[2023/02/28 20:41:42] [ warn] [engine] failed to flush chunk '6-1677616901.874024444.flb', retry in 6 seconds: task_id=4, input=forward.1 > output=opensearch.1 (out_id=1)
[2023/02/28 20:41:43] [ warn] [engine] chunk '6-1677616877.850342703.flb' cannot be retried: task_id=0, input=forward.1 > output=opensearch.1
[2023/02/28 20:41:48] [ warn] [engine] failed to flush chunk '6-1677616901.874024444.flb', retry in 7 seconds: task_id=4, input=forward.1 > output=opensearch.1 (out_id=1)
We've been running fluent-bit 2.0.8 for about a month now and things seem more stable. We haven't had the kind of catastrophic Elastic outage that seemed to trigger the problem before (Elastic being down for hours at a time), but we have had Elastic down for 20-30 minutes or so at a time, and so far we haven't seen fluent-bit 2.0.8 get stuck. We're planning to create a failure situation - shut down Elastic, or block its port from the client - for a few hours at a time to see if the issue still occurs, but it's a little lower on our priority list.
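If it helps anyone else reproduce this, the client-side blocking could be as simple as the sketch below (the port is an assumption; use whatever your Elasticsearch endpoint listens on):
# on the node running fluent-bit: drop outbound traffic to Elasticsearch for a few hours
iptables -A OUTPUT -p tcp --dport 9200 -j DROP
# ...let the flb backlog build up, then remove the rule and watch whether flushing resumes
iptables -D OUTPUT -p tcp --dport 9200 -j DROP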
@awangptc would you mind sharing your fluent-bit config? I am running into issues with logs not being forwarded. Only one pod (a base nginx test pod) shows up as forwarded in Kibana. I cannot see my application logs and I am not sure why. I get various different issues no matter what I set. Here is my config.
apiVersion: v1
items:
- apiVersion: v1
data:
custom_parsers.conf: |
[PARSER]
Name docker_no_time
Format json
Time_Keep On
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
fluent-bit.conf: |
[SERVICE]
Daemon Off
Flush 1
Log_Level info
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
Mem_Buf_Limit 50MB
Skip_Long_Lines Off
[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Read_From_Tail On
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[OUTPUT]
Name es
Match kube.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
HTTP_User ${FLUENT_ELASTICSEARCH_USER}
HTTP_Passwd ${FLUENT_ELASTICSEARCH_PASSWD}
tls On
tls.verify On
tls.ca_file ${FLUENT_ELASTICSEARCH_TLSCA}
tls.crt_file ${FLUENT_ELASTICSEARCH_TLSCRT}
tls.key_file ${FLUENT_ELASTICSEARCH_TLSKEY}
Logstash_Format On
Suppress_Type_Name On
Retry_Limit False
[OUTPUT]
Name es
Match host.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
HTTP_User ${FLUENT_ELASTICSEARCH_USER}
HTTP_Passwd ${FLUENT_ELASTICSEARCH_PASSWD}
tls On
tls.verify On
tls.ca_file ${FLUENT_ELASTICSEARCH_TLSCA}
tls.crt_file ${FLUENT_ELASTICSEARCH_TLSCRT}
tls.key_file ${FLUENT_ELASTICSEARCH_TLSKEY}
Logstash_Format On
Suppress_Type_Name On
Logstash_Prefix node
Retry_Limit False
I am having this issue with the AWS fluentbit fork - I filed a bug with them that has a full reproducible example with all configs here https://github.com/aws/aws-for-fluent-bit/issues/568
I also tested those configs with fluent/fluent-bit:2.0.9 and was able to reproduce it.
In both cases, if I comment out the two "parser" filters in my filters.conf, then this issue does not happen (but of course my messages are unparsed).
I was able to fix my issue with Replace_Dots On in my config file. Apparently, it is required since Elasticsearch 2.0.x, but I am running the latest Elasticsearch 8.5.x.
I hope this helps someone.
I tried Replace_Dots On in the opensearch output, but with the same results. Two of the messages make it to OpenSearch (like before, but now with dots replaced) and two still fail with "failed to flush chunk".
@rectalogic could you put together a minimal reproduction case? If you do, please ping me and I'll try to fix it.
@leonardo-albertovich see https://github.com/aws/aws-for-fluent-bit/issues/568 - it has all the config files and the docker command to run to reproduce. You just need to change -e OPENSEARCH_FQDN=search.play-test.internal.aws.cureatr.com in that docker command to point to your OpenSearch instance.
Also, if you change the image in that command from public.ecr.aws/aws-observability/aws-for-fluent-bit:init-2.31.3 to fluent/fluent-bit:2.0.9, you can still reproduce the issue.