Tail input plugin not fetching all logs when files are rotated quickly under heavy load
Bug Report
Describe the bug: The tail input plugin is not able to keep tailing files when file rotation happens. I used a container that generates 1,000,000 lines and logs them to stdout, and configured Fluent Bit to tail the log files and print them to standard output. The tail plugin only picks up ~200,000 lines (only from the first file); it does not fetch lines written after the rotation happens under heavy load.
To Reproduce
- Example log message if applicable:
{"log":"132.224.216.5 - roob8562 [17/Apr/2020:19:26:55 +0000] \"GET /architect HTTP/1.1\" 403 3271\n","stream":"stdout","time":"2020-04-17T19:27:02.669131074Z"}
- Steps to reproduce the problem:
Expected behavior: Generate 1,000,000 log lines to standard output and expect Fluent Bit to tail and fetch all 1,000,000 of them.
Your Environment
- Version used: 1.4.2
- Configuration:
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.15.10
- Server type and version:
- Operating System and version: Linux
- Filters and plugins: Tail and Standard Output
Additional context
fluent-bit configuration
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: default
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config-ds
  namespace: default
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  #
  fluent-bit.conf: |
    [SERVICE]
        Flush                     1
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 500M

    @INCLUDE input-kubernetes.conf
    @INCLUDE output-stdout.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/plaintext-*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Mem_Buf_Limit     500MB
        storage.type      filesystem
        Skip_Long_Lines   On
        Refresh_Interval  1
        Rotate_Wait       10

  output-stdout.conf: |
    [OUTPUT]
        Name             stdout
        Match            kube.*
        Format           json_lines
        json_date_format iso8601

  parsers.conf: |
    [PARSER]
        Name        stdout
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fl-exp
  namespace: default
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:1.4.2
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 2020
          resources:
            requests:
              cpu: 1000m
              memory: 1024Mi
            limits:
              cpu: 1000m
              memory: 1024Mi
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config-ds
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - operator: "Exists"
          effect: "NoExecute"
        - operator: "Exists"
          effect: "NoSchedule"
Metrics from one of the runs:
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="storage_backlog.1"} 0 1587153190798
fluentbit_input_bytes_total{name="tail.0"} 109649569 1587153190798
# TYPE fluentbit_input_files_closed_total counter
fluentbit_input_files_closed_total{name="tail.0"} 1 1587153190798
# TYPE fluentbit_input_files_opened_total counter
fluentbit_input_files_opened_total{name="tail.0"} 2 1587153190798
# TYPE fluentbit_input_files_rotated_total counter
fluentbit_input_files_rotated_total{name="tail.0"} 0 1587153190798
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="storage_backlog.1"} 0 1587153190798
fluentbit_input_records_total{name="tail.0"} 567607 1587153190798
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0 1587153190798
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 109649569 1587153190798
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 567607 1587153190798
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1587153190798
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0 1587153190798
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1587151599
The plugin is not able to keep up if the fluentbit_input_bytes_total rate goes beyond ~100 KB per second. I used the query rate(fluentbit_input_bytes_total{name="tail.0"}[5m]).
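If Prometheus is not scraping yet, here is a minimal sketch that measures roughly the same thing by sampling Fluent Bit's built-in metrics endpoint twice (it assumes the HTTP server on port 2020 from the config above; the host address and sampling window are illustrative):

#!/usr/bin/env python3
# Sketch: sample fluent-bit's Prometheus endpoint twice and compute the
# tail.0 ingest byte rate, roughly what
# rate(fluentbit_input_bytes_total{name="tail.0"}[5m]) reports.
import re
import time
import urllib.request

URL = "http://127.0.0.1:2020/api/v1/metrics/prometheus"  # illustrative host
PATTERN = re.compile(r'fluentbit_input_bytes_total\{name="tail\.0"\}\s+(\d+)')

def sample() -> int:
    # Fetch the exposition text and pull out the tail.0 byte counter value.
    body = urllib.request.urlopen(URL).read().decode()
    match = PATTERN.search(body)
    return int(match.group(1)) if match else 0

first = sample()
time.sleep(30)  # illustrative sampling window
second = sample()
print(f"tail.0 ingest rate: {(second - first) / 30:.0f} bytes/s")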
We encounter the same behaviour. If logs are written to STDOUT under high load, Fluent Bit stops working at a Docker log rotation and resumes at the next rotation. So it works for one rotation interval, not for the next, and so on.
We also did some small tests, which lead us to say the problem occurs between 1,000 and 1,250 log lines per second per node. The log size does not seem to matter: we tested with 132 and 446 characters, with the same behaviour.
Environment:
- Fluent-bit: 1.3.11
- Docker 19.03.6 (the rotation is configured to 32MB)
- Kubernetes: 1.16.6
- Operating System and version: Ubuntu 18.04
- Plugin config:
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_apps.db
    Mem_Buf_Limit     64MB
    Skip_Long_Lines   On
    Refresh_Interval  10
Tests
- 1 pod with ~1000 msg/s (132 and 446 characters) -> works fine
- 1 pod with ~1250 msg/s (132 characters) -> the described problems
- 2 pods with ~1000 msg/s each (132 characters) -> the described problems
- 8 pods with ~250 msg/s each (132 characters) -> the described problems
It seems that the ~1,000 msg/s limit applies per host, regardless of how many pods produce the logs. We also played around with the tail plugin options in all combinations, with no effect.
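For anyone who wants to probe that per-node threshold, here is a minimal sketch of a rate-controlled stdout generator (the pacing and padding are illustrative; this is not the exact tool used in these tests):

#!/usr/bin/env python3
# Sketch: emit fixed-size lines to stdout at a target rate (msg/s), to probe
# the per-node threshold described above.
import sys
import time

RATE = int(sys.argv[1]) if len(sys.argv) > 1 else 1000      # messages per second
LINE_LEN = int(sys.argv[2]) if len(sys.argv) > 2 else 132   # characters per line

interval = 1.0 / RATE
next_tick = time.monotonic()
n = 0
while True:  # stop with Ctrl-C
    # Pad each message to the requested length so the byte rate is predictable.
    payload = f"msg {n} ".ljust(LINE_LEN, "x")
    print(payload, flush=True)
    n += 1
    next_tick += interval
    delay = next_tick - time.monotonic()
    if delay > 0:
        time.sleep(delay)

Run it in a test pod with the target rate and line length as arguments (for example 1250 and 132) and check whether all lines survive the Docker rotations.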
How often is rotation happening? What are the chances that the file gets rotated within one refresh_interval?
I did 3 tests. The first 2 with 32MB rotation size:
- 1 pod / 446 chars/line / ~1000msg/s : 1m 8s until first rotation -> WORKS
- 1 pod / 136 chars/line / ~2000msg/s : 1m 34s until first rotation -> DOES NOT WORK
And one with 128MB rotation size:
- 1 pod / 136 chars/line / ~2000msg/s : 6m 7s until first rotation -> DOES NOT WORK
I set up Refresh_Interval to 1s for some tests as well, which had no effect.
Please note this happens only if the tail plugin is configured to tail files from /var/log/containers, which are symlinks.
In one of my other test runs in minikube, Docker's daemon.json was configured to rotate every 50 MB and keep the latest 5 files. 1,000,000 lines were generated, which took more than 10 seconds, and my refresh interval was 1 second.
ls -lh --time-style=full-iso 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b/
-rw-r----- 1 root root 25M 2020-04-24 19:46:31.559009075 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log
-rw-r----- 1 root root 48M 2020-04-24 19:46:16.625763749 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.1
-rw-r----- 1 root root 48M 2020-04-24 19:45:49.753334243 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.2
-rw-r----- 1 root root 48M 2020-04-24 19:45:19.867398737 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3
There is a reasonable gap of time between when the files are rotated. Only the contents of 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3 were accessible. The rest of the files were not tailed.
Input Configuration:
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/plaintext-*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Mem_Buf_Limit     500MB
    Skip_Long_Lines   On
    Refresh_Interval  1
    Rotate_Wait       10
Hey @thepythonista, @sneben ,
I am facing a similar kind of issue. I run Fluent Bit to print these logs to the stdout of a Kubernetes pod, and I have noticed that if the files are complete (i.e. the application is done writing logs to them), Fluent Bit parses the whole file and prints it to stdout. But if the application is still writing logs to the file, Fluent Bit reads only a few lines in one go when it scans the path, and after that it never prints the logs that are newly added to the file.
Is this related to the explanation above, where the pod is writing more than 1,000 msg/s?
Any suggestions?
I am seeing the same issue under high scale. Any workarounds/updates?
We are experiencing the same issue.
I send in 200k log messages. We are using Docker in Kubernetes. Docker log rotation rotates every 50 MB with 5 total files max. I have increased Rotate_Wait to 240, and Refresh_Interval is 10.
Some file rotations are detected, others are not. Later the refresh finds rotated files, which also feels wrong, because all of them were rotated.
I think the problem could be that Docker rotates by renaming all the other files: when a file is rotated it gets index 1, and all the other files are renamed with their index increased by one. Fluent Bit should also not try to pick up these renamed files as new files, since they were already rotated.
Not sure how to reproduce yet in a simple manner.
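A minimal sketch of a local repro along those lines, mimicking the json-file driver's rename-based rotation while writing as fast as possible (all file names, sizes and counts here are illustrative); point a tail input with a wildcard Path at the directory and compare the delivered line count with TOTAL_LINES:

#!/usr/bin/env python3
# Sketch of Docker json-file style rotation: write numbered lines quickly and,
# every ROTATE_BYTES, shift app.log -> app.log.1 -> app.log.2 ... like the
# json-file driver does.
import os

LOG = "app.log"
MAX_FILES = 5            # the live file plus .1 .. .4 (5 files total)
ROTATE_BYTES = 1_000_000
TOTAL_LINES = 1_000_000

def rotate() -> None:
    # Rename from the oldest index down so nothing is overwritten;
    # the oldest file is dropped when .4 is replaced.
    for i in range(MAX_FILES - 1, 0, -1):
        src = LOG if i == 1 else f"{LOG}.{i - 1}"
        if os.path.exists(src):
            os.replace(src, f"{LOG}.{i}")

written = 0
out = open(LOG, "a", buffering=1)
for n in range(TOTAL_LINES):
    # Mimic the docker json-file line format with an escaped newline inside.
    line = f'{{"log":"line {n}\\n","stream":"stdout"}}\n'
    out.write(line)
    written += len(line)
    if written >= ROTATE_BYTES:
        out.close()
        rotate()
        out = open(LOG, "a", buffering=1)
        written = 0
out.close()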
I have a similar issue (not sure if it is related or a different issue).
I have a PoC with low volumes (200 events), but Fluent Bit's tail input only fetches the first few log entries in each log file and then hangs. Then, when I restart Fluent Bit, it fetches the rest of the logs from the files. See this stack overflow entry for more details.
I am also running into a similar issue, in which log files that are appended to are not captured by tail and sent to Elasticsearch. Decreasing the refresh_interval, if anything, makes things worse, as td-agent-bit sees the file before anything is added to it and then promptly ignores any future updates.
Has anyone in this thread found a workaround?
For my specific issue, I found it was actually a case of #3111, as my logs were stored in AWS EFS. Moving the logs outside of EFS solved the issue.
I've got the same problem as described by @thepythonista. We only want Fluent Bit to forward our application-specific logs, so we have a tail Path configured like this:
/var/log/containers/*_our-app-*.log
The /var/log/containers dir contains a symlink for each container e.g. xyz-blue-6bc9c664f4-sg8b4_our-app-blue_xyz-blue-5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb.log
...which points to another symlink, e.g. /var/log/pods/our-app-blue_xyz-blue-6bc9c664f4-sg8b4_9b5687c0-d62c-4e82-af28-c032069eb84c/xyz-blue/0.log
which points to the container log file currently being written to e.g. /var/lib/docker/containers/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb-json.log.
When Docker (the json-file driver) rotates the log file, it suffixes the current log file name with .1 and renames the previous .1 file to .2, and so on. I'm wondering if the chain of symlinks is preventing Fluent Bit from accessing, or even being aware of, the *.log.1 file. Under high load we can see that the logs at the end of the *.log.1, *.log.2 etc. files have not been forwarded. The assumption is that Fluent Bit has fallen behind reading the current log file, so when that file is renamed, Fluent Bit can no longer access it or is unaware of it, because the symlinks only ever point to the current log file.
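To make the symlink behaviour visible, here is a small sketch that walks the chain for one container log (reusing the illustrative paths from above) and lists the rotated siblings; resolving the /var/log/containers entry always lands on the live -json.log, and nothing in that directory points at *.log.1 or *.log.2, so a reader that has fallen behind can only reach them through a file descriptor it already held:

#!/usr/bin/env python3
# Sketch: walk the kubelet/docker symlink chain for one container log and list
# what actually sits next to the real file. The path below is illustrative.
import glob
import os

link = ("/var/log/containers/xyz-blue-6bc9c664f4-sg8b4_our-app-blue_"
        "xyz-blue-5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb.log")

# Follow the chain: /var/log/containers -> /var/log/pods/.../0.log -> .../*-json.log
step = link
while os.path.islink(step):
    target = os.readlink(step)
    step = target if os.path.isabs(target) else os.path.join(os.path.dirname(step), target)
    print("->", step)

real = os.path.realpath(link)
print("resolves to:", real)

# The rotated files live beside the real target, not beside the symlink.
for rotated in sorted(glob.glob(real + ".*")):
    print("rotated sibling:", rotated)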
For what it's worth, we are using the fluent-bit 1.7.4 image; here is my service and input config:
[SERVICE]
    Flush                      1
    Daemon                     Off
    Log_Level                  debug
    Parsers_File               parsers.conf
    Parsers_File               custom_parsers.conf
    #
    # HTTP server for providing metrics endpoint
    #
    HTTP_Server                On
    HTTP_Listen                0.0.0.0
    HTTP_Port                  {{ .Values.service.port }}
    storage.path               /var/log/flb-storage/
    storage.sync               normal
    storage.checksum           off
    storage.backlog.mem.limit  20M

[INPUT]
    Name              tail
    Path              "/var/log/containers/*_our-app-*.log"
    Exclude_Path      "fluent*.log"
    Parser            docker
    Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
    Tag               ourcompany.k8s.<namespace_name>.<container_name>.<pod_name>.<docker_id>-
    Refresh_Interval  5
    Mem_Buf_Limit     500MB
    Skip_Long_Lines   On
    DB                /tmp/tail-containers-state.db
    DB.Sync           Normal
    Docker_Mode       On
    storage.type      filesystem
    Rotate_Wait       15
Hi, is there any specific solution for this issue? I am also facing a similar issue where Fluent Bit tails logs from a mounted NFS dir. If I restart Fluent Bit, it will tail the logs again, but it stops after a few minutes.
Hello
We have the same issue on 1.8.13
@AlessioCasco, for me this was resolved after setting the inotify watcher to false. Check this link for more details:
[INPUT]
    Name             tail
    Inotify_Watcher  false
I have almost the same problem. What should the fluent-bit settings be for high loads?
Yes, this happens even for moderate loads (around 400 logs per second). Could anybody help here?
We are also affected by this issue for pods as soon as they start logging over 1k messages per second. As soon as the load calms down, log messages are sent as expected. We use 10 MB Docker rotation and the following input conf:
[INPUT]
    Name              tail
    Tag               kube.<namespace_name>.<pod_name>_<container_name>-<docker_id>
    Tag_Regex         ^/var/log/containers/(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    Docker_Mode       On
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
No stale
I wonder if it may be related to #6552
Any chance you could take a look at it @lecaros or give your perspective here? I'll also take a deeper look later. Any chance you can test again @thepythonista to see if it still happens with the latest version?
@pwhelan I don't think this is in a release yet, but @thepythonista you can test with the latest version of master, container built overnight: ghcr.io/fluent/fluent-bit/master:x86_64 (don't use in production, obviously).
Agree with @patrick-stephens. Please test this with the latest image (this was initially reported in 1.4.2) and share the repro in case you still see the issue.
We are hitting the same issue using fluent-bit 2.0.9 on Kubernetes 1.24.
Originally our configuration was:
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    multiline.parser  cri
    Mem_Buf_Limit     200MB
    Skip_Long_Lines   Off
    DB                /var/log/flb-storage/offset.db
    Buffer_Max_Size   1M
but we ended up with this, which helps reduce the issue a little:
[INPUT]
    Name                     tail
    Path                     /var/log/containers/*.log
    Tag                      kube.*
    multiline.parser         cri
    Mem_Buf_Limit            200MB
    Skip_Long_Lines          Off
    DB                       /var/log/flb-storage/offset.db
    Buffer_Max_Size          1M
    Progress_Check_Interval  1ms
    Inotify_Watcher          false
    Rotate_Wait              15
    Refresh_Interval         1
    Buffer_Chunk_Size        2k
We made a small test case which might help pin down the issue:
First we put our cluster node under a little stress to generate some log traffic on it. Then we deployed our control container, which emits a sample log line every 10 ms with a unique incrementing ID, so at the end we could tell exactly which logs were missing. After the test run finished, we simply checked the content of both the '0.log' (current) and '0.log.<timestamp>' (last rotated) files on the filesystem for our control container. The '0.log.<timestamp>' file ends with log number "76812" and '0.log' starts with log number "76813". Up to this point everything looks fine: the CRI runtime rotates logs correctly. But if we compare it with Elasticsearch, the records between 76803 and 76825 are missing. We assume that every time the log file is rotated, Fluent Bit fails to process some number of records (at the end and at the beginning of the rotated file). In our test context, for the control container this resulted in only about 74,000 of 80,000 total logs being delivered (6,000 lost).
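For reference, a minimal sketch of that kind of gap check (the id=<n> line format and file handling are illustrative, not our exact tooling): dump the delivered records, e.g. exported from Elasticsearch, into a text file and run the script over it:

#!/usr/bin/env python3
# Sketch of the gap check: given the lines that actually reached the backend,
# extract the incrementing IDs and report which ranges never arrived.
# Usage: python3 gap_check.py delivered.txt
import re
import sys

ID_RE = re.compile(r"id=(\d+)")

# Collect every ID that made it through, sorted.
ids = sorted(
    int(m.group(1))
    for line in open(sys.argv[1])
    if (m := ID_RE.search(line))
)

# Any jump between consecutive IDs is a gap of lost records.
gaps = [
    (prev + 1, cur - 1)
    for prev, cur in zip(ids, ids[1:])
    if cur != prev + 1
]

total = ids[-1] + 1 if ids else 0
print(f"delivered {len(ids)} of {total} ({total - len(ids)} lost)")
for lo, hi in gaps:
    print(f"missing {lo}..{hi}")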
2.0.10 came out last week as well, and it has some further improvements in it.
Threaded input is now possible as well, I believe: threaded on.
With threaded on we are hitting 7071. By the way, right now we are running on v2.1.2 and, so far, it looks like there is no improvement. The issue with quick file rotation still seems to be relevant.
Logs may be lost under high load. Fluent Bit relies on inotify to passively receive file-change notifications, and per notification it can only read a few lines. Under high load the data source keeps writing to the file, triggering changes, while Fluent Bit only processes a few lines after each notification, so it obviously cannot keep up with the write speed.