fluent-bit
Fluent-bit input plugin tail doesn't process all logs: scan_blog add(): dismissed:
Bug Report
Describe the bug
Fluent Bit is not processing all logs located in /var/log/containers/.
To Reproduce
The following messages are displayed:
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/activator-85cd6f6f9-nrncf_knative-serving_activator-3b631f27f6667599ae940f94afe6a65a4d1d488e7979fced513fa910082a5ae1.log, inode 404768
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/activator-85cd6f6f9-nrncf_knative-serving_activator-ca32320178170fe4198ce1b0bd57d8ea031c7c886a7b0e3d66bb1b78b67613b8.log, inode 921337
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-agent-63659cdc8e5ddba3eaf729be280661b45fd198e6d2c7195965be85cdca81f41a.log, inode 536837
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-agent-8726abf73577f597e15716176cfcdce442b159d00ec12f59e439719d824a9585.log, inode 1190181
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-ovs-08045b767f2f8ee421b3b4d8d5b646b49b4e12199ae957cad178dd3d11670ec6.log, inode 663855
- Steps to reproduce the problem: Configuration details:
ServiceAccount:
rules:
- apiGroups:
- ""
resources:
- namespaces
- pods
verbs:
- get
- list
- watch
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: vmware-system
labels:
k8s-app: fluent-bit
data:
filter-kubernetes.conf: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[FILTER]
Name modify
Match kube.*
Copy kubernetes k8s
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_Under kubernetes
filter-record.conf: |
[FILTER]
Name record_modifier
Match *
Record tkg_cluster veba-demo.jarvis.tanzu
Record tkg_instance veba-demo.jarvis.tanzu
[FILTER]
Name nest
Match kube.*
Operation nest
Wildcard tkg_instance*
Nest_Under tkg
[FILTER]
Name nest
Match kube_systemd.*
Operation nest
Wildcard SYSTEMD*
Nest_Under systemd
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level debug
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
@INCLUDE input-kubernetes.conf
@INCLUDE input-systemd.conf
@INCLUDE input-kube-apiserver.conf
@INCLUDE input-auditd.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE filter-record.conf
@INCLUDE output-syslog.conf
input-auditd.conf: |
[INPUT]
Name tail
Tag audit.*
Path /var/log/audit/audit.log
Parser logfmt
DB /var/log/flb_system_audit.db
Mem_Buf_Limit 50MB
Refresh_Interval 10
Skip_Long_Lines On
input-kube-apiserver.conf: |
[INPUT]
Name tail
Tag apiserver_audit.*
Path /var/log/kubernetes/audit.log
Parser json
DB /var/log/flb_kube_audit.db
Mem_Buf_Limit 50MB
Refresh_Interval 10
Skip_Long_Lines On
input-kubernetes.conf: |
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
input-systemd.conf: |
[INPUT]
Name systemd
Tag kube_systemd.*
Path /var/log/journal
DB /var/log/flb_kube_systemd.db
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Systemd_Filter _SYSTEMD_UNIT=containerd.service
Read_From_Tail On
Strip_Underscores On
output-syslog.conf: |
[OUTPUT]
Name syslog
Match kube.*
Host 10.197.79.57
Port 514
Mode tcp
Syslog_Format rfc5424
Syslog_Hostname_key tkg_cluster
Syslog_Appname_key pod_name
Syslog_Procid_key container_name
Syslog_Message_key message
Syslog_SD_key k8s
Syslog_SD_key labels
Syslog_SD_key annotations
Syslog_SD_key tkg
[OUTPUT]
Name syslog
Match kube_systemd.*
Host 10.197.79.57
Port 514
Mode tcp
Syslog_Format rfc5424
Syslog_Hostname_key tkg_cluster
Syslog_Appname_key tkg_instance
Syslog_Message_key MESSAGE
Syslog_SD_key systemd
parsers.conf: |
[PARSER]
Name apache
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache2
Format regex
Regex ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name apache_error
Format regex
Regex ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name docker-daemon
Format regex
Regex time="(?<time>[^ ]*)" level=(?<level>[^ ]*) msg="(?<msg>[^ ].*)"
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
# http://rubular.com/r/tjUt3Awgg4
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name logfmt
Format logfmt
[PARSER]
Name syslog-rfc5424
Format regex
Regex ^\<(?<pri>[0-9]{1,5})\>1 (?<time>[^ ]+) (?<host>[^ ]+) (?<ident>[^ ]+) (?<pid>[-0-9]+) (?<msgid>[^ ]+) (?<extradata>(\[(.*)\]|-)) (?<message>.+)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name syslog-rfc3164-local
Format regex
Regex ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
Time_Keep On
[PARSER]
Name syslog-rfc3164
Format regex
Regex /^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$/
Time_Key time
Time_Format %b %d %H:%M:%S
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
[PARSER]
Name kube-custom
Format regex
Regex (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: vmware-system
labels:
k8s-app: fluent-bit
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
k8s-app: fluent-bit
template:
metadata:
labels:
k8s-app: fluent-bit
spec:
containers:
- image: projects.registry.vmware.com/tkg/fluent-bit:v1.6.9_vmware.1
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /
port: 2020
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: fluent-bit
ports:
- containerPort: 2020
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /api/v1/metrics/prometheus
port: 2020
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 80m
memory: 200Mi
requests:
cpu: 50m
memory: 100Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/log
name: var-log
- mountPath: /var/log/pods
name: var-log-pods
- mountPath: /var/log/containers
name: var-log-containers
- mountPath: /var/lib/docker/containers
name: var-lib-docker-containers
readOnly: true
- mountPath: /fluent-bit/etc/
name: fluent-bit-config
- mountPath: /run/log
name: systemd-log
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: fluent-bit
serviceAccountName: fluent-bit
terminationGracePeriodSeconds: 10
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Exists
- effect: NoExecute
operator: Exists
- effect: NoSchedule
operator: Exists
volumes:
- hostPath:
path: /var/log
type: ""
name: var-log
- hostPath:
path: /var/log/pods
type: ""
name: var-log-pods
- hostPath:
path: /var/log/containers
type: ""
name: var-log-containers
- hostPath:
path: /var/lib/docker/containers
type: ""
name: var-lib-docker-containers
- hostPath:
path: /run/log
type: ""
name: systemd-log
- configMap:
defaultMode: 420
name: fluent-bit-config
name: fluent-bit-config
updateStrategy:
rollingUpdate:
maxUnavailable: 1
type: RollingUpdate
Expected behavior
All logs in /var/log/containers/ should be processed.
Your Environment
- Version used: 1.6.9, 1.7.9 and 1.8.9
- Configuration: See above
- Environment name and version: Kubernetes 1.20.2
- Server type and version: Single K8s instance; see https://github.com/vmware-samples/vcenter-event-broker-appliance
- Operating System and version: VMware PhotonOS v4
- Filters and plugins: @INCLUDE input-kubernetes.conf @INCLUDE input-systemd.conf @INCLUDE input-kube-apiserver.conf @INCLUDE input-auditd.conf @INCLUDE filter-kubernetes.conf @INCLUDE filter-record.conf @INCLUDE output-syslog.conf
Additional context
Running tail -f manually from within the system on a specific pod log, which is writing to stdout, works:
{"log":"10/03/2021 14:47:13 - Handler Processing Completed ...\n","stream":"stdout","time":"2021-10-03T14:47:13.829672574Z"}
{"log":"\n","stream":"stdout","time":"2021-10-03T14:47:13.829772103Z"}
Logs which, for example, aren't processed:
root@veba-kn [ /var/log/containers ]# ls -rtl
total 376
lrwxrwxrwx 1 root root 100 Sep 13 21:31 antrea-agent-gql5r_kube-system_antrea-agent-8726abf73577f597e15716176cfcdce442b159d00ec12f59e439719d824a9585.log -> /var/log/pods/kube-system_antrea-agent-gql5r_31aa406a-286c-495b-9dcf-e4036c2a4756/antrea-agent/3.log
lrwxrwxrwx 1 root root 98 Sep 13 21:31 antrea-agent-gql5r_kube-system_antrea-ovs-3f300f1d7b28c069df1f34cf37ff89be95d69fc3dc4ea0f269b5bd07ce5d56c1.log -> /var/log/pods/kube-system_antrea-agent-gql5r_31aa406a-286c-495b-9dcf-e4036c2a4756/antrea-ovs/3.log
lrwxrwxrwx 1 root root 102 Sep 13 21:31 envoy-89vct_contour-external_shutdown-manager-c8ed97927c25d465f31cce5ab8bd91d02742504f8cf73ad53e493738d0a17f74.log -> /var/log/pods/contour-external_envoy-89vct_1c947a55-2b86-48bd-b442-c6c51ec2dd3a/shutdown-manager/3.log
lrwxrwxrwx 1 root root 91 Sep 13 21:31 envoy-89vct_contour-external_envoy-0ea7a33d12105058f74eae9653dd0266ac99ef2ba7f6cb3a3b04a8ec3bc02525.log -> /var/log/pods/contour-external_envoy-89vct_1c947a55-2b86-48bd-b442-c6c51ec2dd3a/envoy/3.log
lrwxrwxrwx 1 root root 104 Sep 13 21:31 contour-5869594b-7jm89_contour-external_contour-803e6591f657fae9539b64ae4f86fa44cce99b409c5f92979c6045cf4b98b52c.log -> /var/log/pods/contour-external_contour-5869594b-7jm89_cc6cf243-7d3f-4839-91e8-741ab87f6488/contour/3.log
lrwxrwxrwx 1 root root 106 Sep 13 21:31 contour-5d47766fd8-n24mz_contour-internal_contour-ae34a8ae0b8398da294c5061ec5c0ef1e9be8cb2979f07077e5e9df12f2bab67.log -> /var/log/pods/contour-internal_contour-5d47766fd8-n24mz_a87131ad-d73a-4371-a47b-dcc410f3b6e4/contour/3.log
lrwxrwxrwx 1 root root 100 Sep 13 21:31 coredns-74ff55c5b-mjdlr_kube-system_coredns-60bd5f49def85a0ddc929e2c2da5c793a3c6de00cd6a81bdcfdb21f3d4f45129.log -> /var/log/pods/kube-system_coredns-74ff55c5b-mjdlr_7ef260c1-308e-4162-8a84-231d560f8023/coredns/3.log
I've also tried running the DaemonSet with:
securityContext:
privileged: true
Similar issues I found, but which don't provide the solution for this issue: #3857 #4014
Your help would be much appreciated. Thanks
@rguske I also have this problem, how did you solve it?
Hi @zy286938348, I have no clue at the moment, but I will follow some of the suggestions made here: https://fluent-all.slack.com/archives/C0CTQGHKJ/p1637005220493300 Please keep me posted if you have more luck finding the solution for this particular problem. Thanks
Hi @zy286938348, which container runtime are you using?
@rguske @zy286938348 I am also seeing this issue using containerd as my runtime.
It came up ~16 days after my first deployment of fluentbit 1.8.1 on a fresh k8s 1.21 cluster on EKS.
My fluentbit configuration is basically the default helm chart install, with the kubernetes parser, and that's about it.
It looks like I can't access the Slack thread above @rguske - let me know if I can share more info to help figure this out.
I think this is the RCA for me: https://github.com/fluent/fluent-bit/issues/3873
When I removed Ignore_Older, the issues I was seeing went away quite quickly.
Please tag me if you are able to try this and confirm it works for you too.
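For reference, the change amounts to deleting the Ignore_Older line from the tail input. A minimal sketch based on the input-kubernetes.conf posted above (the commented value is hypothetical, only there to show where the option would sit):

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    # Ignore_Older   1d   <- removing this line is what made the dismissals stop for me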
Hi @kc-dot-io,
I appreciate your comment and help. I wanted to update this issue last week with my latest observations, but last week was a busy one. However, it's interesting that you mention seeing this issue with containerd, since I figured out that changing the runtime from docker to containerd finally provided positive results. Consequently, I was pretty sure it was due to the old dockershim.
Regarding the use of Ignore_Older: I'm sure I've never added it to the ConfigMap, but I will double check this next week.
Again, thanks a lot.
I also have this issue. Upgrading Fargate 1.3 (using Docker as runtime) to 1.4 (using containerd) did not help.
Circling back to this: I'm no longer certain that Ignore_Older is the RCA for me, and yes, I can confirm I'm using containerd.
The issue still appears to occur from time to time, and while it's usually not limited to a single node, I have observed the circumstance where one node will work while another won't. I'm now testing increasing the resources allocated to the fluent bit replicas, since the issue appears to come and go.
If anyone has any more ideas, I'm happy to troubleshoot it with my setup. Please do ping me in follow ups.
Are there any updates on this issue? I am facing the same on a K8s 1.21 deployment with Fluent Bit v1.8.12 and Elasticsearch.
@srideok2 I don't think there is a permanent solution yet but I've talked to a few other people who had issues and we've had luck with reconfiguring our fluentbits to get to a place where things were stable.
From what I can see the issue mostly happens when buffer sizes get exceeded due to large log lines.
The solution I used, which worked for at least one other person, was to turn on the option for skipping long lines and then jack up the buffer size on the inputs and filters.
IIRC there is another open issue that speaks to this at a lower level, and I think there are multiple different symptoms of the same root cause here, which is basically that when the memory buffer is exceeded, fluent-bit just silently stops working until it can be reset.
Ping me on the Slack community if you have any questions. I hope this helps.
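Roughly, the kind of change I mean looks like this - a sketch only, the option names are the standard tail and kubernetes-filter options, but the values are illustrative rather than the exact ones from my setup:

[INPUT]
    Name               tail
    Tag                kube.*
    Path               /var/log/containers/*.log
    Skip_Long_Lines    On
    # raise these if long log lines are being dropped or files dismissed
    Buffer_Chunk_Size  512k
    Buffer_Max_Size    5M
    Mem_Buf_Limit      50MB

[FILTER]
    Name               kubernetes
    Match              kube.*
    # filter-side buffer for the enriched record
    Buffer_Size        2MB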
Hi all,
it finally worked for our project after replacing Docker with Containerd and by applying the following config:
https://github.com/vmware-samples/vcenter-event-broker-appliance/tree/development/files/configs/fluentbit/templates
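In essence, switching runtimes means the tail input parses CRI-formatted lines rather than Docker JSON. A sketch only (see the linked templates for the actual config; the cri parser is the one already defined in parsers.conf above):

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # containerd/CRI writes "<time> <stream> <logtag> <message>" instead of Docker JSON
    Parser            cri
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10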
Has this been resolved? This seems to be related to the inode not being removed during the file removal process, i.e. the inode entry not being removed from files_static / files_event (https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_config.h#L129, https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_file.c#1108).
@edsiper can you please give us some pointers to debug this issue? As I understand it, the method at https://github.com/fluent/fluent-bit/blob/cd92cf36099e69c09b0448e6bc0032bd4ac81ac5/plugins/in_tail/tail_file.c#L861 is returning -1 and the files are getting dismissed. Can you please help with some pointers on understanding when this method returns -1?
There is a bug in v1.8.13; we have just published the new container image for v1.8.14, please upgrade:
fluent/fluent-bit:1.8.14
I'm using version 1.9.0 and I'm getting dismissed logs. This doesn't really seem fixed.
Same issue here. Openshift 4 CoreOS and cri-o. Any input regarding this would be great.
This got fixed when we started using 1.9.0. @danielserrao
@ramatudaan I switched my fluent-bit image to 1.9.0 and my issue now seems to be security/permission related, or a misconfiguration of the templates to match Openshift 4 CoreOS/cri-o. I created an issue on it: here
Hello guys, is there any ETA on this @edsiper? It sounds like a lot of issues have been opened about this topic and the 1.8.X Debian packages are not available anymore. None of the 1.9.X packages fix the issue.
Same issue here, tail logging does not work.
I have the same issue. I tried both the http output to Logstash and the es plugin to Elasticsearch.
fluent-bit version: 1.9.3. Problem: when collecting logs with the fluent-bit tail plugin, fluent-bit stops collecting logs after running for a while. Cause: when multiline.parser is used to merge log lines, cont_state can merge several log entries into a single line, causing tail to stop collecting. Approach: when multiline.parser merges logs, it first uses start_state to decide whether a line is a start line, and then uses cont_state to decide whether the following lines should be merged, so start_state and cont_state should use mutually exclusive logic.
Correct example:
rule "start_state" "/^(\d{4})(.*)/" "cont"
rule "cont" "/^(?!\d{4})(.*)/" "cont"
Incorrect example:
rule "start_state" "/^(\d{4})(.*)/" "cont"
rule "cont" "/(.*)/" "cont"
So far I have not run into the problem of tail stopping collection again.
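In parser-definition form, the working variant of those rules would look roughly like this (the parser name and the leading-four-digit-timestamp assumption are illustrative):

[MULTILINE_PARSER]
    name          multiline-4digit-ts
    type          regex
    flush_timeout 1000
    # start_state matches lines that begin with four digits (e.g. a year);
    # cont only matches lines that do NOT, so the two states are mutually exclusive
    rule      "start_state"   "/^(\d{4})(.*)/"      "cont"
    rule      "cont"          "/^(?!\d{4})(.*)/"    "cont"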
Has anyone tried turning Skip_Long_Lines to off? Some of my logs were dismissed because they were too long, I guess; this did the trick.
I'm having this issue as well on 1.9.6. The problem is that I NEED the long lines to be processed.
@Jonathan-w6d yes, this helped for me, along with increasing the max buffer size. That said, you'll still run into an issue with a memory leak and eventually need to restart the process; I have to do this every 90-120 days.
@zulrang Try increasing your max buffer size to something very large - you'll probably still have to restart the processes every so often, but in theory if they buffer is large enough it should be able to process long lines until it can't again.
I have all buffers set to 2M and it has never successfully flushed chunks from one specific node - the one that is logging stack traces (about 8k long). The other 13 work fine.
I'm having the same issue and have tried all suggestions in this thread bar containerd. Tail is dismissing the one and only log I want it to tail. I have tried downgrading to 1.6.10 and upgrading to various versions above, up to the latest; the problem occurs in every version I try.
EKS 1.21, docker, Amazon Linux 2 node group, latest helm install from Bitnami.
@zulrang did you try going higher? I went all the way up to ~25M using the tail input, k8s filter and skip long lines off before it was stable.
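In config terms, that differs from the earlier sketch only in these values (again illustrative, not an exact copy of my setup):

[INPUT]
    Name               tail
    Tag                kube.*
    Path               /var/log/containers/*.log
    Skip_Long_Lines    Off
    Buffer_Chunk_Size  1M
    Buffer_Max_Size    25M

[FILTER]
    Name               kubernetes
    Match              kube.*
    Buffer_Size        25M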
Either there's something seriously wrong with Fluent Bit, or I'm fundamentally misunderstanding how it works, if it requires a 25M buffer to process a few 10k-long lines.
@kc-dot-io I have a 100M buffer and Skip_Long_Lines is Off, and yesterday I had dismissed lines again; the buffer was full and I had to do a rollout restart of fluent bit. I don't like this.