fluent-bit v3 occasionally crashes with SIGSEGV
Bug Report
Describe the bug
Every now and then we see a crash on one of our fluent-bit pods in our Kubernetes clusters.
These crashes started after we upgraded from fluent-bit version 2.2.2 (which was rock solid) to version 3.0.2, and they still persist in version 3.0.3.
To Reproduce
The logs from the latest crash look like this:
$ kubectl logs -p fluent-bit-5fj89
Fluent Bit v3.0.3
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
___________.__ __ __________.__ __ ________
\_ _____/| | __ __ ____ _____/ |_ \______ \__|/ |_ ___ _\_____ \
| __) | | | | \_/ __ \ / \ __\ | | _/ \ __\ \ \/ / _(__ <
| \ | |_| | /\ ___/| | \ | | | \ || | \ / / \
\___ / |____/____/ \___ >___| /__| |______ /__||__| \_/ /______ /
\/ \/ \/ \/ \/
[2024/05/01 01:00:55] [engine] caught signal (SIGSEGV)
#0 0x55b2289a4c69 in flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1 0x55b2289a4c69 in flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2 0x55b2289a2bcc in flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3 0x55b2289d1f3f in ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4 0x55b2289d1f3f in ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5 0x55b228987797 in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6 0x55b228987ef5 in flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7 0x55b2289a5c3c in flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8 0x55b2289d26cc in flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9 0x55b2289ce425 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x55b2289578d4 in flb_input_collector_fd() at src/flb_input.c:1949
#11 0x55b228971547 in flb_engine_handle_event() at src/flb_engine.c:575
#12 0x55b228971547 in flb_engine_start() at src/flb_engine.c:941
#13 0x55b22894d193 in flb_lib_worker() at src/flb_lib.c:674
#14 0x7fa5729aaea6 in ???() at ???:0
#15 0x7fa57225ea6e in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0
The stack trace always looks the same and there are no errors logged beforehand.
- Steps to reproduce the problem:
I don't know how to reproduce the problem as it seems to happen at random times.
Expected behavior
No crashes.
Your Environment
- Version used:
v3.0.3
- Configuration:
We deploy fluent-bit via the Helm chart hosted at https://fluent.github.io/helm-charts. The values we provide to it follow the install sketch below.
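For context, a minimal sketch of a typical install with that chart; the release name, namespace, and values filename here are illustrative assumptions rather than our exact invocation:
$ helm repo add fluent https://fluent.github.io/helm-charts
$ helm repo update
$ helm upgrade --install fluent-bit fluent/fluent-bit --namespace logging --create-namespace --values values.yaml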
config:
service: |
[SERVICE]
Daemon Off
Flush 5
HTTP_Server On
Log_Level warn
Parsers_File custom_parsers.conf
Parsers_File parsers.conf
inputs: |
[INPUT]
Buffer_Chunk_Size 800k
Buffer_Max_Size 20MB
DB /var/log/containers/fluent-bit.db
DB.locking true
Multiline.parser cri, docker
Name tail
Path /var/log/containers/*.log
Skip_Long_Lines Off
Tag kube.*
[INPUT]
Buffer_Chunk_Size 512k
Buffer_Max_Size 5MB
DB /var/log/kube-audit.db
DB.locking true
Name tail
Parser kube_audit
Path /var/log/kube-audit.log
Tag kube_audit
[INPUT]
Name systemd
Read_From_Tail On
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Tag host.*
filters: |
[FILTER]
# If the log has been broken up into multiple lines, rejoin them.
Match *
Mode partial_message
Multiline.key_content log
Name multiline
[FILTER]
# Enrich with kube data
Annotations Off
Buffer_Size 512k
K8S-Logging.Parser On
Labels Off
Match kube.*
Merge_Log On
Name kubernetes
[FILTER]
Name lua
Match kube.*
# lua function to get the size of a log message
script /fluent-bit/scripts/functions.lua
call add_size
# Put rewrite_tag rules early to reduce the amount of work required.
# That's because a new message is produced with the new tag and gets processed by all the filters again starting from the top.
# However it needs to be after the `kubernetes` filter since that requires the original tag name to enrich the data.
# Ruby regex format can be tested at https://rubular.com/
# Spaces are not allowed in the regex since it needs to be one contiguous string so use \s instead.
[FILTER]
# Retag Apache logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
# The new tag isn't called `apache-high-precision` because the *apache* match later chooses the heweb_apache parser.
Match *apache*
Name rewrite_tag
# $KEY REGEX NEW_TAG KEEP
Rule $log ^[^\[]+\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\] web-high-precision false
[FILTER]
# Retag Nginx logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
# The new tag isn't called `nginx-high-precision` because the *nginx* match later chooses the nginx_common parser.
Match *nginx*
Name rewrite_tag
# $KEY REGEX NEW_TAG KEEP
Rule $log ^\S+\s\S+\s\S+\s\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\] ngx-high-precision false
[FILTER]
Match *php*
Name rewrite_tag
# $KEY REGEX NEW_TAG KEEP
Rule $channel AccessLogs Security_logs false
Rule $channel Security Security_logs false
[FILTER]
# Lift the kubernetes records so we can remove the unrequired ones
Add_prefix kubernetes.
Match kube.*
Name nest
Nested_under kubernetes
Operation lift
[FILTER]
# Remove unrequired kubernetes fields
Match kube.*
Name record_modifier
Remove_key kubernetes.container_hash
Remove_key kubernetes.docker_id
Remove_key kubernetes.container_image
[FILTER]
# We plan to remove the php-logs container. This is a hack to keep existing saved searches and alerts working when we do.
Condition Key_Value_Equals kubernetes.container_name php-fpm
Condition Key_Exists monolog_level
Match kube.*
Name modify
Set kubernetes.container_name php-logs
[FILTER]
# Lift the kubernetes.labels as well so we can modify the `app` one.
Add_prefix kubernetes.labels.
Match kube.*
Name nest
Nested_under kubernetes.labels
Operation lift
[FILTER]
# Rename the `app` label to `app.kubernetes.io/name` to stop the clash with all the `app.kubernetes.io/*` labels.
Hard_rename kubernetes.labels.app kubernetes.labels.app.kubernetes.io/name
Match kube.*
Name modify
[FILTER]
# Nest the Kubernetes labels back again.
Match kube.*
Name nest
Nest_under kubernetes.labels
Operation nest
Remove_prefix kubernetes.labels.
Wildcard kubernetes.labels.*
[FILTER]
# Nest the kubernetes records back under kubernetes
Match kube.*
Name nest
Nest_under kubernetes
Operation nest
Remove_prefix kubernetes.
Wildcard kubernetes.*
[FILTER]
# Send apache logs to the heweb_apache parser
Key_Name log
Match *apache*
Name parser
Parser heweb_apache
Preserve_Key True
Reserve_Data True
[FILTER]
# Send apache logs with a high time precision to the heweb_apache parser
Key_Name log
Match *web-high-precision*
Name parser
Parser heweb_apache_high_precision
Preserve_Key True
Reserve_Data True
# Fix up logs so that all of them can be ingested by Elasticsearch.
[FILTER]
Name lua
Match *
script /fluent-bit/scripts/functions.lua
call clean_logs
[FILTER]
# A number of fields share their names with other JSON logs being imported but with clashing data types.
# Just place all of them below a `kube_audit` key to avoid that.
Match kube_audit
Name nest
Nest_Under kube_audit
Operation nest
Wildcard *
[FILTER]
# Send nginx logs to the nginx_common parser
Key_Name log
Match *nginx*
Name parser
Parser nginx_common
Preserve_Key True
Reserve_Data True
[FILTER]
# Send Nginx logs with a high time precision to the nginx_high_precision parser
Key_Name log
Match *ngx-high-precision*
Name parser
Parser nginx_high_precision
Preserve_Key True
Reserve_Data True
[FILTER]
# Drop resque stdout [notice] lines
Exclude log ^\[notice\]*
Match *heweb_resque*
Name grep
[FILTER]
Add_prefix logEvent_
Match *elasticsearch*
Name nest
Nested_under logEvent.url
Operation lift
[FILTER]
# regex stdout resque log format
Key_Name log
Match *heweb_resque*
Name parser
Parser resque_stdout
Preserve_Key True
Reserve_Data True
[FILTER]
# Populate cluster_name field
Match *
Name record_modifier
Record cluster_name prod2.he0.io
[FILTER]
# Drop unneeded records as docker handles ts
# These fields result in ES errors
Match *
Name record_modifier
Remove_key params
Remove_key time
Remove_key ts
Remove_key url
Remove_key logEvent
Remove_key headers
Remove_key host
Remove_key _p
[FILTER]
# parse node-local-dns records
Key_Name log
Match *node-cache*
Name parser
Parser coredns
Preserve_Key True
Reserve_Data True
[FILTER]
# parse coredns records
Key_Name log
Match *coredns*
Name parser
Parser coredns
Preserve_Key True
Reserve_Data True
[FILTER]
# parse redash logs
Key_Name log
Match *redash*
Name parser
Parser redash
Preserve_Key True
Reserve_Data True
[FILTER]
Name record_modifier
Match *system*
Remove_key node
Remove_key pod
Remove_key resource
Remove_key service
[FILTER]
Name modify
Match Security_logs
Remove_wildcard kubernetes
[FILTER]
# Rename the source field from kubernetes-event-exporter to avoid a type clash.
Match *kube-event-exporter*
Name modify
Rename source kube_event_source
[FILTER]
# Rename the policy field from kyverno to avoid a type clash.
Match *kyverno*
Name modify
Rename policy kyverno_policy
[FILTER]
# Give Amazon trace IDs a consistent name.
# This is at the end after all the custom parsers have been chosen since some of them produce X_Amzn_Trace_Id
# Unfortunately they can't produce `X-Amzn-Trace-Id` because a hyphen is an illegal character for the regex group names.
Match *
Name modify
Rename request_X-Amzn-Trace-Id X-Amzn-Trace-Id
Rename X_Amzn_Trace_Id X-Amzn-Trace-Id
outputs: |
[OUTPUT]
compress gzip
Format json
Host logstash-cluster-shipper.elasticsearch.svc.cluster.local
Match *
Name http
Port 443
Retry_Limit 120
tls On
tls.verify Off
json_date_format iso8601
[OUTPUT]
# Send to Elasticsearch security logs
Generate_ID On
Host elasticsearch.prod1.apps.he0.io
Include_Tag_Key On
Index security-logs-ilm
Match Security_logs
Name es
Port 443
Replace_Dots On
storage.total_limit_size 10G
Suppress_Type_Name On
tls On
tls.verify Off
Trace_Error On
customParsers: |
[PARSER]
Name coredns
Format regex
Regex ^\[(?<level>\S*)\] (?<remote>.*):(?<port>\S*) - (?<coredns_id>\S*) "(?<type>\S*) (?<coredns_class>\S*) (?<coredns_name>\S*) (?<coredns_proto>\S*) (?<size>\S*) (?<do>\S*) (?<buffer_size>\S*)" (?<rcode>\S*) (?<rflags>\S*) (?<rsize>\S*) (?<duration>\S*)
[PARSER]
Name heweb_apache
Format regex
Key_Name log
Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Key time
[PARSER]
Name heweb_apache_high_precision
Format regex
Key_Name log
Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
Time_Format %d/%b/%Y:%H:%M:%S.%L %z
Time_Key time
[PARSER]
Name klog
Format regex
Regex (?<ID>\S*)\s(?<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s*(?<line>\d*)\s(?<file>\S*])\s(?<message>.*)
# Command | Decoder | Field | Optional Action |
# ==============|=============|=======|===================|
Decode_Field_As escaped log try_next
Decode_Field_As escaped_utf8 log
[PARSER]
Format json
Name kube_audit
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
Time_Keep true
Time_Key requestReceivedTimestamp
[PARSER]
Name nginx_common
Format regex
Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Key time
[PARSER]
Name nginx_high_precision
Format regex
Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
Time_Format %d/%b/%Y:%H:%M:%S.%L %z
Time_Key time
[PARSER]
Name resque_stdout
Format regex
Regex ^\[(?<severity>\S*)\] .*\(Job\{(?<queue>\S*)\} \| ID\: (?<JobID>\S*) \| (?<class>\S*) \| \[(?<jsonData>.*)\]\)(?<message>.*)
[PARSER]
Name redash
Format regex
Regex ^\[(?<timestamp>.*)\]\[PID\:(?<pid>.*)\]\[(?<level>.*)\]\[(?<output>.*)\] method=(?<method>\S*) path=(?<path>\S*) endpoint=(?<endpoint>\S*) status=(?<status>\S*) content_type=(?<content_type>\S*)( charset=(?<charset>\S*))? content_length=(?<content_length>\S*) duration=(?<duration>\S*) query_count=(?<query_count>\S*) query_duration=(?<query_duration>\S*)
dnsConfig:
options:
- name: ndots
value: "1"
image:
tag: 3.0.3
luaScripts:
functions.lua: |
{{- (readFile "fluent-bit.lua") | nindent 4 }}
priorityClassName: daemon-sets
resources:
limits:
memory: 256Mi
requests:
cpu: 120m
memory: 48Mi
service:
annotations:
service.kubernetes.io/topology-mode: auto
tolerations:
- operator: Exists
effect: NoExecute
- operator: Exists
effect: NoSchedule
updateStrategy:
rollingUpdate:
maxUnavailable: 20%
type: RollingUpdate
- Environment name and version (e.g. Kubernetes? What version?):
Kubernetes v1.28.9
- Server type and version:
AWS EC2 instances.
- Operating System and version:
Kubernetes is running on Ubuntu 20.04 hosts.
- Filters and plugins:
Should be covered by the configuration shown above.
Additional context
If you think the crashes could be caused by functions.lua, I can also supply its content.
I am also experiencing this exact issue with Fluent Bit version 3.0.3, running on our self-hosted Kubernetes cluster (Kubernetes version v1.28.3, host OS Ubuntu 22.04). Thanks!
Facing the same issue on aarch64 (ARMv8) nodes in a Kubernetes 1.25 environment since version 3 (also tried latest 3.0.4).
UPDATE: we are currently still on 2.2.3 because of this error. The crash happens in 2.2.3 too, but very rarely compared to v3, so perhaps a related library or framework used in the background changed, and this information might help figure out which one.
I have the same issue too. We were using 2.1.3 and saw many crashes in the past. I updated to the latest 3.1.6 and the crash still happens.
From my observation, this issue happens frequently on nodes where we create pods frequently; nodes where pods are rarely killed or restarted have no issue. I have currently set the memory limit to 2GB, and Grafana does not show any out-of-memory events.
Now I'm trying 2.2.2 and 2.2.3 as suggested by the comments above.
We are experiencing the same issue described in the comment above, with fluent-bit crashing on the nodes where pods are frequently created. We also tested with fluent-bit version 3.1.8 and still observed the issue.