Segmentation fault on fluentbit containers
Bug Report
Describe the bug
I noticed a lot of pods with a high restart count, and the pod events had several errors with exit code 139 (segmentation fault).
Your Environment
- Version used: 1.9.2 & 1.9.3 (tried on both)
- Configuration:
fluent-bit.conf: |
    [SERVICE]
        Flush        1
        Log_Level    info
        Daemon       off
        Parsers_File parsers.conf
        HTTP_Server  On
        HTTP_Listen  0.0.0.0
        HTTP_Port    2020

    @INCLUDE input.conf
    @INCLUDE filter.conf
    @INCLUDE output.conf

input.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            cri
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     20MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        Buffer_Chunk_Size 256k
        Buffer_Max_Size   10240k

parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

filter.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        Keep_Log            Off
        Labels              On
        Annotations         Off
        K8S-Logging.Parser  Off
        K8S-Logging.Exclude Off

    [FILTER]
        Name        modify
        Match       kube.*
        Hard_rename log message

    [FILTER]
        Name        modify
        Match       kube.*
        Hard_rename msg message

    [FILTER]
        Name        modify
        Match       kube.*
        Hard_rename level severity

    [FILTER]
        Name   lua
        Match  kube.*
        script filter.lua
        call   stacktrace_append

    [FILTER]
        Name        modify
        Match       kube.*
        Hard_rename trace logging.googleapis.com/trace
        Hard_rename spanId logging.googleapis.com/spanId
        Hard_rename traceSampled logging.googleapis.com/trace_sampled

filter.lua: |
    function stacktrace_append(tag, timestamp, record)
        if record['message'] ~= nil and record['stacktrace'] ~= nil then
            record['message'] = record['message'] .. "\n" .. record['stacktrace']
            return 2, timestamp, record
        end
        return 0, timestamp, record
    end

output.conf: |
    [OUTPUT]
        Name                         stackdriver
        Match                        kube.*
        Tag_Prefix                   kube.var.log.containers.
        resource                     k8s_container
        k8s_cluster_name             <redacted>
        k8s_cluster_location         <redacted>
        autoformat_stackdriver_trace true
        severity_key                 severity
        log_name_key                 logName
- Environment name and version: Kubernetes 1.21.11-gke.1100
- Filters and plugins: stackdriver
Probably #5081. You can try to set workers 1 on the OUTPUT as a workaround.
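For reference, a minimal sketch of that workaround applied to the output block posted above; the only assumed change is the added workers line, everything else is unchanged from the original config:

    [OUTPUT]
        Name                         stackdriver
        Match                        kube.*
        Tag_Prefix                   kube.var.log.containers.
        resource                     k8s_container
        k8s_cluster_name             <redacted>
        k8s_cluster_location         <redacted>
        autoformat_stackdriver_trace true
        severity_key                 severity
        log_name_key                 logName
        # workaround: run the stackdriver output single-threaded
        workers                      1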
Can you try running with valgrind? It should be available in the *-debug images.
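One possible way to do that, as a sketch only: run the same config under the matching -debug image tag with valgrind wrapping the binary. This assumes a 1.9.3-debug tag is published, that valgrind is on the PATH in that image, and that the binary lives at the standard /fluent-bit/bin/fluent-bit path; /path/to/conf is a placeholder for wherever the config files above live. In a cluster you would instead switch the DaemonSet image to the -debug tag and override the container command accordingly.

    # sketch: the -debug tag and paths are assumptions, adjust to your setup
    docker run --rm -it \
        -v /path/to/conf:/fluent-bit/etc \
        --entrypoint valgrind \
        fluent/fluent-bit:1.9.3-debug \
        /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf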
Setting workers to 1 on the stackdriver output seems to solve it.
Is it still there in the 1.9.4 images? Those are live now; I think a few issues with the change to default workers were resolved, but I'm not sure if this is one of them.
@patrick-stephens still getting SIGSEGV in v1.9.4 with more than 1 worker on the stackdriver output.
~~Yeah, the change in 1.9.4 sets the default workers to 1 for stackdriver, which resolves this issue: no workers value is specified here, so when the default was changed it triggered #5081.~~
~~#5081 is still an issue, but the change in defaults, which meant people who had not specified workers started seeing failures, is this issue, and that part is resolved, I believe.~~
~~@miguelvr does your original config now work with 1.9.4? That was my question, so we could close this and leave #5081 to track the specific issue when workers is specified.~~
EDIT: of course, now that I say that, I see the default is 2 (https://github.com/fluent/fluent-bit/blob/v1.9.4/plugins/out_stackdriver/stackdriver.c#L2550). It was cloudwatch I was thinking of: https://github.com/fluent/fluent-bit/pull/5417
Setting workers to 1 is only a mitigation, as workers are essential for improving the throughput of the pipeline.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.