
Segmentation fault on fluentbit containers

miguelvr opened this issue 3 years ago · 7 comments

Bug Report

Describe the bug

I noticed a lot of pods with a high restart count, and the pod events had several errors with exit code 139 (segmentation fault).

Your Environment

  • Version used: 1.9.2 & 1.9.3 (tried on both)

  • Configuration:

    fluent-bit.conf: |
       [SERVICE]
         Flush         1
         Log_Level     info
         Daemon        off
         Parsers_File  parsers.conf
         HTTP_Server   On
         HTTP_Listen   0.0.0.0
         HTTP_Port     2020
    
       @INCLUDE input.conf
       @INCLUDE filter.conf
       @INCLUDE output.conf
    
     input.conf: |
       [INPUT]
         Name              tail
         Tag               kube.*
         Path              /var/log/containers/*.log
         Parser            cri
         DB                /var/log/flb_kube.db
         Mem_Buf_Limit     20MB
         Skip_Long_Lines   On
         Refresh_Interval  10
         Buffer_Chunk_Size 256k
         Buffer_Max_Size   10240k
    
     parsers.conf: |
       [PARSER]
         Name          cri
         Format        regex
         Regex         ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
         Time_Key      time
         Time_Format   %Y-%m-%dT%H:%M:%S.%L%z
    
     filter.conf: |
       [FILTER]
         Name                kubernetes
         Match               kube.*
         Kube_URL            https://kubernetes.default.svc:443
         Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
         Merge_Log           On
         Keep_Log            Off
         Labels              On
         Annotations         Off
         K8S-Logging.Parser  Off
         K8S-Logging.Exclude Off
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   log message
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   msg message
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   level severity
    
       [FILTER]
         Name          lua
         Match         kube.*
         script        filter.lua
         call          stacktrace_append
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   trace         logging.googleapis.com/trace
         Hard_rename   spanId        logging.googleapis.com/spanId
         Hard_rename   traceSampled  logging.googleapis.com/trace_sampled
    
     filter.lua: |
       function stacktrace_append(tag, timestamp, record)
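         -- Fluent Bit Lua filter return codes: 2 = record was modified, keep the
         -- original timestamp; 0 = record passes through unchanged.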
         if record['message'] ~= nil and record['stacktrace'] ~= nil then
             record['message'] = record['message'] .. "\n" .. record['stacktrace']
             return 2, timestamp, record
         end
         return 0, timestamp, record
       end
    
    output.conf: |
         [OUTPUT]
           Name                          stackdriver
           Match                         kube.*
           Tag_Prefix                    kube.var.log.containers.
           resource                      k8s_container
           k8s_cluster_name              <redacted>
           k8s_cluster_location          <redacted>
           autoformat_stackdriver_trace  true
           severity_key                  severity
           log_name_key                  logName
    
  • Environment name and version: Kubernetes 1.21.11-gke.1100

  • Filters and plugins: stackdriver

miguelvr · Jun 03 '22

Probably #5081. You can try to set workers 1 on the OUTPUT as a workaround.
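
For reference, a minimal sketch of that workaround applied to the output config above; only the workers line is new, and the remaining stackdriver keys from output.conf are kept as posted:

    [OUTPUT]
        Name     stackdriver
        Match    kube.*
        workers  1

As noted further down the thread, this is only a mitigation: multiple workers are what improve the throughput of the pipeline.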

chicocvenancio · Jun 06 '22

Can you try running with valgrind? It should be available in the *-debug images.
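
In case it helps, a minimal sketch of how that could be wired into a Kubernetes deployment like this one, assuming the layout of the official images (binary at /fluent-bit/bin/fluent-bit, config at /fluent-bit/etc/fluent-bit.conf); the image tag and container name below are placeholders:

    # Hypothetical DaemonSet container override: swap in the -debug image
    # and wrap the usual entrypoint in valgrind so the crash gets a trace.
    containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.9.3-debug
        command: ["valgrind"]
        args: ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf"]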

tarruda · Jun 06 '22

> probably https://github.com/fluent/fluent-bit/issues/5081. You can try to set workers 1 on the OUTPUT as a workaround.

This seems to solve it.

miguelvr · Jun 06 '22

Is it still there in the 1.9.4 images? These are live now; I think a few issues with the change to default workers were resolved, but I'm not sure if this is one of them.

patrick-stephens · Jun 06 '22

@patrick-stephens still getting SIGSEGV in v1.9.4 with more than 1 worker on the stackdriver output.

chicocvenancio · Jun 06 '22

~~Yeah, the change in 1.9.4 sets the default workers to 1 for stackdriver, which resolves this issue: there is no workers specification here, so when the default was changed it triggered #5081.~~

~~#5081 is still an issue, but this issue, where people who had not specified workers started getting failures after the change in defaults, is the one I believe is resolved.~~

~~@miguelvr does your original config now work with 1.9.4? That was my query, so we could close this and leave #5081 to track that specific issue when workers is specified.~~

EDIT: of course, now that I say that, I see the default is 2 (https://github.com/fluent/fluent-bit/blob/v1.9.4/plugins/out_stackdriver/stackdriver.c#L2550). It was cloudwatch I was thinking of: https://github.com/fluent/fluent-bit/pull/5417

patrick-stephens · Jun 10 '22

Setting workers to 1 is just a mitigation, as workers are essential for improving the throughput of the pipeline.

JeffLuoo · Aug 11 '22

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] · Nov 10 '22

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Nov 16 '22