
Segmentation fault on fluentbit containers

miguelvr opened this issue 3 years ago · 7 comments

Bug Report

Describe the bug

I noticed a lot of pods with a high restart count, and the pod events had several errors with exit code 139 (segmentation fault).

Your Environment

  • Version used: 1.9.2 & 1.9.3 (tried on both)

  • Configuration:

    fluent-bit.conf: |
       [SERVICE]
         Flush         1
         Log_Level     info
         Daemon        off
         Parsers_File  parsers.conf
         HTTP_Server   On
         HTTP_Listen   0.0.0.0
         HTTP_Port     2020
    
       @INCLUDE input.conf
       @INCLUDE filter.conf
       @INCLUDE output.conf
    
     input.conf: |
       [INPUT]
         Name              tail
         Tag               kube.*
         Path              /var/log/containers/*.log
         Parser            cri
         DB                /var/log/flb_kube.db
         Mem_Buf_Limit     20MB
         Skip_Long_Lines   On
         Refresh_Interval  10
         Buffer_Chunk_Size 256k
         Buffer_Max_Size   10240k
    
     parsers.conf: |
       [PARSER]
         Name          cri
         Format        regex
         Regex         ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
         Time_Key      time
         Time_Format   %Y-%m-%dT%H:%M:%S.%L%z
    
     filter.conf: |
       [FILTER]
         Name                kubernetes
         Match               kube.*
         Kube_URL            https://kubernetes.default.svc:443
         Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
         Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
         Merge_Log           On
         Keep_Log            Off
         Labels              On
         Annotations         Off
         K8S-Logging.Parser  Off
         K8S-Logging.Exclude Off
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   log message
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   msg message
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   level severity
    
       [FILTER]
         Name          lua
         Match         kube.*
         script        filter.lua
         call          stacktrace_append
    
       [FILTER]
         Name          modify
         Match         kube.*
         Hard_rename   trace         logging.googleapis.com/trace
         Hard_rename   spanId        logging.googleapis.com/spanId
         Hard_rename   traceSampled  logging.googleapis.com/trace_sampled
    
     filter.lua: |
       function stacktrace_append(tag, timestamp, record)
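         -- Fluent Bit Lua filter return codes: 2 = record was modified, keep the
         -- original timestamp; 0 = record passes through unchanged.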
         if record['message'] ~= nil and record['stacktrace'] ~= nil then
             record['message'] = record['message'] .. "\n" .. record['stacktrace']
             return 2, timestamp, record
         end
         return 0, timestamp, record
       end
    
    output.conf: |
         [OUTPUT]
           Name                          stackdriver
           Match                         kube.*
           Tag_Prefix                    kube.var.log.containers.
           resource                      k8s_container
           k8s_cluster_name              <redacted>
           k8s_cluster_location          <redacted>
           autoformat_stackdriver_trace  true
           severity_key                  severity
           log_name_key                  logName
    
  • Environment name and version: Kubernetes 1.21.11-gke.1100

  • Filters and plugins: stackdriver

miguelvr · Jun 03 '22

Probably #5081. You can try to set workers 1 on the OUTPUT as a workaround.
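
For reference, a minimal sketch of that workaround applied to the output config above; only the workers line is new, and the remaining stackdriver keys from output.conf are kept as posted:

    [OUTPUT]
        Name     stackdriver
        Match    kube.*
        workers  1

As noted further down the thread, this is only a mitigation: multiple workers are what improve the throughput of the pipeline.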

chicocvenancio · Jun 06 '22

Can you try running with valgrind? It should be available in the *-debug images.
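
In case it helps, a minimal sketch of how that could be wired into a Kubernetes deployment like this one, assuming the layout of the official images (binary at /fluent-bit/bin/fluent-bit, config at /fluent-bit/etc/fluent-bit.conf); the image tag and container name below are placeholders:

    # Hypothetical DaemonSet container override: swap in the -debug image
    # and wrap the usual entrypoint in valgrind so the crash gets a trace.
    containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.9.3-debug
        command: ["valgrind"]
        args: ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf"]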

tarruda · Jun 06 '22

> probably https://github.com/fluent/fluent-bit/issues/5081. You can try to set workers 1 on the OUTPUT as a workaround.

This seems to solve it.

miguelvr · Jun 06 '22

Is it still there in the 1.9.4 images? These are live now; I think a few issues with the change to default workers were resolved, but I'm not sure if this is one of them.

patrick-stephens · Jun 06 '22

@patrick-stephens still getting SIGSEGV in v1.9.4 with more than 1 worker on the stackdriver output.

chicocvenancio · Jun 06 '22

~~Yeah, the change in 1.9.4 sets the default workers to 1 for stackdriver, which resolves this issue: there is no workers specification here, so when the default was changed it triggered #5081.~~

~~#5081 is still an issue, but this issue, where people who had not specified workers started getting failures after the change in defaults, is the one I believe is resolved.~~

~~@miguelvr does your original config now work with 1.9.4? That was my query, so we could close this and leave #5081 to track that specific issue when workers is specified.~~

EDIT: of course, now that I say that, I see the default is 2 (https://github.com/fluent/fluent-bit/blob/v1.9.4/plugins/out_stackdriver/stackdriver.c#L2550). It was cloudwatch I was thinking of: https://github.com/fluent/fluent-bit/pull/5417

patrick-stephens · Jun 10 '22

Setting workers to 1 is just a mitigation, as workers are essential for improving the throughput of the pipeline.

JeffLuoo · Aug 11 '22

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] · Nov 10 '22

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Nov 16 '22