
Fluent-bit stops processing logs under high load, locks writing in ch_manager pipe

Open grep4error opened this issue 5 years ago • 63 comments

Bug Report

My fluent-bit 1.5.7 is running in a container in a k8s (AKS) environment. It's configured to collect docker logs (33 tail inputs) and send them to elasticsearch (33 outputs), with a few filters in between. Recently, as the volume of logs per node increased, fluent-bit started sporadically freezing up. The process keeps running, consumes 0% CPU, and processes neither new logs nor the filesystem storage backlog, but it still responds to monitoring queries on its http port. After some debugging with strace and gdb, I found that it locks up attempting to write to the ch_manager pipe. Here's the stack trace:

0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2  0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3  0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
    tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
    ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4  0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5  0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x3039663238613165 in ?? ()
#

strace (filtered by reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds):

...
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\200\25\20\2\0\0\0", 8)    = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\200\25\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

It looks like the elasticsearch outputs may send so many responses back to the inputs at the same time that the pipe fills up and write() blocks. But the inputs run in the same thread, so they can't read the responses from the pipe, and fluent-bit locks up.
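
To illustrate the failure mode in isolation, here's a rough standalone sketch (hypothetical, not fluent-bit code): a single thread that is both the writer and the only reader of a pipe will block forever once the pipe buffer fills up.

/* Illustration only (not fluent-bit code): a single thread that is both
 * the writer and the only reader of a pipe blocks forever once the pipe
 * buffer (64 KiB by default on Linux) fills up. */
#include <unistd.h>

int main(void)
{
    int ch[2];
    char event[8] = {0};

    if (pipe(ch) != 0) {
        return 1;
    }

    for (;;) {
        /* simulates outputs signalling completion: once the buffer is
         * full this write() blocks, and the only reader (this same
         * thread / event loop) never gets a chance to drain ch[0] */
        if (write(ch[1], event, sizeof(event)) < 0) {
            return 1;
        }
    }
}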

I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least reading its current size), but ioctl fails to get or set the pipe size. See the snippet below; I added the last line.

flb_engine.c:

    /*
     * Create a communication channel: this routine creates a channel to
     * signal the Engine event loop. It's useful to stop the event loop
     * or to instruct anything else without break.
     */
    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }

    flb_pipe_set_nonblocking(&config->ch_manager[1]); /* <----- I made it non-blocking ------- */

There's probably a cleaner way to fix it, but this one-liner worked for me. I now get an occasional "resource not available" error in the log, but fluent-bit survives and keeps crunching logs.
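
As a side note, on Linux the pipe capacity is normally queried and resized with fcntl(F_GETPIPE_SZ / F_SETPIPE_SZ) rather than ioctl. A rough, untested sketch of that approach (not fluent-bit code):

/* Rough, untested sketch (not fluent-bit code): querying and growing the
 * pipe capacity on Linux uses fcntl(), not ioctl(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

static int grow_pipe(int fd, int new_size)
{
    int cur = fcntl(fd, F_GETPIPE_SZ);   /* current capacity, 65536 by default */
    if (cur < 0) {
        return -1;
    }
    printf("pipe capacity: %d bytes\n", cur);

    /* ask the kernel for a larger buffer (bounded by
     * /proc/sys/fs/pipe-max-size for unprivileged processes) */
    if (fcntl(fd, F_SETPIPE_SZ, new_size) < 0) {
        return -1;
    }
    return 0;
}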

Environment

Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
Kubernetes: 1.16.13
Docker: 3.0.10+azure
OS: Ubuntu 16.04.1

grep4error avatar Oct 09 '20 02:10 grep4error

Awesome, thanks @grep4error for all the help. @edsiper, this is affecting more people: https://github.com/fluent/fluent-bit/issues/2621 & https://github.com/fluent/fluent-bit/issues/2577. Please look into it; we can help validate a fix. cc @tomerleib

avdhoot avatar Oct 09 '20 06:10 avdhoot

I've done two more tests in this area:

  1. Downgraded the image to 1.3.6 - there was a slight improvement: only one pod encountered the issue and froze; 2/3 pods of the DaemonSet continue to work.
  2. Removed all the tails and left a single input and a single stdout output - no issues; Fluent Bit has been running for almost 3 days and nothing stopped.

tomerleib avatar Oct 11 '20 06:10 tomerleib

@grep4error do you have an image that you can share with me and I will test it in my environment as well?

tomerleib avatar Oct 11 '20 12:10 tomerleib

No change for me with 1.5.7 and this line; it still stopped processing after 20-30 minutes.

However, I created an image for 1.7.0 (yes, I know it's not released, but I forgot to change the branch) and everything worked for more than 14 hours.

tomerleib avatar Oct 12 '20 03:10 tomerleib

We have the same issue; for now we will revert to 1.3.x, which does not seem to have this problem.

mtparet avatar Oct 12 '20 10:10 mtparet

@mtparet which 1.3.x version have you tested? I tested with 1.3.7 and saw the same issue, though not across the board (2/3 pods are running and 1 froze).

tomerleib avatar Oct 12 '20 10:10 tomerleib

It was 1.3.9. I cannot guarantee we had no freezes, but I did not observe any.

mtparet avatar Oct 12 '20 14:10 mtparet

Tested 1.3.9; after 28 hours one of the fluent-bit pods froze again...

tomerleib avatar Oct 15 '20 04:10 tomerleib

Hello @edsiper, do you acknowledge the issue, meaning we need to find a fix for it? Or should this not be happening, with something else outside of fluent-bit being broken?

mtparet avatar Nov 16 '20 13:11 mtparet

FYI: I am taking a look at this (WIP)

edsiper avatar Nov 16 '20 13:11 edsiper

I am thinking about a solution. Making the channel non-blocking is right, but when EAGAIN is returned while trying to write to a full pipe, extra care is needed, since the notification from the output saying "I am done" or "I need a retry" would be missed.
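
Roughly, that failure mode looks like this (illustration only, not the actual fix):

/* Illustration only: with the pipe in non-blocking mode, a full buffer
 * makes write() fail with EAGAIN and the 8-byte completion event from
 * the output ("done" or "retry") is silently lost unless the caller
 * queues it and tries again later. */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

static int notify_event_loop(int fd, uint64_t event)
{
    ssize_t ret = write(fd, &event, sizeof(event));
    if (ret == (ssize_t) sizeof(event)) {
        return 0;                    /* notification delivered */
    }
    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* pipe is full: without extra handling this flush result is
         * never seen by the engine, so the task is left dangling */
        return -1;
    }
    return -1;                       /* other write error */
}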

Work in progress

edsiper avatar Nov 16 '20 13:11 edsiper

Indeed, at high load the ch_manager gets saturated; the obvious solution seems to be a pool of pipes/channels for the output plugins. Until now we were overloading the internal event loop channel manager, and that was fine up to a certain load. I will work on a POC with independent notification channels.

Work in progress

edsiper avatar Nov 16 '20 15:11 edsiper

Thanks a lot for clarifying and for your work on this, @edsiper!

mtparet avatar Nov 16 '20 15:11 mtparet

Indeed at high load ch_manager get's saturated

Could increasing the allocated CPU be a workaround?

mtparet avatar Nov 16 '20 15:11 mtparet

Not necessarily; the point is to have separate channels for that specific kind of notification.

edsiper avatar Nov 16 '20 16:11 edsiper

@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs

I've pushed a possible solution in the ch_manager branch; would you please build and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback.

About the solution: every output plugin instance now has its own channels to notify events through the event loop, so it does not saturate the main engine ch_manager.
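
Conceptually, the change is along these lines (a simplified sketch, not the literal patch):

/* Simplified sketch of the idea (not the literal patch): each output
 * instance carries its own notification channel which is registered in
 * the engine event loop, instead of every flush result funneling
 * through the single engine-wide ch_manager pipe. */
#include <unistd.h>

struct out_instance {
    int ch_events[2];  /* [0] read end (engine), [1] write end (output flush) */
    /* ... rest of the instance context ... */
};

static int out_instance_channel_create(struct out_instance *ins)
{
    if (pipe(ins->ch_events) != 0) {
        return -1;
    }
    /* the read end would then be added to the engine event loop
     * (e.g. via mk_event_add), so saturation of one instance's channel
     * no longer stalls the rest of the engine */
    return 0;
}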

edsiper avatar Nov 17 '20 03:11 edsiper

Ping, any feedback?

edsiper avatar Nov 20 '20 16:11 edsiper

I haven't taken the time to build/push/test it yet, but I will.

mtparet avatar Nov 23 '20 09:11 mtparet

@edsiper do you have any recommendations for a workaround when running on kubernetes? Basically, detecting a freeze and restarting is a temporary solution of sorts. One option seems to be creating an additional file output and configuring the k8s liveness check to verify that this file is regularly updated; if not, restart. Does this sound like a viable option to you? Or maybe something like the probe below, relying on the prometheus metrics: https://github.com/fluent/fluent-bit/issues/2591

        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -exc
            - >
              function getOutTotalF0() { curl -s --connect-timeout 5 --retry 3 -f http://localhost:65012/api/v1/metrics/prometheus |grep '^fluentbit_output_proc_records_total{name="forward.0"}' | awk '{print $2}' ;};
              T=$(getOutTotalF0);test -e ./m_oprtf0 && { test $(cat ./m_oprtf0) -ge $T && { echo "OutputRecordsTotal Forward0 freeze: $T"; exit 1;} ;}; test $T -ge -10 && { echo "$T" >| ./m_oprtf0; } || true;
          initialDelaySeconds: 120
          periodSeconds: 60
          failureThreshold: 8
          timeoutSeconds: 5

tpetrov-lp avatar Nov 24 '20 18:11 tpetrov-lp

Right now I do not have such an environment (high load meaning pods coming & going every minute), so I am not able to validate the fix. Sorry. cc @edsiper

avdhoot avatar Nov 25 '20 11:11 avdhoot

@edsiper Sorry for the delay, I've started testing it. I will let it run for a couple of days; once I have some insight I will update you.

tomerleib avatar Nov 25 '20 13:11 tomerleib

@edsiper We're running into the same issue. I'm building the Docker image and deploying it here too and will try to report back. If anyone wants to quickly consume it, we've published it to our registry:

registry.public.mtl1.vexxhost.net/openstack/fluent-bit:1.6.6-1

If you're using the Helm charts, this should do:

image:
  repository: registry.public.mtl1.vexxhost.net/openstack/fluent-bit
  tag: 1.6.6-1

mnaser avatar Nov 25 '20 17:11 mnaser

@edsiper here's a quick update: high load, 3-node cluster. The DaemonSet has been running for 10 hours; currently 2/3 of the pods are still processing logs. One pod stopped processing any logs after 8 hours, with nothing in its log and nothing in the K8s events for this pod. Another pod that is still running reported that it was terminated some time ago with error code 135.

tomerleib avatar Nov 25 '20 23:11 tomerleib

Version 1.6.6 has been deployed in our production (13 big nodes) since yesterday; we will know in 2-3 days whether the fix is working for us.

mtparet avatar Nov 27 '20 13:11 mtparet

I will start testing with v1.6.6 as well and see whether it's working or not. I believe that by tomorrow I will know for sure if at least one pod has stopped working.

tomerleib avatar Nov 29 '20 12:11 tomerleib

So far, there are no more freezes on our side.

mtparet avatar Dec 02 '20 09:12 mtparet

I will start testing it simultaneously, first with the @mnaser image. @mtparet, can you share the image that you're using? In my environment, when I built the image myself, it stopped working.

tomerleib avatar Dec 03 '20 09:12 tomerleib

@tomerleib I am using version 1.6.6 with this configuration:

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_Tag_Prefix     kube.var.log.containers.
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     6MB
    Skip_Long_Lines   On
    read_from_head    on
    DB                /tail-db/tail-containers-state.db
    DB.Sync           Off
    DB.locking        true
[OUTPUT]
    Name  es
    Match *
    Host  XXXX
    Port  443
    Logstash_Format On
    Retry_Limit 5
    Type  _doc
    Trace_Error true
    Time_Key @timestamp-flb
    Replace_Dots On

    HTTP_User XXX
    HTTP_Passwd XXXX

    tls on
    tls.verify on

mtparet avatar Dec 03 '20 10:12 mtparet

Small clarification: when folks here say they're using 1.6.6, do they mean upstream official 1.6.6 or our patched version? I think knowing that would help the maintainers. My choice of 1.6.6-1 reflects the assumption that it is 1.6.6 + 1 patch.

Also, 1.6.7 was released officially a few hours ago too.

mnaser avatar Dec 03 '20 13:12 mnaser

Small clarification: when folks here say they're using 1.6.6, do they mean upstream official 1.6.6 or our patched version? I think knowing that would help the maintainers. My choice of 1.6.6-1 reflects the assumption that it is 1.6.6 + 1 patch.

Also, 1.6.7 was released officially a few hours ago too.

Now, 1.6.8 is out :D

cabrinha avatar Dec 03 '20 19:12 cabrinha