Multiline filter crashes after buffer limit is reached, causing high CPU and memory usage

Open staniondaniel opened this issue 3 years ago • 48 comments

Bug Report

Describe the bug

Hello

The multiline filter is crashing on pods that generate a large amount of logs, after Emitter_Mem_Buf_Limit is reached. On pods with a normal/low volume of logs it works without problems.

To Reproduce

[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inode=4540032 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4540032 watch_fd=113 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:07] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4592513 watch_fd=111
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=3353782 watch_fd=112
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4540032 watch_fd=113
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inode=3353789 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=3353789 watch_fd=114 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log

  • Steps to reproduce the problem:

This is my configuration (I left only the relevant parts):

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        DB /var/log/containers/fluentbit_db.db
        Parser docker
        Tag kube.*
        Mem_Buf_Limit 10MB
        Buffer_Chunk_Size 256k
        Buffer_Max_Size 256k
        Skip_Long_Lines On
        Refresh_Interval 1
        multiline.parser docker,cri
......


    [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python
    [FILTER]
        Name kubernetes
        Buffer_Size 512k
        Match kube.*
        Merge_Log On
        Merge_Log_Key log_json
        Merge_Log_Trim On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

....

Your Environment

  • Version used: Fluent Bit v1.8.12
  • Environment name and version (e.g. Kubernetes? What version?): v1.20.4-eks
  • CPU and memory load after enabling the multiline filter: I tried increasing the memory limit and the emitter buffer limit to a few GB and the processes were still crashing (screenshot attached)

Additional context

The Fluent Bit container keeps crashing after it reaches the memory limit configured for that container. Also, a lot of logs like

[error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag:

are flooding the fluent-bit logs.

staniondaniel avatar Feb 28 '22 13:02 staniondaniel

Same issue. I generate test logs at about 10 MB/s, and fluent-bit logs a bunch of these errors (screenshot attached).

Is there any way to work around this? @edsiper

wmingj avatar Mar 03 '22 09:03 wmingj

Hi, I may have found that this issue is introduced by PR https://github.com/fluent/fluent-bit/pull/4383. When I downgrade fluent-bit to 1.8.11, it works pretty well.

wmingj avatar Mar 04 '22 06:03 wmingj

We are experiencing exactly the same issue, with the main problem being that we actually need the functionality of the previously mentioned PR for multiline logs. We use amazon/aws-for-fluent-bit and run it next to our own containers using the built-in integration with FireLens:

[2022/03/07 10:10:52] [error] [input:emitter:emitter_for_multiline.2] error registering chunk with tag: xxx-firelens-GUID

The container has 50MB of allocated memory, which used to be more than enough; now it uses up that memory in 3 days and then just starts returning errors.

@PettitWesley could you have a look if there are any issues in your implementation or if there are any workarounds?

madmarcinadamski avatar Mar 07 '22 11:03 madmarcinadamski

I have added this on my personal backlog, which unfortunately right now is very long.

My recommendation is to take a look at past issues involving the rewrite_tag filter; with the new code, the multiline filter with buffer is very similar to rewrite_tag. This same issue should have arisen with that filter as well.

PettitWesley avatar Mar 15 '22 06:03 PettitWesley

So when I send multiline logs at a very high throughput, all I get is this:

[2022/04/04 03:38:14] [ warn] [input] emitter.1 paused (mem buf overlimit)

Which can be fixed by increasing the mem_buf_limit for the emitter:

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      go, multiline-regex-test

     emitter_mem_buf_limit 1G

PettitWesley avatar Apr 04 '22 03:04 PettitWesley

Please see: https://github.com/fluent/fluent-bit/issues/5235

If you have more than one multiline filter definition and they match the same records, it can cause all sorts of trouble.

PettitWesley avatar Apr 04 '22 03:04 PettitWesley

So I want to direct everyone to this issue again: https://github.com/fluent/fluent-bit/issues/5235

It is not clear to me that there is a widespread use case that truly requires multiple filter definitions for the same logs. Please post in that issue if you do not understand or you think you have a use case that requires multiple filter definitions that match the same logs.

PettitWesley avatar Apr 05 '22 20:04 PettitWesley

#5235

It is not clear to me what you mean by multiple filter definitions. Do you mean more of these here: multiline.parser java,go,python

or multiple configs like the above in different sections:

    [INPUT]
        multiline.parser docker,cri

    [FILTER]
        multiline.parser java,go,python

I am using the above config because if I try to add the line to the INPUT section like this:

    [INPUT]
        multiline.parser java,go,python

it simply doesn't work and the log is not merged.

So this is why I am doing it in two stages:

  • first stage in "[INPUT]" to merge docker logs (because I am using EKS)
  • second in the normal "[FILTER]" section so it will merge the logs, which can come from various applications written in Go, Java, or Python

staniondaniel avatar Apr 07 '22 14:04 staniondaniel

@staniondaniel Multiple filters means like I show in the issue, like this:

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      go

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      multiline-regex-test

Instead, the CORRECT way is what you show, to have multiple parsers in a comma list.
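For reference, that would look something like this (a sketch reusing the two parsers from the example above, combined into one definition):

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      go, multiline-regex-test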

It simply doesn't work and log is not merged

That's interesting. In this case what you are doing is fine; you only have one filter definition. It's fine to use both the filter and the tail input multiline.parser keys.

PettitWesley avatar Apr 07 '22 23:04 PettitWesley

I found there is a new buffer config parameter; setting it to off seems to avoid this issue. Doc: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters

[FILTER]
    Name                  multiline
    match                 kube.*
    multiline.key_content log
    multiline.parser      java
    buffer                off

wmingj avatar May 20 '22 03:05 wmingj

@erenming With buffer off, the filter will only work with a few inputs, mainly tail, as noted in the docs.

PettitWesley avatar May 24 '22 01:05 PettitWesley

@PettitWesley Thanks very much for the reminder. Actually, we only use tail, so I think buffer off will be okay. But I may have found an issue with buffer off these past few days: multiline seems not to be working :( This is my sample config:

[SERVICE]
    flush        0.5
    daemon       Off
    grace        5
    log_level    info
    parsers_file parsers.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

[INPUT]
    name                 tail
    path                 test.log
    DB                   flb_k8s.db
    Tag                  kube.*
    read_from_head       true
    #ignore_older         5m
    Buffer_Chunk_Size    1MB
    Buffer_Max_Size      10MB
    Skip_Long_Lines      On
    Skip_Empty_Lines     On
    Refresh_Interval     30
    Rotate_Wait          60
    DB.locking           true
    DB.journal_mode      WAL

    Mem_Buf_Limit        300MB
    multiline.parser     docker, cri

[FILTER]
    Name                   multiline
    match                  *
    multiline.key_content  log
    multiline.parser       java
    buffer                 Off

[OUTPUT]
    name stdout
    match *
    format json

and sample log:

{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:25:28.826167874Z"}
{"log":"io.terminus.trantorframework.exception.BusinessException: Schedule.request.param.is.illegal\n","stream":"stdout","time":"2021-08-16T08:24:28.826172954Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.util.ValidateUtil.expectedTrue(ValidateUtil.java:39)\n","stream":"stdout","time":"2021-08-16T08:24:28.82617719Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.func.schedule.validate.ValidateScheduleParamForCalculationFuncImpl.execute(ValidateScheduleParamForCalculationFuncImpl.java:47)\n","stream":"stdout","time":"2021-08-16T08:24:28.826182475Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}

Is there any mistake in my configuration?

wmingj avatar May 27 '22 02:05 wmingj

Is there any mistake in my configuration?

I believe he meant that you need to use the parser config directly in the [INPUT] section and not in the [FILTER] section, something like:

[INPUT]
    name                 tail
......
    Mem_Buf_Limit        300MB
    multiline.parser     docker, cri, java

@PettitWesley, as an update to the original problem: after updating to the latest Fluent Bit version, I noticed that when using only multiline.parser docker,cri in the INPUT section, without any [FILTER] config, like below:

    [INPUT]
        Name tail
......
        multiline.parser docker,cri

the pods don't crash anymore due to high memory usage.

As soon as I add the [FILTER] config as below:

   [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python

The problem reoccurs: the memory slowly rises until it hits the buffer limit, which causes the logs to be flooded with [input:emitter:emitter_for_multiline.0] error registering chunk with tag:, or the pod gets killed by Kubernetes when its memory usage hits the limits set for it.

From what I noticed, this is simply an issue of the multiline filter not being able to process the data fast enough, or just not releasing the buffer once the data is processed, which causes high CPU usage and eventually leads to a crash.
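For now, the only setup that stays stable for us is the one above with the [FILTER] removed entirely; a minimal sketch, at the cost of losing the java/go/python merging (same keys as the snippets above):

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Tag kube.*
        Mem_Buf_Limit 10MB
        multiline.parser docker,cri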

staniondaniel avatar May 27 '22 08:05 staniondaniel

@PettitWesley We have been facing the same issue with aws-for-fluent-bit:2.25.0. Our custom fluentbit configuration is below.

[SERVICE]
    flush 1
    storage.path /var/log/app-storage/
    storage.sync normal

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    Mem_Buf_Limit 50MB
    storage.type filesystem

[FILTER]
    name multiline
    match *
    multiline.key_content log
    multiline.parser docker,cri,java
    Emitter_Mem_Buf_Limit 50M
    emitter_storage.type filesystem

Firelens configuration

logConfiguration = {
  logDriver = "awsfirelens"
  options = {
    "Name": "datadog",
    "apikey": "",
    "Host": "http-intake.logs.datadoghq.com",
    "net.keepalive": "false",
    "storage.total_limit_size": "5G",
    "dd_service": "<serviceName>",
    "dd_source": "ecs-fargate-dev",
    "dd_message_key": "",
    "dd_tags": "env:dev",
    "TLS": "on",
    "provider": "ecs"
  }
}

Logs do not show any error; it just fails with a segmentation fault:

timestamp message
1656096324683 [2022/06/24 18:45:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096325650 [2022/06/24 18:45:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096326704 [2022/06/24 18:45:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[... ~90 more identical HTTP status=200 lines over the next three minutes ...]
1656096480673 [2022/06/24 18:48:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096481704 [2022/06/24 18:48:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096482634 [2022/06/24 18:48:02] [engine] caught signal (SIGSEGV)
1656096483998 AWS for Fluent Bit Container Image Version 2.25.0

rajeev-netomi avatar Jun 22 '22 11:06 rajeev-netomi

Sorry folks, I may not be able to take a look at this immediately.

If someone is willing to try these techniques to get a stack trace for me that would help speed up the fix: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv

PettitWesley avatar Jun 27 '22 22:06 PettitWesley

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Sep 26 '22 02:09 github-actions[bot]

Not stale.

rajeev-netomi avatar Sep 26 '22 04:09 rajeev-netomi

unstale

PRIHLOP avatar Oct 20 '22 09:10 PRIHLOP

Having the same issue here. The container uses its entire CPU request, causing high CPU load.

    service: |
      [SERVICE]
          Daemon Off
          Flush 1
          Log_Level info
          Parsers_File parsers.conf
          Parsers_File custom_parsers.conf
          HTTP_Server On
          HTTP_Listen 0.0.0.0
          HTTP_Port 2020
          Health_Check On

    inputs: |
      [INPUT]
          Name tail
          Tag kube.*
          Path /var/log/containers/*.log
          DB /var/log/flb_kube.db
          Mem_Buf_Limit 128MB
          Skip_Long_Lines On
          Refresh_Interval 10
          Buffer_Chunk_Size 256k
          Buffer_Max_Size 100MB
          multiline.parser docker, cri

    filters: |
      [FILTER]
          Name kubernetes
          Match kube.*
          Kube_URL https://kubernetes.default.svc:443
          Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
          Kube_Tag_Prefix kube.var.log.containers.
          Merge_Log Off
          Merge_Log_Key log_processado
          Merge_Log_Trim On
          Keep_Log Off
          K8S-Logging.Parser Off
          K8S-Logging.Exclude Off
          Labels On
          Annotations Off
      [FILTER]
          Name                                 multiline
          Match                                kube.*
          multiline.key_content                log
          multiline.parser                     go, java, python
          buffer                               On
          emitter_mem_buf_limit                64M

fluentbit 1.9.9

jujubetsz avatar Oct 25 '22 20:10 jujubetsz

I'm facing the same issue when using the multiline parser together with the rewrite_tag filter. I'm using version 0.21.6 of the fluentbit helm chart, which uses fluentbit 2.0.6.

I'm using the tail input:

    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-app*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

And using the re-write tag like this:

   [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false

I can't really use the exclude filter to drop these logs, as I need them; I just want them to go to a different output than the other logs. Any idea how I can solve this issue? Is there any workaround for this?

This is my complete fluentbit configuration:

config:
  service: |
    [SERVICE]
      Daemon       Off
      Flush        {{ .Values.flush }}
      Log_Level    {{ .Values.logLevel }}
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf
      HTTP_Server  On
      HTTP_Listen 0.0.0.0
      HTTP_Port   {{ .Values.metricsPort }}
      Health_Check On

  inputs: |
    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-app-server*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-other-app*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

  filters: |
    [FILTER]
      Name                kubernetes
      Match               kube.*
      Kube_URL            https://kubernetes.default.svc:443
      Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
      Merge_Log           On
      Merge_Log_Key       data
      K8S-Logging.Parser  On
      K8S-Logging.Exclude On
    
    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes
      Add_prefix    kubernetes_

    [FILTER]
        Name          nest
        Match         kube.*
        Operation     lift
        Nested_under  kubernetes_labels
        Add_prefix    kubernetes_labels_
    
    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_annotations
      Add_prefix    kubernetes_annotations_
    
    [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false

  outputs: |
    [OUTPUT]
      Name                s3
      Match               audit-logs.*
      bucket              my-bucket-name
      region              {{ .Values.region }}
      total_file_size     10M
      upload_timeout      1m
      use_put_object      on
      compression         gzip
      auto_retry_requests true
      s3_key_format       /namespace=$TAG[1]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
      role_arn            {{ .Values.fluentbit_role_arn }}
    
    [OUTPUT]
      Name                          s3
      Match                         kube.*
      bucket                        lakefs-audit-logs-{{ .Values.region }}-{{ .Values.environment }}
      region                        {{ .Values.region }}
      total_file_size               10M
      upload_timeout                1m
      use_put_object                on
      compression                   gzip
      auto_retry_requests           true
      s3_key_format_tag_delimiters ._
      s3_key_format                 /namespace=$TAG[5]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
      role_arn                      {{ .Values.fluentbit_role_arn }}
    
    [OUTPUT]
      Name                   loki
      Match                  kube.*
      Host                   {{ .Values.loki.endpoint }}
      port                   443
      tls                    on
      tls.verify             on
      http_user              {{ .Values.loki.username }}
      http_passwd            {{ .Values.loki.password }}
      auto_kubernetes_labels true
      Labels                 collector=fluent-bit, namespace=$kubernetes_namespace_name, pod=$kubernetes_pod_name, container=$kubernetes_container_name, Region={{ .Values.region }}
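One thing that may be worth experimenting with, assuming the emitter options documented for rewrite_tag (Emitter_Mem_Buf_Limit, Emitter_Storage.type) apply to this version, is giving the rewrite_tag emitter its own, larger buffer:

    [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false
      # 100M is just an illustrative value, not a recommendation
      Emitter_Mem_Buf_Limit 100M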

idanovo avatar Dec 15 '22 07:12 idanovo

I know it's ugly, but since I really need this working, I set up a cronjob to rollout-restart the fluent-bit pods every day at 03:00 AM.

jujubetsz avatar Jan 31 '23 14:01 jujubetsz

@PettitWesley Any idea if a similar solution is supported for AWS ECS Fargate, as suggested by @jujubetsz?

rajeev-netomi avatar Mar 04 '23 07:03 rajeev-netomi

I'm currently making some unrelated improvements that involve modifying the multiline core machinery, so if anyone is able to put together a somewhat reliable reproduction (it would be awesome if it was minimized, but understandable if that's not possible) with sample input data, I could look into this.

If you need help putting together the reproduction case, just tag me here or talk to me on Slack.

leonardo-albertovich avatar Mar 04 '23 12:03 leonardo-albertovich

@rajeev-netomi I don't recommend restarting FLB at a fixed time if it's crashing... I'm not sure of a good solution here, but I think you want to restart after it crashes. Or see if you can find an older version which doesn't crash. Sorry!

PettitWesley avatar Mar 06 '23 18:03 PettitWesley

Yeah @PettitWesley. Basically, marking fluentbit as a non-essential container, so that if it crashes we do not stop the entire task and instead just restart the non-essential container. This will be very useful for scenarios where we can tolerate loss of logging for a short duration.
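For reference, in the task definition this is just the standard ECS essential flag on the log-router container; a sketch in the same Terraform-ish style as the logConfiguration above (the container name is a placeholder):

{
  name      = "log_router"  # placeholder name for the fluent-bit sidecar
  essential = false
}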

rajeev-netomi avatar Mar 14 '23 08:03 rajeev-netomi

On the contrary, if you need to stick with the version that crashes, I would recommend preventively restarting the service. Think about it: when the binary crashes, it means there probably was memory corruption, and realistically speaking, the invalid access that caused the crash was probably not the first one. So the longer you let it run, the higher the odds of whatever memory corruption issue there is affecting the integrity of the chunks (the data files that hold the ingested data while in transit).

So, ideally, I would recommend trying to downgrade to a version that doesn't crash, but if you are forced to stick with this one, then I think you are better off actively recycling the pod and retaining control, rather than waiting for it to "take care of itself".

I take it that you don't have a reliable reproduction, right?

leonardo-albertovich avatar Mar 14 '23 08:03 leonardo-albertovich

@leonardo-albertovich Currently there does not exist a stable build which does not have this issue. We have a very simple requirement: we want to concatenate multiline logs into a single log event, and for that we are using the built-in filters. Below is the configuration file.

[FILTER]
    name multiline
    match *
    multiline.key_content log
    multiline.parser java

Dockerfile

FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:2.29.1
ADD java-multi-line.conf /fluent-bit/configs/java-multi-line.conf

This container is crashing for the applications where there is a high volume of logs.

rajeev-netomi avatar Mar 14 '23 08:03 rajeev-netomi

For anyone facing similar issues: we tried a lot of different configurations, including increasing memory limits, but it didn't work. In the end, we had to replace fluent-bit with fluentd to make our logging work with the multiline parser.

ahhda avatar May 01 '23 06:05 ahhda

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jul 31 '23 01:07 github-actions[bot]

I have just opened a related issue: I found that we somehow have a regression after upgrading from 1.9.7 to 2.1.2: https://github.com/fluent/fluent-bit/issues/7782

I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.

ryan65 avatar Aug 02 '23 12:08 ryan65