
Fluent Bit is stuck: flb chunks are not flushed and data is not being sent to Elasticsearch

Open joshisumit opened this issue 2 years ago • 55 comments

Bug Report

Describe the bug: Fluent Bit sends data to Elasticsearch and works fine for some time, then it gets stuck and stops sending data to Elastic. On checking the Fluent Bit pod logs, I see the three issues below (full logs attached below):

- [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
- [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
- [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
- (even though the file '/var/log/kube-apiserver-audit.log' exists inside the pod)

I referred to similar issues, but still no luck:

  • As per https://github.com/fluent/fluent-bit/issues/3581#issuecomment-1081107592, I tried the 1.9.1-debug version as well and still see the same results
  • https://github.com/fluent/fluent-bit/issues/4336

To Reproduce

  • Start Fluent Bit as a DaemonSet on Kubernetes and add node affinity so it runs only on the master node
  • Use the tail plugin to collect API server audit logs (generated at /var/log/kube-apiserver-audit.log)
  • Refer to the configs in the Configuration section below

After that, I frequently see the logs below in the Fluent Bit pod:

[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351188.130831321.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351188.130831321.flb
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecc7b68 135 in the next 8 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecd7478 191 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb0b48 104 in the next 10 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02588 85 in the next 7 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed02498 64 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ed022e0 30 in the next 11 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1eccae08 147 in the next 9 seconds
[2022/03/30 10:18:38] [ info] [task] re-schedule retry=0x7f6a1ecb6c28 28 in the next 9 seconds

....

[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scanning path /var/log/kube-apiserver-audit.log
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/kube-apiserver-audit.log, inode 1882
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/kube-apiserver-audit.log'
[2022/03/30 10:18:38] [debug] [input:tail:tail.0] inode=1882 events: IN_MODIFY 
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648635501.574187623.flb mapped OK
[2022/03/30 10:18:38] [debug] [storage] [cio file] alloc_size from 1114112 to 1245184
[2022/03/30 10:18:38] [debug] [storage] [cio file] synced at: tail.0/1-1648635501.574187623.flb
[2022/03/30 10:18:38] [debug] [input chunk] update output instances with new chunk size diff=128565
[2022/03/30 10:18:38] [debug] [retry] new retry created for task_id=517 attempts=1
[2022/03/30 10:18:38] [ warn] [engine] failed to flush chunk '1-1648141166.449162875.flb', retry in 8 seconds: task_id=517, input=storage_backlog.1 > output=es.0 (out_id=0)
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351224.176425521.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351224.176425521.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351264.88207341.flb mapped OK
[2022/03/30 10:18:38] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:1-1648351264.88207341.flb
[2022/03/30 10:18:38] [debug] [storage] tail.0:1-1648351301.408233326.flb mapped OK

On checking further inside the Fluent Bit pod, I see that chunks are not cleared and are filling up the disk:

root@fluent-bit-audit-drbxj:/var/log# du -sh flb-storage/tail.0/
26G     flb-storage/tail.0/ <------

root@fluent-bit-audit-qbkw8:/var/log# ls -la kube-apiserver-audit*
-rw------- 1 root root 104857010 Mar 31 08:19 kube-apiserver-audit-2022-03-31T08-19-39.831.log
-rw------- 1 root root  95179025 Mar 31 08:46 kube-apiserver-audit.log
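
The backlog can also be inspected through Fluent Bit's monitoring API; a minimal sketch, assuming the HTTP server on port 2020 as in the config below and that storage.metrics has been enabled in [SERVICE] (it is not in the config shown here):

# Count the buffered chunk files on disk (path comes from storage.path)
find /var/log/flb-storage/tail.0 -name '*.flb' | wc -l

# Storage-layer metrics (chunk counts per input); requires "storage.metrics on"
curl -s http://127.0.0.1:2020/api/v1/storage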

Expected behavior

  • Keep sending data to Elasticsearch; it should not get stuck
  • It should flush chunk files (*.flb files) regularly (in my case the directory is /var/log/flb-storage/tail.0/)

Your Environment

  • Version used: 1.8.15 and 1.9.1-debug (tested on both)
  • Configuration (fluentbit configuration):
  # Configuration files: server, input, filters and output
  # ======================================================
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 50M
    @INCLUDE input-audit.conf
    @INCLUDE output-elasticsearch.conf

  input-audit.conf: |
    [INPUT]
        Name              tail
        Tag               kube-audit.*
        Path              /var/log/kube-apiserver-audit.log
        Parser            json
        DB                /var/log/flb_kube_audit.db
        storage.type      filesystem
        Mem_Buf_Limit     800MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        Buffer_Chunk_Size 4096k
        Buffer_Max_Size   9000k

  output-elasticsearch.conf: |
    [OUTPUT]
        Name            es
        Match           kube-audit.*
        Host            ${KUBE_ELASTICSEARCH_HOST}
        Port            ${KUBE_ELASTICSEARCH_PORT}
        Index           kube-audit-dev
        Buffer_Size     False
        Generate_ID     on
        AWS_Auth        On
        AWS_Region      us-east-1
        Replace_Dots    On
        Retry_Limit     False
        tls             On
        Trace_Error     On
        Logstash_Format   On
        Logstash_Prefix   kube-audit-dev

  parsers.conf: |
    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.4
  • Server type and version: Running as a Docker image on the Kubernetes master node (using Docker image 1.9.1-debug)
  • Operating System and version: Linux
  • Filters and plugins: Plugin: tail, es

Additional context

  • I am running a separate copy of Fluent Bit as a DaemonSet (which runs only on the master node) to collect API server audit logs (generated at /var/log/kube-apiserver-audit.log).
  • Initially it works fine. After some time, chunks pile up (failed to flush), I see retry messages in the log, and Fluent Bit stops sending data to Elasticsearch.

joshisumit avatar Mar 31 '22 09:03 joshisumit

@lecaros @edsiper

Need your inputs on this. Let me know if I am missing anything or have misconfigured something in the Fluent Bit configuration.

joshisumit avatar Mar 31 '22 09:03 joshisumit

Same issue with fluent-bit:1.9.2.

dyipon avatar Apr 09 '22 08:04 dyipon

Hi @joshisumit, do you see the input records metric still going up (i.e. it continues reading events)? With debug log_level enabled, we should see more details about what's happening with your chunks. Can you share the full log, or at least its tail (while it's no longer sending events to es)? Thanks
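
For reference, with the HTTP server enabled as in the config above, those counters can be read from the monitoring endpoint; a minimal sketch (port 2020 as configured, jq assumed to be available):

# Input plugins report "records"/"bytes"; outputs report "proc_records",
# "errors", "retries" and "retries_failed"
curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.input, .output'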

lecaros avatar Apr 22 '22 21:04 lecaros

@lecaros I think I had network issues, but now it's working without any issues.

dyipon avatar Apr 22 '22 21:04 dyipon

Thanks @dyipon for confirming. Can you @joshisumit confirm you still have this issue?

lecaros avatar Apr 22 '22 22:04 lecaros

I am still seeing this issue with the forward output and disk buffering on fluentbit 1.9.3.

[2022/05/04 19:15:58] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:02] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:08] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:14] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:20] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 7 seconds
[2022/05/04 19:16:27] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:31] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:37] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:41] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 6 seconds
[2022/05/04 19:16:47] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 4 seconds
[2022/05/04 19:16:51] [ info] [task] re-schedule retry=0x7fe1729c8220 7 in the next 7 seconds

fpytloun avatar May 04 '22 19:05 fpytloun

A poor man's workaround might be to delete the stuck chunks, e.g. periodically delete everything not modified in the last hour:

find /var/log/flb-storage -type f -name "*.flb" -mmin +60 | xargs rm -f || true

fpytloun avatar May 05 '22 11:05 fpytloun

@fpytloun do you have a repro for this issue? The messages you posted don't necessarily mean you are having a problem.

Thanks

lecaros avatar May 23 '22 20:05 lecaros

Unfortunately, I was not able to find out what's causing this. I just know that it's a persistent issue (it repeats even after a restart until the disk buffer is wiped), so I assume it could be e.g. some garbage inside the buffered messages?

With memory buffering and without retries, the affected event is simply dropped and this issue doesn't occur.

fpytloun avatar May 23 '22 20:05 fpytloun

Hi everyone, in order to address this, we need a reproducible scenario. If any of you have repro steps and can answer other questions posted to troubleshoot this, please provide them and we'll gladly assist.

lecaros avatar May 24 '22 20:05 lecaros

Hi, I got the same issue and resolved the problem by deleting the *.flb files.

To explain my situation in detail, I was using aws-for-fluent-bit version 2.10.0 (fluent-bit 1.6.x) and crashes kept occurring, so I upgraded to 2.25.1 (fluent-bit 1.9.3). I wanted to use the stable version of aws-for-fluent-bit, but in the stable version an "invalid backlog" error kept occurring, so I upgraded to a 1.9.x-based version that can skip the error.

My fluent-bit situation is shown in the screenshots below. The pod with the error has unusually high memory usage (6xx ~ 8xx). (Screenshot: "Screen Shot 2022-06-01 at 1 09 58 AM")

The number of flb files is also unusually high (hundreds of thousands to millions). (Screenshot: "Screen Shot 2022-06-01 at 1 08 40 AM")

My guess is that so many flb files could not be flushed because of the crashes under 1.6.x. So it seems to be a problem that occurs when there are many flb files.

swkoo-bhsn avatar May 31 '22 18:05 swkoo-bhsn

I changed the configuration a little to switch from forward to http output and am still facing this issue when enabling the storage buffer and retries. It seems to happen every time about 10 minutes after starting Fluent Bit. There is no error in the logs or any other clue, and the Fluent Bit healthcheck still thinks it's healthy.

This is a pretty serious issue and I have no idea why everyone isn't hitting it. I can provide the oldest flb file (i.e. the one that might have caused the deadlock).

Here's the full config that I am running: https://pastebin.com/3JSwBK2k
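
On the "healthcheck still thinks it's healthy" point: the built-in health endpoint only reflects error and retry-failure counters over a recent window, not the state of the on-disk buffer, so comparing it with the storage endpoint can be telling. A sketch, assuming the HTTP server on port 2020 with Health_Check and storage.metrics enabled (neither is guaranteed by the config linked above):

# Health endpoint: HTTP 200 while error/retry-failure counts stay under the
# configured thresholds for the check period
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:2020/api/v1/health

# Storage endpoint: shows how many chunks are actually queued up
curl -s http://127.0.0.1:2020/api/v1/storage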

fpytloun avatar Jul 19 '22 09:07 fpytloun

I found out that my Fluent Bit is sending events with non-UTF-8 characters (e.g. UTF-16/UTF-32 encoded), which are logged on Vector's HTTP sink (Fluent Bit's upstream) as errors while parsing the JSON payload.

So I was wondering whether these garbled characters might cause the deadlock. I tried to ensure these characters are not present, and so far it seems the issue went away.

Besides ensuring this on the log source side, I also tried this Lua filter to escape non-ASCII characters:

function deep_replace(obj, source, target)
   if obj then
      if type(obj) == "table" then
         local result = {}
         for k,v in pairs(obj) do
            result[k] = deep_replace(v, source, target)
         end
         return result
      elseif type(obj) == "string" then
         return obj:gsub(source, target)
      else
         return obj
      end
   end

   return obj
end


function escape_non_ascii(tag, timestamp, record)
   local new_record = deep_replace(record, "([\128-\255])", "\\\\%1")
   return 1, timestamp, new_record
end

    [FILTER]
        Name    lua
        Match   *
        Script  /fluent-bit/scripts/filters.lua
        Call    escape_non_ascii
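
As a rough check on the "log source side" mentioned above, files containing non-ASCII bytes can be located with GNU grep; the path here is only an example, point it at whatever the tail input reads:

# List source log files containing any byte outside the ASCII range
# (-a treats binary as text, -l prints file names, -P enables PCRE)
grep -alP '[^\x00-\x7F]' /var/log/containers/*.log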

UPDATE: OK, maybe I was too optimistic. I found one instance hitting backpressure for no obvious reason:

rewrite_kube (emitter)
│
├─ status
│  └─ overlimit     : yes
│     ├─ mem size   : 56.2M (58912182 bytes)
│     └─ mem limit  : 9.5M (10000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 41
│  ├─ new           : 0
│  ├─ running       : 41
│  └─ size          : 56.2M (58912182 bytes)
│
└─ chunks
   └─ total chunks  : 510
      ├─ up chunks  : 42
      ├─ down chunks: 468
      └─ busy chunks: 41
         ├─ size    : 56.2M (58912182 bytes)
         └─ size err: 0

UPDATE 2: over a few hours, all instances got deadlocked again. There is no error in the logs except that the total storage size (1 GB) of the http output plugin has been reached:

[error] [input chunk] chunk 7-1661314436.598970034.flb would exceed total limit size in plugin http.0
[error] [input chunk] no available chunk

The storage is full of chunks from the rewrite_tag emitter.
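
For what it's worth, the status tree above is the kind of output Fluent Bit's "dump internals" feature prints when it receives SIGCONT; a minimal way to trigger it in a pod (pod name and namespace are examples, and the debug image is assumed since the distroless image has no shell):

# Ask fluent-bit (PID 1 in the container) to dump its internal state of
# inputs, tasks and chunks to its own log output
kubectl exec -n logging fluent-bit-xxxxx -- sh -c 'kill -CONT 1'

# Read the dump from the pod logs
kubectl logs -n logging fluent-bit-xxxxx | tail -n 100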

fpytloun avatar Aug 23 '22 16:08 fpytloun

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Nov 23 '22 02:11 github-actions[bot]

Not stale

aslafy-z avatar Nov 23 '22 07:11 aslafy-z

@lecaros

Hi everyone, in order to address this, we need a reproducible scenario. If any of you have repro steps and can answer other questions posted to troubleshoot this, please provide them and we'll gladly assist.

One of the problems I have with this is that it only occurs when Elastic has issues, and we have not been able to figure out how to force the problem. It can run for days or weeks without the issue, and then when Elastic has issues that cause rejections, this occurs. It's difficult to run fluent-bit with debug logging enabled for the length of time necessary to reproduce this. It does appear to occur when Elastic blocks ingestion for a significant period of time (at least a few hours). Is there anything you can tell us about how fluent-bit handles retrying flb chunks? It's interesting that the workaround above of deleting older flb files seems to unblock things. Is there anything in that which would make sense for us to gather more targeted debugging?
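
One low-effort data point that may help here is the age and name of the oldest buffered chunk, which can be matched against the chunk names in the "failed to flush chunk" warnings; a sketch, assuming the filesystem storage path used earlier in this thread:

# Show the oldest chunk files still sitting in the buffer (entries after
# the "total" line); the names appear in the engine's flush warnings
ls -ltr /var/log/flb-storage/tail.0/ | head -n 6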

awangptc avatar Dec 19 '22 19:12 awangptc

I'm having a hard time collecting the logs when this occurs with debug on. The issue is that when the error condition occurs, our fluent-bit logs are no longer being sent to Elastic, and the sheer volume of debug logs causes our k8s nodes to rotate the logs more quickly than I can pull them. I'm sure they're in flb files, but I don't know how to extract them.
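
One way to keep the debug output despite node-side log rotation is to stream it off the node into a file, or let fluent-bit write its own log to a file; pod name and namespace below are only examples:

# Stream the pod's log from the kubelet and keep a local copy; whether the
# stream survives a rotation depends on the kubelet version
kubectl logs -f -n logging fluent-bit-xxxxx --timestamps | tee /tmp/fluent-bit-debug.log

# Alternatively, fluent-bit can write its log to a file itself via the
# -l/--log_file flag (or Log_File in [SERVICE]) pointed at a mounted path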

The scenario that causes the problem for us is that something causes Elasticsearch to start rejecting requests. We have seen two conditions that cause this:

  1. excessive load, resulting in Elastic sending back 429s to requests
  2. Elastic running out of disk space

When these occur, fluent-bit stops sending logs to Elastic but continues to read from tail into new flb files.

If we restart fluent-bit, it attempts to reprocess the flb files, but if there are multiple nodes, restarting all of the fluent-bit pods will very quickly overwhelm Elastic and Elastic will start to throttle with 429s.

This results in a high retry/failure rate. We have our retry limit set to 10. If the limit is reached, what happens to the flb files? Do they get removed for that transaction? If not, that might be part of our issue. I've discovered that if I rotate each fluent-bit pod one at a time, Elastic does not get overloaded and it is able to work through the backlog of flb files. But that's a time-consuming process, of course.
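
Rolling the pods one at a time can also be done with a single command: with the default RollingUpdate strategy and maxUnavailable of 1, a DaemonSet restart replaces one pod at a time (DaemonSet name and namespace are examples):

# Restart fluent-bit pods one node at a time and wait for completion
kubectl -n logging rollout restart daemonset/fluent-bit
kubectl -n logging rollout status daemonset/fluent-bit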

awangptc avatar Dec 19 '22 21:12 awangptc

I met the same error, but I use Loki as the backend: https://github.com/fluent/fluent-bit/issues/6670

allenhaozi avatar Jan 11 '23 09:01 allenhaozi

I have the same issue with fluentd as a backend (using logging-operator)

aslafy-z avatar Jan 11 '23 13:01 aslafy-z

Hello @awangptc,

If a limit is reached what happens to the flb files?

flb files are no longer retried and are deleted from the storage.

Are you all using the latest image? We've recently (last month) merged some fixes related to irrecoverable chunks handling.

lecaros avatar Jan 17 '23 21:01 lecaros

@lecaros, I tried 2.0.5 when we first ran into the problem, in the hope that it would work better, and it was incredibly unstable:

#5  0x557ee21c1b16      in  flb_malloc() at include/fluent-bit/flb_mem.h:79
#6  0x557ee291afd8      in  ares_init_options() at lib/c-ares-1.18.1/src/lib/ares_init.c:114
#7  0x557ee21c66b2      in  flb_net_dns_lookup_context_create() at src/flb_network.c:969
#8  0x557ee21c690a      in  flb_net_getaddrinfo() at src/flb_network.c:1035
#9  0x557ee21c6eef      in  flb_net_tcp_connect() at src/flb_network.c:1231
#10 0x557ee22758f1      in  flb_io_net_connect() at src/flb_io.c:130
#11 0x557ee21f1206      in  create_conn() at src/flb_upstream.c:533
#12 0x557ee21f1714      in  flb_upstream_conn_get() at src/flb_upstream.c:682
#13 0x557ee247790c      in  cb_es_flush() at plugins/out_es/es.c:808
#14 0x557ee21b6da0      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:519
#15 0x557ee2be6e46      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
free(): corrupted unsorted chunks
[2022/11/18 18:36:26] [engine] caught signal (SIGSEGV)
#0  0x7f0d0e9a7602      in  ???() at ???:0
#1  0x7f0d0ea00767      in  ???() at ???:0
#2  0x7f0d0ea07a59      in  ???() at ???:0
#3  0x7f0d0ea09121      in  ???() at ???:0
#4  0x560ea4114175      in  flb_free() at include/fluent-bit/flb_mem.h:120
#5  0x560ea411540a      in  flb_sds_destroy() at src/flb_sds.c:470
#6  0x560ea43f8b4f      in  elasticsearch_format() at plugins/out_es/es.c:579
#7  0x560ea43f999e      in  cb_es_flush() at plugins/out_es/es.c:814
#8  0x560ea4138da0      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:519
#9  0x560ea4b68e46      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

Unfortunately I haven't been able to revisit it. The changelog shows a bunch of improvements in memory-related areas. Do you think issues like that have been addressed? If so, I'll take a closer look at the latest 2.0 builds and see if they make a difference.

awangptc avatar Jan 18 '23 18:01 awangptc

Please try the latest image. Either 2.0.8 or even nightly builds.

lecaros avatar Jan 18 '23 19:01 lecaros

Having a similar issue with public.ecr.aws/aws-observability/aws-for-fluent-bit:init-2.31.2 sending logs to AWS OpenSearch.

[2023/02/28 20:41:40] [ warn] [engine] failed to flush chunk '6-1677616889.860927077.flb', retry in 21 seconds: task_id=2, input=forward.1 > output=opensearch.1 (out_id=1)
[2023/02/28 20:41:41] [ warn] [engine] chunk '6-1677616877.850448169.flb' cannot be retried: task_id=1, input=forward.1 > output=opensearch.1
[2023/02/28 20:41:42] [ warn] [engine] failed to flush chunk '6-1677616901.873936214.flb', retry in 10 seconds: task_id=1, input=forward.1 > output=opensearch.1 (out_id=1)
[2023/02/28 20:41:42] [ warn] [engine] failed to flush chunk '6-1677616901.874024444.flb', retry in 6 seconds: task_id=4, input=forward.1 > output=opensearch.1 (out_id=1)
[2023/02/28 20:41:43] [ warn] [engine] chunk '6-1677616877.850342703.flb' cannot be retried: task_id=0, input=forward.1 > output=opensearch.1
[2023/02/28 20:41:48] [ warn] [engine] failed to flush chunk '6-1677616901.874024444.flb', retry in 7 seconds: task_id=4, input=forward.1 > output=opensearch.1 (out_id=1)

rectalogic avatar Feb 28 '23 20:02 rectalogic

We've been running fluent-bit 2.0.8 for about a month now and things seem more stable. We haven't had the kind of catastrophic Elastic outage that seemed to trigger the problem before (Elastic being down for hours at a time), but we have had Elastic down for 20-30 minutes or so at a time, and so far we haven't seen fluent-bit 2.0.8 get stuck. We're planning to implement a failure situation - shutting down Elastic, or blocking its port from the client - for a few hours at a time to see if the issue still occurs, but it's a little lower on our priority list.
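
A sketch of the "block its port" variant, run on a client node; the host and port here are placeholders for the real Elasticsearch endpoint:

# Drop outbound traffic to Elasticsearch for the duration of the test...
iptables -A OUTPUT -p tcp -d elasticsearch.example.com --dport 9200 -j DROP

# ...and remove the rule afterwards
iptables -D OUTPUT -p tcp -d elasticsearch.example.com --dport 9200 -j DROP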

awangptc avatar Feb 28 '23 21:02 awangptc

@awangptc would you mind sharing your fluent-bit config? I am running into issues with logs not being forwarded. Only one pod's logs (a base nginx test pod) show up in Kibana; I cannot see my application logs and I am not sure why. I get various issues no matter what I set. Here is my config:

apiVersion: v1
items:
- apiVersion: v1
  data:
    custom_parsers.conf: |
      [PARSER]
          Name docker_no_time
          Format json
          Time_Keep On
          Time_Key time
          Time_Format %Y-%m-%dT%H:%M:%S.%L
    fluent-bit.conf: |
      [SERVICE]
          Daemon Off
          Flush 1
          Log_Level info
          Parsers_File parsers.conf
          Parsers_File custom_parsers.conf
          HTTP_Server On
          HTTP_Listen 0.0.0.0
          HTTP_Port 2020
          Health_Check On

      [INPUT]
          Name tail
          Path /var/log/containers/*.log
          multiline.parser docker, cri
          Tag kube.*
          Mem_Buf_Limit 50MB
          Skip_Long_Lines Off

      [INPUT]
          Name systemd
          Tag host.*
          Systemd_Filter _SYSTEMD_UNIT=kubelet.service
          Read_From_Tail On

      [FILTER]
          Name kubernetes
          Match kube.*
          Merge_Log On
          Keep_Log Off
          K8S-Logging.Parser On
          K8S-Logging.Exclude On

      [OUTPUT]
          Name es
          Match kube.*
          Host            ${FLUENT_ELASTICSEARCH_HOST}
          Port            ${FLUENT_ELASTICSEARCH_PORT}
          HTTP_User       ${FLUENT_ELASTICSEARCH_USER}
          HTTP_Passwd     ${FLUENT_ELASTICSEARCH_PASSWD}
          tls             On
          tls.verify      On
          tls.ca_file     ${FLUENT_ELASTICSEARCH_TLSCA}
          tls.crt_file    ${FLUENT_ELASTICSEARCH_TLSCRT}
          tls.key_file    ${FLUENT_ELASTICSEARCH_TLSKEY}
          Logstash_Format On
          Suppress_Type_Name On
          Retry_Limit     False
          
      [OUTPUT]
          Name es
          Match host.*
          Host            ${FLUENT_ELASTICSEARCH_HOST}
          Port            ${FLUENT_ELASTICSEARCH_PORT}
          HTTP_User       ${FLUENT_ELASTICSEARCH_USER}
          HTTP_Passwd     ${FLUENT_ELASTICSEARCH_PASSWD}
          tls             On
          tls.verify      On
          tls.ca_file     ${FLUENT_ELASTICSEARCH_TLSCA}
          tls.crt_file    ${FLUENT_ELASTICSEARCH_TLSCRT}
          tls.key_file    ${FLUENT_ELASTICSEARCH_TLSKEY}
          Logstash_Format On
          Suppress_Type_Name On
          Logstash_Prefix node
          Retry_Limit False

devanshbaffle avatar Feb 28 '23 21:02 devanshbaffle

I am having this issue with the AWS fluent-bit fork. I filed a bug with them that has a fully reproducible example with all configs here: https://github.com/aws/aws-for-fluent-bit/issues/568

I also tested those configs with fluent/fluent-bit:2.0.9 and was able to reproduce it.

In both cases, if I comment out the two "parser" filters in my filters.conf, then this issue does not happen (but of course my messages are then unparsed).

rectalogic avatar Mar 01 '23 18:03 rectalogic

I was able to fix my issue with Replace_Dots On in my config file. Apparently, it has been required since Elasticsearch 2.0.x, but I am running the latest Elasticsearch 8.5.x.

I hope this helps someone.

devanshbaffle avatar Mar 01 '23 18:03 devanshbaffle

I tried Replace_Dots On in the opensearch output but got the same results: two of the messages make it to OpenSearch (like before, but now with dots replaced) and two still fail with "failed to flush chunk".

rectalogic avatar Mar 01 '23 18:03 rectalogic

@rectalogic could you put together a minimal reproduction case? If you do please ping me and I'll try to fix it.

leonardo-albertovich avatar Mar 01 '23 19:03 leonardo-albertovich

@leonardo-albertovich see https://github.com/aws/aws-for-fluent-bit/issues/568 - it has all the config files and the docker command to run to reproduce. You just need to change the -e OPENSEARCH_FQDN=search.play-test.internal.aws.cureatr.com in that docker command to point to your OpenSearch instance.

Also if you change the image in that command from public.ecr.aws/aws-observability/aws-for-fluent-bit:init-2.31.3 to fluent/fluent-bit:2.0.9 you can still reproduce the issue.

rectalogic avatar Mar 01 '23 21:03 rectalogic