
[out_splunk] SIGSEGV with specific chunks

Open kiyutink opened this issue 1 year ago • 2 comments

Bug Report

Describe the bug

We're running a setup in which fluent-bit uses a forward input with filesystem buffering. Periodically a chunk lands on the filesystem that crashes fluent-bit when it is read. As a result, fluent-bit crashes on every restart as it tries to replay this chunk from the backlog. We haven't been able to reproduce the issue reliably (apart from loading the faulty chunk directly, in which case it crashes). Unfortunately, I can't share the chunk because it contains customer data.

Here's the log output we see when it crashes:

[2024/06/21 10:16:25] [engine] caught signal (SIGSEGV)
#0  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#1  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#2  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#3  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#4  0x55ae116e533a      in  flb_msgpack_to_json() at src/flb_pack.c:768
#5  0x55ae116e5457      in  flb_msgpack_raw_to_json_sds() at src/flb_pack.c:808
#6  0x55ae117e5bc3      in  splunk_format() at plugins/out_splunk/splunk.c:500
#7  0x55ae117e6424      in  cb_splunk_flush() at plugins/out_splunk/splunk.c:658
#8  0x55ae11c03ae6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0xffffffffffffffff  in  ???() at ???:0

Your Environment

  • Version used: 3.0.6
  • Configuration:
    [SERVICE]
        HTTP_Server On
        Health_Check On

        Storage.max_chunks_up 512
        Storage.backlog.mem_limit 100M

        Storage.path /var/log/flb-storage/

        Storage.sync normal
        Storage.metrics On
    [INPUT]
        Name Forward
        Storage.type filesystem

    [OUTPUT]
        Name Splunk
        Match *
        Host <our host>
        Port 443
        Splunk_Token ${SPLUNK_TOKEN}
        TLS On
        TLS.Verify On
        Event_index <index>
        Event_sourcetype fluentd

        Retry_Limit False
        Storage.total_limit_size 10GB


  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes (but we can also reproduce locally if we offload the same chunk). We're using EKS Kubernetes v1.28
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context

We tried editing the chunk to determine which specific lines in it might cause the crash, but we were unable to modify the chunk in a way that kept it readable.

Please advise on what we could try to help solve this.

We have a strong suspicion that this appeared after we bumped to v3. The only somewhat relevant code change we found was here: https://github.com/fluent/fluent-bit/pull/8589/files

kiyutink avatar Jun 21 '24 12:06 kiyutink

Can you try the latest 3.0.7?

patrick-stephens avatar Jun 21 '24 13:06 patrick-stephens

Can you try the latest 3.0.7?

  • Same behavior in 3.0.7
  • Same behavior in 3.0.0
  • Doesn't crash in 2.2.3

kiyutink avatar Jun 24 '24 15:06 kiyutink

Please upload the chunks that help reproduce the issue here (if you have sensitive data, we can keep it private and share it through a different channel).

edsiper avatar Jul 16 '24 03:07 edsiper

Fluent Bit Version: 2.1.2
Kubernetes Version: 1.30

Hello @edsiper, we are facing the same issue intermittently on our end as well. The restart frequency is high: the fluent-bit pod crashes at least 5-6 times almost every day with the logs below:

[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=26361793 watch_fd=11120
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [ info] [input:tail:tail.3] inotify_fs_remove(): inode=26361793 watch_fd=11115
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [ warn] [filter:kubernetes:kubernetes.0] invalid pattern for given tag kube.devtroncd.bp-dcd-cas-bp-dcd-aws-cluster-autoscaler-8588c68b4d-wdj2m.aws-cluster-autoscaler
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [engine] caught signal (SIGSEGV)
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #0  0x563ec7bb2304      in  edata_arena_ind_get() at lib/jemalloc-5.3.0/include/jemalloc/internal/edata.h:258
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #1  0x563ec7bb2304      in  tcache_bin_flush_match() at lib/jemalloc-5.3.0/src/tcache.c:301
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #2  0x563ec7bb2304      in  tcache_bin_flush_impl() at lib/jemalloc-5.3.0/src/tcache.c:434
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #3  0x563ec7bb2304      in  tcache_bin_flush_bottom() at lib/jemalloc-5.3.0/src/tcache.c:519
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #4  0x563ec7bb2304      in  je_tcache_bin_flush_small() at lib/jemalloc-5.3.0/src/tcache.c:529
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #5  0x563ec7bb3699      in  tcache_gc_small() at lib/jemalloc-5.3.0/src/tcache.c:148
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #6  0x563ec7bb5751      in  ???() at lib/jemalloc-5.3.0/src/tcache.c:414
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #7  0x563ec7bb800f      in  je_te_event_trigger() at lib/jemalloc-5.3.0/src/thread_event.c:299
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #8  0x563ec7b4908c      in  te_event_advance() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:287
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #9  0x563ec7b4908c      in  thread_dalloc_event() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:293
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #10 0x563ec7b4908c      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #11 0x563ec7b4908c      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #12 0x563ec7c7d04a      in  msgpack_sbuffer_destroy() at lib/msgpack-c/include/msgpack/sbuffer.h:41
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #13 0x563ec7c80135      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1532
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #14 0x563ec7c7dbff      in  package_content() at src/multiline/flb_ml.c:335
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #15 0x563ec7c7e031      in  process_append() at src/multiline/flb_ml.c:479
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #16 0x563ec7c7e55f      in  ml_append_try_parser() at src/multiline/flb_ml.c:637
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #17 0x563ec7c7e678      in  flb_ml_append_text() at src/multiline/flb_ml.c:679
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #18 0x563ec7dd4a14      in  process_content() at plugins/in_tail/tail_file.c:505
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #19 0x563ec7dd75ad      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1413
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #20 0x563ec7dc1564      in  in_tail_collect_event() at plugins/in_tail/tail.c:328
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #21 0x563ec7dc608d      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:276
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #22 0x563ec7beca14      in  flb_input_collector_fd() at src/flb_input.c:1918
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #23 0x563ec7c29b9a      in  flb_engine_handle_event() at src/flb_engine.c:503
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #24 0x563ec7c29b9a      in  flb_engine_start() at src/flb_engine.c:866
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #25 0x563ec7bce531      in  flb_lib_worker() at src/flb_lib.c:638
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #26 0x7f51a111aea6      in  ???() at ???:0
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #27 0x7f51a09cda2e      in  ???() at ???:0
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #28 0xffffffffffffffff  in  ???() at ???:0

Have you found a fix for this, @kiyutink?

neha130 avatar Aug 02 '24 11:08 neha130

Hi guys, could you please upload any problematic chunk samples you have so we can analyze and correct this issue? If you cannot upload them to a public location, feel free to contact me on the Fluent Slack server to share them privately.

leonardo-albertovich avatar Aug 02 '24 13:08 leonardo-albertovich

I'm working on getting approval to share the chunk, but it might take some time 😓

kiyutink avatar Aug 05 '24 11:08 kiyutink

In the meantime, is there a way I could manipulate the chunk contents to try to locate the specific log line that causes the issue? If I edit the chunk directly, it is considered corrupt (I guess there are checksums).

kiyutink avatar Aug 05 '24 11:08 kiyutink

Yes, there's a checksum. You could either disable it in your test system with no side effects (since you are just injecting one cold chunk) or recalculate it.

Here's what I'd do in this case:

  1. Consult chunkio's documentation for the file structure to ensure that I have a clear understanding of the right offsets and fields I need to touch (checksum, file size, data)
  2. Create a simple python script to individually extract the header and data sections of the chunk
  3. Create a simple python script to create a chunk file per item in the data section
  4. Create a bash script that places one chunk at a time in cold storage, launches fluent-bit with a minimal configuration that routes chunks to the NULL output plugin and exits afterwards, and then checks the return code and/or the cold storage directory to automatically detect which of these chunks breaks the system
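
As a rough illustration of the splitting part of steps 2 and 3, here is a minimal sketch that cuts an already extracted data section into individual records. It assumes the data section is a plain concatenation of MessagePack objects, one per record; locating the data section inside the chunk file still requires the offsets from chunkio's documentation and is not shown. The names `skip_object` and `split_records` are made up for this example, and the walker only follows the MessagePack type bytes without decoding any values.

```python
# Fixed payload sizes (bytes after the type byte) for scalar types.
_FIXED = {
    0xC0: 0, 0xC2: 0, 0xC3: 0,                      # nil, false, true
    0xCA: 4, 0xCB: 8,                               # float32, float64
    0xCC: 1, 0xCD: 2, 0xCE: 4, 0xCF: 8,             # uint8 .. uint64
    0xD0: 1, 0xD1: 2, 0xD2: 4, 0xD3: 8,             # int8 .. int64
    0xD4: 2, 0xD5: 3, 0xD6: 5, 0xD7: 9, 0xD8: 17,   # fixext: type byte + data
}
# Length-prefixed types: type byte -> (size of length field, extra bytes).
_LEN_PREFIXED = {
    0xC4: (1, 0), 0xC5: (2, 0), 0xC6: (4, 0),       # bin 8/16/32
    0xC7: (1, 1), 0xC8: (2, 1), 0xC9: (4, 1),       # ext 8/16/32 (+ type byte)
    0xD9: (1, 0), 0xDA: (2, 0), 0xDB: (4, 0),       # str 8/16/32
}
# Containers: type byte -> (size of count field, objects per element).
_CONTAINERS = {0xDC: (2, 1), 0xDD: (4, 1), 0xDE: (2, 2), 0xDF: (4, 2)}


def _read_uint(buf, pos, size):
    """Read a big-endian unsigned integer, failing on truncated input."""
    if pos + size > len(buf):
        raise ValueError("truncated MessagePack data")
    return int.from_bytes(buf[pos:pos + size], "big")


def skip_object(buf, pos):
    """Return the offset just past the MessagePack object starting at pos."""
    remaining = 1  # objects still to walk; iterative, so deep nesting is fine
    while remaining:
        if pos >= len(buf):
            raise ValueError("truncated MessagePack data")
        tag = buf[pos]
        pos += 1
        remaining -= 1
        if tag <= 0x7F or tag >= 0xE0:      # positive / negative fixint
            pass
        elif tag <= 0x8F:                   # fixmap: key and value per entry
            remaining += 2 * (tag & 0x0F)
        elif tag <= 0x9F:                   # fixarray
            remaining += tag & 0x0F
        elif tag <= 0xBF:                   # fixstr
            pos += tag & 0x1F
        elif tag in _FIXED:
            pos += _FIXED[tag]
        elif tag in _LEN_PREFIXED:
            size, extra = _LEN_PREFIXED[tag]
            pos += size + extra + _read_uint(buf, pos, size)
        elif tag in _CONTAINERS:
            size, per_item = _CONTAINERS[tag]
            remaining += per_item * _read_uint(buf, pos, size)
            pos += size
        else:                               # 0xC1 is never used by the format
            raise ValueError("invalid MessagePack type byte 0x%02x" % tag)
        if pos > len(buf):
            raise ValueError("truncated MessagePack data")
    return pos


def split_records(data):
    """Split concatenated MessagePack objects into a list of byte strings."""
    records, pos = [], 0
    while pos < len(data):
        end = skip_object(data, pos)
        records.append(data[pos:end])
        pos = end
    return records
```

Each element returned by `split_records` could then be written out behind a copy of the original header to produce one chunk file per record.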

You can disable checksums and still expect the system to behave properly. You can also set the content length field to zero if the file you are creating is not padded (i.e. it's only as large as required to fit the header and data).
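
The replay loop from step 4 could look roughly like the sketch below, written in Python rather than bash only to stay in one language. Everything specific here is an assumption: `find_crashing_chunks` is a made-up helper, `storage_dir` must be the directory your test configuration actually reads backlog chunks from, and `command` is whatever fluent-bit invocation you use for the minimal NULL-output configuration. The storage directory is wiped on every iteration, so point it at a throwaway location.

```python
import shutil
import subprocess
from pathlib import Path


def find_crashing_chunks(chunk_files, storage_dir, command, timeout=60):
    """Replay each chunk file on its own and return the ones that fail.

    command is the argument list used to launch the process under test
    (hypothetical here; supply your own fluent-bit invocation).
    """
    storage_dir = Path(storage_dir)
    crashing = []
    for chunk in map(Path, chunk_files):
        # Start from an empty directory so the backlog holds a single chunk.
        shutil.rmtree(storage_dir, ignore_errors=True)
        storage_dir.mkdir(parents=True)
        shutil.copy(chunk, storage_dir / chunk.name)
        try:
            result = subprocess.run(command, timeout=timeout)
            # On POSIX the return code is negative if a signal such as
            # SIGSEGV killed the process, so any nonzero value counts.
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            # Assumption: still running after the timeout means no crash.
            failed = False
        if failed:
            crashing.append(chunk)
    return crashing
```

The function returns the paths of the chunk files whose replay ended with a nonzero exit status, which narrows the search to the records inside those files.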

If you have any questions or need some help with those scripts let me know and I'll assist you.

leonardo-albertovich avatar Aug 05 '24 12:08 leonardo-albertovich

Fixed in PR #9194

leonardo-albertovich avatar Aug 12 '24 22:08 leonardo-albertovich

See also #9192.

igorpeshansky avatar Aug 12 '24 23:08 igorpeshansky

Hey @igorpeshansky @leonardo-albertovich @edsiper, I updated the Fluent Bit image as you suggested. However, I'm still encountering a similar issue that causes the pod to restart, as shown below:

[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: [2024/08/27 06:08:19] [engine] caught signal (SIGSEGV)
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #0  0x55821ac35d0f      in  arena_bin_slabs_nonfull_remove() at lib/jemalloc-5.3.0/src/arena.c:587
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #1  0x55821ac35d0f      in  arena_dissociate_bin_slab() at lib/jemalloc-5.3.0/src/arena.c:1311
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #2  0x55821ac35d0f      in  je_arena_dalloc_bin_locked_handle_newly_empty() at lib/jemalloc-5.3.0/src/arena.c:1356
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #3  0x55821ac95d6f      in  arena_dalloc_bin_locked_step() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:524
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #4  0x55821ac95d6f      in  tcache_bin_flush_impl() at lib/jemalloc-5.3.0/src/tcache.c:448
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #5  0x55821ac95d6f      in  tcache_bin_flush_bottom() at lib/jemalloc-5.3.0/src/tcache.c:519
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #6  0x55821ac95d6f      in  je_tcache_bin_flush_small() at lib/jemalloc-5.3.0/src/tcache.c:529
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #7  0x55821ac95e52      in  tcache_gc_small() at lib/jemalloc-5.3.0/src/tcache.c:148
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #8  0x55821ac96901      in  ???() at lib/jemalloc-5.3.0/src/tcache.c:223
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #9  0x55821ac99122      in  je_te_event_trigger() at lib/jemalloc-5.3.0/src/thread_event.c:299
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #10 0x55821ac2b8ac      in  te_event_advance() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:287
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #11 0x55821ac2b8ac      in  thread_dalloc_event() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:293
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #12 0x55821ac2b8ac      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #13 0x55821ac2b8ac      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #14 0x55821ad3632e      in  flb_free() at include/fluent-bit/flb_mem.h:127
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #15 0x55821ad3632e      in  flb_tail_file_is_rotated() at plugins/in_tail/tail_file.c:1663
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #16 0x55821ad2f579      in  in_tail_watcher_callback() at plugins/in_tail/tail.c:328
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #17 0x55821acb714a      in  flb_input_collector_fd() at src/flb_input.c:1970
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #18 0x55821acd0c73      in  flb_engine_handle_event() at src/flb_engine.c:575
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #19 0x55821acd0c73      in  flb_engine_start() at src/flb_engine.c:941
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #20 0x55821acac153      in  flb_lib_worker() at src/flb_lib.c:674
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #21 0x7f55fbbde133      in  ???() at ???:0
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #22 0x7f55fbc5e7db      in  ???() at ???:0
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #23 0xffffffffffffffff  in  ???() at ???:0
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: [2024/08/27 06:08:19] [ info] [input:tail:tail.3] inotify_fs_remove(): inode=72352586 watch_fd=540

neha130 avatar Aug 27 '24 06:08 neha130