fluent-bit
[out_splunk] SIGSEGV with specific chunks
Bug Report
Describe the bug
We run fluent-bit with a forward input and filesystem buffering. Periodically a chunk lands on the filesystem that crashes fluent-bit when it is read. As a result, fluent-bit crashes on every restart as it tries to replay that chunk from the backlog. We haven't been able to reproduce the issue reliably, other than by loading the faulty chunk, which always crashes. Unfortunately I can't share the chunk because it contains customer data.
Here's the log line we're seeing upon the crash:
[2024/06/21 10:16:25] [engine] caught signal (SIGSEGV)
#0 0x55ae116e4ad3 in msgpack2json() at src/flb_pack.c:731
#1 0x55ae116e4ad3 in msgpack2json() at src/flb_pack.c:731
#2 0x55ae116e4ad3 in msgpack2json() at src/flb_pack.c:731
#3 0x55ae116e4ad3 in msgpack2json() at src/flb_pack.c:731
#4 0x55ae116e533a in flb_msgpack_to_json() at src/flb_pack.c:768
#5 0x55ae116e5457 in flb_msgpack_raw_to_json_sds() at src/flb_pack.c:808
#6 0x55ae117e5bc3 in splunk_format() at plugins/out_splunk/splunk.c:500
#7 0x55ae117e6424 in cb_splunk_flush() at plugins/out_splunk/splunk.c:658
#8 0x55ae11c03ae6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0xffffffffffffffff in ???() at ???:0
Your Environment
- Version used: 3.0.6
- Configuration:
[SERVICE]
    HTTP_Server On
    Health_Check On
    Storage.max_chunks_up 512
    Storage.backlog.mem_limit 100M
    Storage.path /var/log/flb-storage/
    Storage.sync normal
    Storage.metrics On
[INPUT]
    Name Forward
    Storage.type filesystem
[OUTPUT]
    Name Splunk
    Match *
    Host <our host>
    Port 443
    Splunk_Token ${SPLUNK_TOKEN}
    TLS On
    TLS.Verify On
    Event_index <index>
    Event_sourcetype fluentd
    Retry_Limit False
    Storage.total_limit_size 10GB
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes on EKS, v1.28 (but we can also reproduce the crash locally with the same chunk).
- Server type and version:
- Operating System and version:
- Filters and plugins:
Additional context
We tried adjusting the chunk to determine which lines specifically in it might cause the crash, but were unable to manipulate the chunks in a way that would make them readable.
Please advise what we could try to help resolve this.
We strongly suspect that this appeared after we bumped to v3. The only somewhat relevant code change we found is here: https://github.com/fluent/fluent-bit/pull/8589/files
Can you try the latest 3.0.7?
- Same behavior in 3.0.7
- Same behavior in 3.0.0
- Doesn't crash in 2.2.3
Please upload the chunks that help reproduce the issue here (if they contain sensitive data, we can keep them private and share them through a different channel).
Fluent Bit Version : 2.1.2
Kubernetes Version: 1.30
Hello @edsiper, we are facing the same issue intermittently as well. The restart frequency is high: the fluent-bit pod crashes at least 5-6 times almost every day, with the logs below:
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=26361793 watch_fd=11120
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [ info] [input:tail:tail.3] inotify_fs_remove(): inode=26361793 watch_fd=11115
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [ warn] [filter:kubernetes:kubernetes.0] invalid pattern for given tag kube.devtroncd.bp-dcd-cas-bp-dcd-aws-cluster-autoscaler-8588c68b4d-wdj2m.aws-cluster-autoscaler
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: [2024/07/25 21:09:16] [engine] caught signal (SIGSEGV)
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #0 0x563ec7bb2304 in edata_arena_ind_get() at lib/jemalloc-5.3.0/include/jemalloc/internal/edata.h:258
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #1 0x563ec7bb2304 in tcache_bin_flush_match() at lib/jemalloc-5.3.0/src/tcache.c:301
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #2 0x563ec7bb2304 in tcache_bin_flush_impl() at lib/jemalloc-5.3.0/src/tcache.c:434
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #3 0x563ec7bb2304 in tcache_bin_flush_bottom() at lib/jemalloc-5.3.0/src/tcache.c:519
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #4 0x563ec7bb2304 in je_tcache_bin_flush_small() at lib/jemalloc-5.3.0/src/tcache.c:529
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #5 0x563ec7bb3699 in tcache_gc_small() at lib/jemalloc-5.3.0/src/tcache.c:148
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #6 0x563ec7bb5751 in ???() at lib/jemalloc-5.3.0/src/tcache.c:414
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #7 0x563ec7bb800f in je_te_event_trigger() at lib/jemalloc-5.3.0/src/thread_event.c:299
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #8 0x563ec7b4908c in te_event_advance() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:287
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #9 0x563ec7b4908c in thread_dalloc_event() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:293
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #10 0x563ec7b4908c in ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #11 0x563ec7b4908c in je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #12 0x563ec7c7d04a in msgpack_sbuffer_destroy() at lib/msgpack-c/include/msgpack/sbuffer.h:41
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #13 0x563ec7c80135 in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1532
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #14 0x563ec7c7dbff in package_content() at src/multiline/flb_ml.c:335
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #15 0x563ec7c7e031 in process_append() at src/multiline/flb_ml.c:479
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #16 0x563ec7c7e55f in ml_append_try_parser() at src/multiline/flb_ml.c:637
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #17 0x563ec7c7e678 in flb_ml_append_text() at src/multiline/flb_ml.c:679
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #18 0x563ec7dd4a14 in process_content() at plugins/in_tail/tail_file.c:505
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #19 0x563ec7dd75ad in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1413
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #20 0x563ec7dc1564 in in_tail_collect_event() at plugins/in_tail/tail.c:328
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #21 0x563ec7dc608d in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:276
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #22 0x563ec7beca14 in flb_input_collector_fd() at src/flb_input.c:1918
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #23 0x563ec7c29b9a in flb_engine_handle_event() at src/flb_engine.c:503
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #24 0x563ec7c29b9a in flb_engine_start() at src/flb_engine.c:866
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #25 0x563ec7bce531 in flb_lib_worker() at src/flb_lib.c:638
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #26 0x7f51a111aea6 in ???() at ???:0
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #27 0x7f51a09cda2e in ???() at ???:0
[Jul 26 2024 02:39:16 GMT+0530] bp-flb-bp-utils-fluent-bit-jwg4v: #28 0xffffffffffffffff in ???() at ???:0
Have you got any fixes for the same, @kiyutink?
Hi guys, could you please upload any problematic chunk samples you have so we can analyze and fix this issue? If you cannot upload them to a public location, feel free to contact me on the Fluent Slack server to share them privately.
I'm working on getting the chunk to be approved to be shared, but might take some time 😓
In the meantime, is there a way I could manipulate the chunk contents to try to locate the specific log line that causes the issue? If I modify the chunk directly, it is considered corrupt (I guess there are checksums).
Yes, there's a checksum. You can either disable it in your test system, which has no side effects since you are just injecting one cold chunk, or recalculate it.
Here's what I'd do in this case:
- Consult chunkio's documentation for the file structure to make sure I clearly understand the offsets and fields I need to touch (checksum, file size, data)
- Create a simple Python script to extract the header and data sections of the chunk individually
- Create a simple Python script to create one chunk file per item in the data section
- Create a bash script that places one chunk at a time in cold storage, launches fluent-bit with a minimal configuration that routes chunks to the NULL output plugin and then exits, and checks the return code and/or the cold storage directory to automatically detect which of these chunks breaks the system
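The last step above could also be driven from Python rather than bash. The following is a minimal sketch, not a tested harness: `crashes`, `find_bad_chunks`, the `bisect.conf` file name and the backlog directory shown in the usage comment are hypothetical, and it assumes a POSIX system, where a process killed by a signal reports a negative return code. A run that is still alive after the timeout is treated as a chunk that did not crash.

```python
import shutil
import subprocess
from pathlib import Path


def crashes(cmd, timeout=30.0):
    """Run cmd and return True if the process was terminated by a signal."""
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child on timeout; a process that kept
        # running this long is counted as "did not crash".
        return False
    # On POSIX a negative return code means "killed by signal -returncode".
    return proc.returncode < 0


def find_bad_chunks(candidates, backlog_dir, cmd):
    """Try each candidate chunk alone in backlog_dir and collect the crashers."""
    bad = []
    for chunk in candidates:
        # Start from an empty backlog so only one chunk is replayed per run.
        for old in backlog_dir.iterdir():
            if old.is_file():
                old.unlink()
        shutil.copy(chunk, backlog_dir / chunk.name)
        if crashes(cmd):
            bad.append(chunk)
    return bad


# Hypothetical usage; the paths and the config file name are assumptions:
#   bad = find_bad_chunks(
#       sorted(Path("split-chunks").iterdir()),
#       Path("/var/log/flb-storage/<stream directory>"),
#       ["fluent-bit", "-c", "bisect.conf"],
#   )
```

Keeping the command as a parameter makes it possible to dry-run the loop with a stand-in process before pointing it at fluent-bit.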
You can disable checksums and expect the system to behave properly. You can also set the content length field to zero if the file you are creating is not padded (i.e. it's only as large as required to fit the header and data).
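As an illustration of the header handling described above, here is a minimal Python sketch that writes one unpadded chunk file per record with the content length field zeroed. The offsets (`HEADER_SIZE`, `CONTENT_LEN_OFFSET`, `META_LEN_OFFSET`) are assumptions about the chunkio layout and must be checked against the chunkio documentation for your version; `skip_msgpack` and `split_chunk` are hypothetical helper names. It also assumes the data section is a plain concatenation of msgpack objects, and it leaves the stored checksum untouched, so it only makes sense with checksum validation disabled.

```python
import struct
from pathlib import Path

# ASSUMPTIONS about the chunkio file layout; verify each one against the
# chunkio documentation before relying on this.
HEADER_SIZE = 24         # assumed fixed header size, metadata follows it
CONTENT_LEN_OFFSET = 10  # assumed 4-byte big-endian content length field
META_LEN_OFFSET = 22     # assumed 2-byte big-endian metadata length field

# Payload sizes of fixed-width msgpack scalars (float32/64, uint8-64, int8-64).
_SCALAR = {0xCA: 4, 0xCB: 8, 0xCC: 1, 0xCD: 2, 0xCE: 4, 0xCF: 8,
           0xD0: 1, 0xD1: 2, 0xD2: 4, 0xD3: 8}
# Width of the length prefix for bin 8/16/32 and str 8/16/32.
_LEN_PREFIX = {0xC4: 1, 0xC5: 2, 0xC6: 4, 0xD9: 1, 0xDA: 2, 0xDB: 4}


def skip_msgpack(buf, pos):
    """Return the offset just past the msgpack object starting at pos."""
    b = buf[pos]
    pos += 1
    if b <= 0x7F or b >= 0xE0 or b in (0xC0, 0xC2, 0xC3):  # fixint, nil, bool
        return pos
    if 0xA0 <= b <= 0xBF:                                  # fixstr
        return pos + (b & 0x1F)
    if b in _SCALAR:
        return pos + _SCALAR[b]
    if b in _LEN_PREFIX:                                   # bin / str
        w = _LEN_PREFIX[b]
        return pos + w + int.from_bytes(buf[pos:pos + w], "big")
    if 0xC7 <= b <= 0xC9:                                  # ext 8/16/32
        w = 1 << (b - 0xC7)
        return pos + w + 1 + int.from_bytes(buf[pos:pos + w], "big")
    if 0xD4 <= b <= 0xD8:                                  # fixext 1..16
        return pos + 1 + (1 << (b - 0xD4))
    if 0x80 <= b <= 0x9F:                                  # fixmap / fixarray
        n = b & 0x0F
        is_map = b <= 0x8F
    elif 0xDC <= b <= 0xDF:                                # array/map 16/32
        w = 2 if b in (0xDC, 0xDE) else 4
        n = int.from_bytes(buf[pos:pos + w], "big")
        pos += w
        is_map = b >= 0xDE
    else:
        raise ValueError(f"invalid msgpack type byte 0x{b:02x}")
    # Recursive: very deeply nested records may need a higher recursion limit.
    for _ in range(n * 2 if is_map else n):
        pos = skip_msgpack(buf, pos)
    return pos


def split_chunk(path, out_dir):
    """Write one chunk file per top-level msgpack record; return the count."""
    raw = path.read_bytes()
    meta_len = struct.unpack_from(">H", raw, META_LEN_OFFSET)[0]
    data_start = HEADER_SIZE + meta_len
    content_len = struct.unpack_from(">I", raw, CONTENT_LEN_OFFSET)[0]
    # Assumed: a zero content length means the data runs to the end of file.
    data = raw[data_start:data_start + content_len] if content_len else raw[data_start:]

    # Reuse the original header and metadata, but zero the content length
    # because every output file is exactly header + metadata + one record.
    prefix = bytearray(raw[:data_start])
    prefix[CONTENT_LEN_OFFSET:CONTENT_LEN_OFFSET + 4] = b"\x00" * 4

    out_dir.mkdir(parents=True, exist_ok=True)
    pos = count = 0
    while pos < len(data):
        end = skip_msgpack(data, pos)
        if end > len(data):
            raise ValueError(f"truncated record at data offset {pos}")
        name = f"{path.stem}-{count:05d}{path.suffix}"
        (out_dir / name).write_bytes(bytes(prefix) + data[pos:end])
        pos = end
        count += 1
    return count


# Hypothetical usage: split_chunk(Path("faulty-chunk"), Path("split-chunks"))
```

The record boundaries are found with a small pure-Python msgpack skipper so the script has no third-party dependencies; a msgpack library could be swapped in for that part.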
If you have any questions or need some help with those scripts let me know and I'll assist you.
Fixed in PR #9194
See also #9192.
Hey @igorpeshansky @leonardo-albertovich @edsiper, I updated the Fluent Bit image as you suggested. However, I'm still encountering a similar issue that causes the pod to restart, as shown below:
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: [2024/08/27 06:08:19] [engine] caught signal (SIGSEGV)
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #0 0x55821ac35d0f in arena_bin_slabs_nonfull_remove() at lib/jemalloc-5.3.0/src/arena.c:587
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #1 0x55821ac35d0f in arena_dissociate_bin_slab() at lib/jemalloc-5.3.0/src/arena.c:1311
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #2 0x55821ac35d0f in je_arena_dalloc_bin_locked_handle_newly_empty() at lib/jemalloc-5.3.0/src/arena.c:1356
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #3 0x55821ac95d6f in arena_dalloc_bin_locked_step() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:524
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #4 0x55821ac95d6f in tcache_bin_flush_impl() at lib/jemalloc-5.3.0/src/tcache.c:448
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #5 0x55821ac95d6f in tcache_bin_flush_bottom() at lib/jemalloc-5.3.0/src/tcache.c:519
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #6 0x55821ac95d6f in je_tcache_bin_flush_small() at lib/jemalloc-5.3.0/src/tcache.c:529
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #7 0x55821ac95e52 in tcache_gc_small() at lib/jemalloc-5.3.0/src/tcache.c:148
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #8 0x55821ac96901 in ???() at lib/jemalloc-5.3.0/src/tcache.c:223
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #9 0x55821ac99122 in je_te_event_trigger() at lib/jemalloc-5.3.0/src/thread_event.c:299
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #10 0x55821ac2b8ac in te_event_advance() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:287
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #11 0x55821ac2b8ac in thread_dalloc_event() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:293
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #12 0x55821ac2b8ac in ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #13 0x55821ac2b8ac in je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #14 0x55821ad3632e in flb_free() at include/fluent-bit/flb_mem.h:127
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #15 0x55821ad3632e in flb_tail_file_is_rotated() at plugins/in_tail/tail_file.c:1663
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #16 0x55821ad2f579 in in_tail_watcher_callback() at plugins/in_tail/tail.c:328
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #17 0x55821acb714a in flb_input_collector_fd() at src/flb_input.c:1970
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #18 0x55821acd0c73 in flb_engine_handle_event() at src/flb_engine.c:575
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #19 0x55821acd0c73 in flb_engine_start() at src/flb_engine.c:941
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #20 0x55821acac153 in flb_lib_worker() at src/flb_lib.c:674
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #21 0x7f55fbbde133 in ???() at ???:0
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #22 0x7f55fbc5e7db in ???() at ???:0
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: #23 0xffffffffffffffff in ???() at ???:0
[Aug 27 2024 11:38:19 GMT+0530] delh-mumbai-fluent-del-utils-mum-fluent-bit-rtmbg: [2024/08/27 06:08:19] [ info] [input:tail:tail.3] inotify_fs_remove(): inode=72352586 watch_fd=540