fluent-bit
Stackdriver plugin SIGSEGV with workers > 1
Bug Report
The stackdriver plugin crashes with SIGSEGV when the workers option is set to a value greater than 1.
Other users have reported similar issues, which I have collected here: https://github.com/fluent/fluent-bit/issues/5018 https://github.com/fluent/fluent-bit/issues/5048
I have also hit this issue in my own cluster.
The error logs will be similar to:
[2022/03/10 16:11:07] [engine] caught signal (SIGSEGV)
#0 0x55772d8a8552 in __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:88
#1 0x55772d8a857d in mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2 0x55772d8a87fe in flb_kv_item_destroy() at src/flb_kv.c:90
#3 0x55772d8a8853 in flb_kv_release() at src/flb_kv.c:102
#4 0x55772d8aef13 in http_headers_destroy() at src/flb_http_client.c:1002
#5 0x55772d8af96c in flb_http_client_destroy() at src/flb_http_client.c:1328
#6 0x55772d93357b in cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2323
#7 0x55772d87bc0e in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#8 0x55772dd7f166 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
My unconfirmed guess is that flb_http_client_destroy() touches a connection that has already been destroyed by another worker.
[2022/03/15 14:42:27] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4128783 watch_fd=34
[2022/03/15 14:43:44] [engine] caught signal (SIGSEGV)
[2022/03/15 14:43:44] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 400,
"message": "Invalid JSON payload received. Closing quote expected in string.\n\n^",
"status": "INVALID_ARGUMENT"
}
}
#0 0x558f6fda111c in flb_sds_len() at include/fluent-bit/flb_sds.h:50
#1 0x558f6fda2db6 in http_header_push() at src/flb_http_client.c:929
#2 0x558f6fda2feb in http_headers_compose() at src/flb_http_client.c:989
#3 0x558f6fda3439 in flb_http_do() at src/flb_http_client.c:1127
#4 0x558f6fe419d9 in cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2272
#5 0x558f6fd68ef2 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#6 0x558f702f444a in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#7 0xffffffffffffffff in ???() at ???:0
Aborted
To Reproduce
I can't reproduce it locally by running fluent-bit as a process on my machine; I can only reproduce it in my cluster environment.
Oddly, I can't reproduce it in a normal GKE cluster, but I can reproduce it in a Google Anthos cluster.
If you see a similar error and have an easy way to reproduce it, I would appreciate it if you could share the steps here : )
Expected behavior
Fluent-bit running inside a container (Kubernetes environment) should not crash when workers is set to a value greater than 1. Running with multiple workers is crucial for improving fluent-bit's performance.
Screenshots
Your Environment
- Version used: v1.8.12 v1.8.13
- Configuration:
- Environment name and version (e.g. Kubernetes? What version?):
- Server type and version:
- Operating System and version:
- Filters and plugins:
Additional context
GDB logs:
[2022/03/15 17:16:14] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4128783 watch_fd=35
[2022/03/15 17:16:59] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4128785 watch_fd=36 name=/var/log/containers/log-generator-gbdlb_kube-system_log-generator-f38b4f4c72036fee5c7809429ea6b5e436ab2b2a3e0439fd186c50c5e0a2a7f1.log
Thread 8 "flb-out-stackdr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdfbfe700 (LWP 588708)]
je_tcache_bin_flush_small (tsd=<optimized out>, tcache=<optimized out>, tbin=0x7fffdfbf4f10, binind=<optimized out>, rem=<optimized out>)
at /home/jeffluoo/fluent-bit/lib/jemalloc-5.2.1/src/tcache.c:187
Does this occur with the latest 1.9.0 release? That now sets default worker values > 1 hence my concern.
See the issue https://github.com/fluent/fluent-bit/issues/5018 that I linked. They use v1.9.0, and I suggested setting workers to 0, which fixed the issue temporarily.
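For anyone who needs the same temporary workaround, a minimal sketch of the output section in the classic configuration format might look like the following. The Match pattern is an assumption for illustration only; the workers value of 0 is the setting suggested above.

```
[OUTPUT]
    Name    stackdriver
    # Assumption: match all records; replace with your own tag pattern.
    Match   *
    # Workaround from this thread: do not use dedicated output worker threads.
    Workers 0
```

This trades throughput for stability, so it is a stopgap rather than a fix for the underlying crash.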
gentle ping
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
/unstale
I tried v1.9.6 and observed the following error:
[2022/08/11 14:26:08] [ info] [input:tail:tail.1] inotify_fs_add(): inode=117692675 watch_fd=21 name=/var/log/containers/stackdriver-log-forwarder-rp6wf_kube-system_stackdriver-log-forwarder-b8b433b53e4e9b4b86b6b84c286d2acd7d914ba4ceb2886674437312a86807d9.log
[2022/08/11 14:26:35] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 400,
"message": "Invalid JSON payload received. Expected an object key or }.\ntriesla\"}},98:101}\n ^",
"status": "INVALID_ARGUMENT"
}
}
[2022/08/11 14:26:40] [error] [tls] error: error:00000005:lib(0):func(0):DH lib
[2022/08/11 14:26:40] [error] [src/flb_http_client.c:1199 errno=32] Broken pipe
[2022/08/11 14:26:40] [ warn] [output:stackdriver:stackdriver.0] http_do=-1
[2022/08/11 14:26:45] [error] [output:stackdriver:stackdriver.0] error formatting JSON payload
[2022/08/11 14:26:45] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 400,
"message": "Invalid JSON payload received. Expected an object key or }.\n\"namespace_name\":\"\",103:107,101:45}},99:\n ^",
"status": "INVALID_ARGUMENT"
}
}
[2022/08/11 14:26:45] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 400,
"message": "Invalid JSON payload received. Expected an object key or }.\n\"namespace_name\":\"\",103:107,101:45}},\"lo\n ^",
"status": "INVALID_ARGUMENT"
}
}
[2022/08/11 14:26:51] [ warn] unknown time format 5
[2022/08/11 14:26:51] [ warn] unknown time format 5
[2022/08/11 14:26:51] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 400,
"message": "Invalid value at 'labels[3].value' (TYPE_STRING), 404\nInvalid JSON payload received. Unknown name \"stream\": Cannot find field.",
"status": "INVALID_ARGUMENT",
"details": [
{
"@type": "type.googleapis.com/google.rpc.BadRequest",
"fieldViolations": [
{
"field": "labels[3].value",
"description": "Invalid value at 'labels[3].value' (TYPE_STRING), 404"
},
{
"description": "Invalid JSON payload received. Unknown name \"stream\": Cannot find field."
}
]
}
]
}
}
[2022/08/11 14:26:55] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 400,
"message": "Invalid JSON payload received. Unknown name \"stream\": Cannot find field.",
"status": "INVALID_ARGUMENT",
"details": [
{
"@type": "type.googleapis.com/google.rpc.BadRequest",
"fieldViolations": [
{
"description": "Invalid JSON payload received. Unknown name \"stream\": Cannot find field."
}
]
}
]
}
}
[2022/08/11 14:26:57] [engine] caught signal (SIGSEGV)
#0 0x7f3b32031e52 in ???() at 4/multiarch/memmove-vec-unaligned-erms.S:521
#1 0x5619a7ef474f in flb_sds_create_len() at src/flb_sds.c:68
#2 0x5619a7fcef74 in cb_results() at plugins/out_stackdriver/stackdriver.c:1013
#3 0x5619a7f2c8d1 in cb_onig_named() at src/flb_regex.c:46
#4 0x5619a808a7ea in i_names() at lib/onigmo/regparse.c:563
#5 0x5619a80b748e in st_general_foreach() at lib/onigmo/st.c:1505
#6 0x5619a80b748e in onig_st_foreach() at lib/onigmo/st.c:1569
#7 0x5619a8090e8c in onig_foreach_name() at lib/onigmo/regparse.c:588
#8 0x5619a7f2cd5d in flb_regex_parse() at src/flb_regex.c:240
#9 0x5619a7fceb04 in extract_resource_labels_from_regex() at plugins/out_stackdriver/stackdriver.c:890
#10 0x5619a7fceb52 in process_local_resource_id() at plugins/out_stackdriver/stackdriver.c:902
#11 0x5619a7fd0ba2 in stackdriver_format() at plugins/out_stackdriver/stackdriver.c:1722
#12 0x5619a7fd27f3 in cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2290
#13 0x5619a7efdc3e in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#14 0x5619a8444a06 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#15 0xffffffffffffffff in ???() at ???:0
Still an issue for v1.9.9:
[2022/10/19 21:23:16] [error] [output:stackdriver:stackdriver.0] error formatting JSON payload
[2022/10/19 21:23:16] [ warn] [engine] failed to flush chunk '1-1666214595.313586418.flb', retry in 9 seconds: task_id=1, input=tail.1 > output=stackdriver.0 (out_id=0)
[2022/10/19 21:23:17] [engine] caught signal (SIGSEGV)
#0 0x7f76821ad5f3 in ???() at ???:0
#1 0x7f76821afd0b in ???() at ???:0
#2 0x7f76821b0aa4 in ???() at ???:0
#3 0x7f76821c3872 in ???() at ???:0
#4 0x55ecee6ab43c in tls_net_write() at src/tls/openssl.c:440
#5 0x55ecee6abbb6 in flb_tls_net_write_async() at src/tls/flb_tls.c:278
#6 0x55ecee6b9013 in flb_io_net_write() at src/flb_io.c:421
#7 0x55ecee6bb5a6 in flb_http_do() at src/flb_http_client.c:1183
#8 0x55ecee755233 in cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2343
#9 0x55ecee67fd57 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#10 0x55eceebc4f46 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#11 0xffffffffffffffff in ???() at ???:0
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.