Stackdriver plugin SIGSEGV with workers > 1

Open JeffLuoo opened this issue 3 years ago • 7 comments

Bug Report

The stackdriver output plugin crashes with SIGSEGV when the workers option is set to a value greater than 1.

Other users have reported similar issues, which I have collected here: https://github.com/fluent/fluent-bit/issues/5018 https://github.com/fluent/fluent-bit/issues/5048

I have also hit this issue in my own cluster.

The error logs will be similar to:

[2022/03/10 16:11:07] [engine] caught signal (SIGSEGV)
#0  0x55772d8a8552      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:88
#1  0x55772d8a857d      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x55772d8a87fe      in  flb_kv_item_destroy() at src/flb_kv.c:90
#3  0x55772d8a8853      in  flb_kv_release() at src/flb_kv.c:102
#4  0x55772d8aef13      in  http_headers_destroy() at src/flb_http_client.c:1002
#5  0x55772d8af96c      in  flb_http_client_destroy() at src/flb_http_client.c:1328
#6  0x55772d93357b      in  cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2323
#7  0x55772d87bc0e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#8  0x55772dd7f166      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

My guess, and it is only a guess, is that flb_http_client_destroy() touches a connection that another worker has already destroyed.

[2022/03/15 14:42:27] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4128783 watch_fd=34
[2022/03/15 14:43:44] [engine] caught signal (SIGSEGV)
[2022/03/15 14:43:44] [ warn] [output:stackdriver:stackdriver.0] error
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Closing quote expected in string.\n\n^",
    "status": "INVALID_ARGUMENT"
  }
}

#0  0x558f6fda111c      in  flb_sds_len() at include/fluent-bit/flb_sds.h:50
#1  0x558f6fda2db6      in  http_header_push() at src/flb_http_client.c:929
#2  0x558f6fda2feb      in  http_headers_compose() at src/flb_http_client.c:989
#3  0x558f6fda3439      in  flb_http_do() at src/flb_http_client.c:1127
#4  0x558f6fe419d9      in  cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2272
#5  0x558f6fd68ef2      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#6  0x558f702f444a      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#7  0xffffffffffffffff  in  ???() at ???:0
Aborted

To Reproduce

I can't reproduce it locally by running fluent-bit as a plain process on my machine, but I can reproduce it in my cluster environment.

The odd part is that I can't reproduce it in a normal GKE cluster, but I can reproduce it in a Google Anthos cluster.

If you see a similar error and have an easy way to reproduce it, I would appreciate it if you could share the steps here : )

Expected behavior

Fluent-bit running inside a container (Kubernetes environment) should not crash when workers is set to a value greater than 1. Being able to use multiple workers is crucial for improving fluent-bit's performance.

Screenshots

Your Environment

  • Version used: v1.8.12, v1.8.13
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?):
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context

JeffLuoo avatar Mar 15 '22 14:03 JeffLuoo

GDB logs:

[2022/03/15 17:16:14] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4128783 watch_fd=35
[2022/03/15 17:16:59] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4128785 watch_fd=36 name=/var/log/containers/log-generator-gbdlb_kube-system_log-generator-f38b4f4c72036fee5c7809429ea6b5e436ab2b2a3e0439fd186c50c5e0a2a7f1.log

Thread 8 "flb-out-stackdr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdfbfe700 (LWP 588708)]
je_tcache_bin_flush_small (tsd=<optimized out>, tcache=<optimized out>, tbin=0x7fffdfbf4f10, binind=<optimized out>, rem=<optimized out>)
    at /home/jeffluoo/fluent-bit/lib/jemalloc-5.2.1/src/tcache.c:187

JeffLuoo avatar Mar 15 '22 17:03 JeffLuoo

Does this occur with the latest 1.9.0 release? That now sets default worker values > 1 hence my concern.

patrick-stephens avatar Mar 16 '22 10:03 patrick-stephens

See the issue https://github.com/fluent/fluent-bit/issues/5018 that I linked above. They use v1.9.0, and I suggested setting workers to 0, which fixed the issue temporarily.
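
For reference, the workers-to-0 workaround might look roughly like the following in a classic-format configuration file. This is only a sketch: the Match pattern is an assumption, and the credentials and resource settings a real deployment needs are omitted. Only the Workers line reflects the suggestion made in this thread.

```
[OUTPUT]
    # Assumed match pattern; adjust to your own tags.
    Name     stackdriver
    Match    *
    # Workaround from this thread: disable dedicated output workers.
    Workers  0
```

With workers set to 0 the output no longer flushes on dedicated worker threads, which trades throughput for stability until the underlying crash is fixed.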

JeffLuoo avatar Mar 16 '22 14:03 JeffLuoo

gentle ping

JeffLuoo avatar Apr 10 '22 16:04 JeffLuoo

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jul 10 '22 02:07 github-actions[bot]

/unstale

JeffLuoo avatar Jul 11 '22 13:07 JeffLuoo

I tried v1.9.6 and observed the following errors:

[2022/08/11 14:26:08] [ info] [input:tail:tail.1] inotify_fs_add(): inode=117692675 watch_fd=21 name=/var/log/containers/stackdriver-log-forwarder-rp6wf_kube-system_stackdriver-log-forwarder-b8b433b53e4e9b4b86b6b84c286d2acd7d914ba4ceb2886674437312a86807d9.log
[2022/08/11 14:26:35] [ warn] [output:stackdriver:stackdriver.0] error
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Expected an object key or }.\ntriesla\"}},98:101}\n                    ^",
    "status": "INVALID_ARGUMENT"
  }
}

[2022/08/11 14:26:40] [error] [tls] error: error:00000005:lib(0):func(0):DH lib
[2022/08/11 14:26:40] [error] [src/flb_http_client.c:1199 errno=32] Broken pipe
[2022/08/11 14:26:40] [ warn] [output:stackdriver:stackdriver.0] http_do=-1
[2022/08/11 14:26:45] [error] [output:stackdriver:stackdriver.0] error formatting JSON payload
[2022/08/11 14:26:45] [ warn] [output:stackdriver:stackdriver.0] error
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Expected an object key or }.\n\"namespace_name\":\"\",103:107,101:45}},99:\n                    ^",
    "status": "INVALID_ARGUMENT"
  }
}

[2022/08/11 14:26:45] [ warn] [output:stackdriver:stackdriver.0] error
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Expected an object key or }.\n\"namespace_name\":\"\",103:107,101:45}},\"lo\n                    ^",
    "status": "INVALID_ARGUMENT"
  }
}

[2022/08/11 14:26:51] [ warn] unknown time format 5
[2022/08/11 14:26:51] [ warn] unknown time format 5
[2022/08/11 14:26:51] [ warn] [output:stackdriver:stackdriver.0] error
{
  "error": {
    "code": 400,
    "message": "Invalid value at 'labels[3].value' (TYPE_STRING), 404\nInvalid JSON payload received. Unknown name \"stream\": Cannot find field.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "field": "labels[3].value",
            "description": "Invalid value at 'labels[3].value' (TYPE_STRING), 404"
          },
          {
            "description": "Invalid JSON payload received. Unknown name \"stream\": Cannot find field."
          }
        ]
      }
    ]
  }
}

[2022/08/11 14:26:55] [ warn] [output:stackdriver:stackdriver.0] error
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Unknown name \"stream\": Cannot find field.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "description": "Invalid JSON payload received. Unknown name \"stream\": Cannot find field."
          }
        ]
      }
    ]
  }
}

[2022/08/11 14:26:57] [engine] caught signal (SIGSEGV)
#0  0x7f3b32031e52      in  ???() at 4/multiarch/memmove-vec-unaligned-erms.S:521
#1  0x5619a7ef474f      in  flb_sds_create_len() at src/flb_sds.c:68
#2  0x5619a7fcef74      in  cb_results() at plugins/out_stackdriver/stackdriver.c:1013
#3  0x5619a7f2c8d1      in  cb_onig_named() at src/flb_regex.c:46
#4  0x5619a808a7ea      in  i_names() at lib/onigmo/regparse.c:563
#5  0x5619a80b748e      in  st_general_foreach() at lib/onigmo/st.c:1505
#6  0x5619a80b748e      in  onig_st_foreach() at lib/onigmo/st.c:1569
#7  0x5619a8090e8c      in  onig_foreach_name() at lib/onigmo/regparse.c:588
#8  0x5619a7f2cd5d      in  flb_regex_parse() at src/flb_regex.c:240
#9  0x5619a7fceb04      in  extract_resource_labels_from_regex() at plugins/out_stackdriver/stackdriver.c:890
#10 0x5619a7fceb52      in  process_local_resource_id() at plugins/out_stackdriver/stackdriver.c:902
#11 0x5619a7fd0ba2      in  stackdriver_format() at plugins/out_stackdriver/stackdriver.c:1722
#12 0x5619a7fd27f3      in  cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2290
#13 0x5619a7efdc3e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#14 0x5619a8444a06      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#15 0xffffffffffffffff  in  ???() at ???:0

JeffLuoo avatar Aug 11 '22 14:08 JeffLuoo

This is still an issue in v1.9.9:

[2022/10/19 21:23:16] [error] [output:stackdriver:stackdriver.0] error formatting JSON payload
[2022/10/19 21:23:16] [ warn] [engine] failed to flush chunk '1-1666214595.313586418.flb', retry in 9 seconds: task_id=1, input=tail.1 > output=stackdriver.0 (out_id=0)
[2022/10/19 21:23:17] [engine] caught signal (SIGSEGV)
#0  0x7f76821ad5f3      in  ???() at ???:0
#1  0x7f76821afd0b      in  ???() at ???:0
#2  0x7f76821b0aa4      in  ???() at ???:0
#3  0x7f76821c3872      in  ???() at ???:0
#4  0x55ecee6ab43c      in  tls_net_write() at src/tls/openssl.c:440
#5  0x55ecee6abbb6      in  flb_tls_net_write_async() at src/tls/flb_tls.c:278
#6  0x55ecee6b9013      in  flb_io_net_write() at src/flb_io.c:421
#7  0x55ecee6bb5a6      in  flb_http_do() at src/flb_http_client.c:1183
#8  0x55ecee755233      in  cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2343
#9  0x55ecee67fd57      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#10 0x55eceebc4f46      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#11 0xffffffffffffffff  in  ???() at ???:0

JeffLuoo avatar Oct 19 '22 21:10 JeffLuoo

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jan 18 '23 02:01 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Jan 24 '23 02:01 github-actions[bot]