
unexpected segmentation fault in sasl_client_new

Open litao3rd opened this issue 6 months ago • 0 comments

Hi librdkafka team,

I am encountering unexpected terminations (segmentation faults, signal 11) in a long-running C++ application that statically links against librdkafka and its dependencies. These terminations occur intermittently without any preceding error messages in our application logs.

I have captured a coredump of one such instance, and the backtrace from GDB is as follows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `${cmd}'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000289a000 in tcmalloc::Static::pageheap_ ()
Missing separate debuginfos, use: debuginfo-install passivedns-0.0.245-1.x86_64
(gdb) bt
#0  0x000000000289a000 in tcmalloc::Static::pageheap_ ()
#1  0x0000000001c7ebde in _sasl_global_getopt ()
#2  0x0000000001c7d072 in sasl_client_new ()
#3  0x0000000001ab2148 in rd_kafka_sasl_cyrus_client_new ()
#4  0x00000000019253c0 in rd_kafka_sasl_client_new ()
#5  0x00000000019941a4 in rd_kafka_broker_connect_auth.part.0 ()
#6  0x0000000001995825 in rd_kafka_broker_handle_SaslHandshake ()
#7  0x000000000197af74 in rd_kafka_buf_callback ()
#8  0x0000000001999107 in rd_kafka_recv ()
#9  0x000000000197c698 in rd_kafka_transport_io_event.constprop.0 ()
#10 0x000000000197d2cc in rd_kafka_transport_io_serve ()
#11 0x000000000199f10c in rd_kafka_broker_ops_io_serve ()
#12 0x000000000199f62d in rd_kafka_broker_consumer_serve ()
#13 0x000000000199fd69 in rd_kafka_broker_serve ()
#14 0x00000000019a0235 in rd_kafka_broker_thread_main ()
#15 0x0000000001932b87 in _thrd_wrapper_function ()
#16 0x00007fa284cceea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fa28397596d in clone () from /lib64/libc.so.6

The backtrace suggests the crash occurs in the SASL authentication path within librdkafka: frame #0 resolves to the tcmalloc symbol tcmalloc::Static::pageheap_, reached from _sasl_global_getopt() via sasl_client_new().

To help us debug this fatal error, could you please provide insights into:

  1. Potential causes of this segmentation fault within the SASL authentication flow. Are there any known issues or common misconfigurations that could lead to it?
  2. Recommended debugging steps specific to this backtrace. Are there any librdkafka configuration properties or logging levels that would provide more context?
  3. Best practices for handling such errors in a long-running application. While we aim to resolve the root cause, are there any mechanisms within librdkafka, or general C++ practices, that would let us handle errors occurring inside the library without the entire process crashing?
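
As a point of reference for question 2, librdkafka has a `debug` configuration property that takes a comma-separated list of debug contexts. The fragment below is only a sketch: it assumes that the `security` and `broker` contexts are the relevant ones for the SASL handshake path shown in the backtrace, which has not been confirmed for this crash.

```properties
# Assumption: these two contexts cover the SASL/authentication code path.
# "all" is also accepted but is far more verbose.
debug=security,broker
```

The same property can be set programmatically through rd_kafka_conf_set() before the client instance is created.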

We are using a statically linked build of librdkafka. If there's any specific information about our build environment or Kafka broker configuration that would be helpful, please let us know.

Version information

cyrus-sasl   2.1.28
krb5         1.21.1
librdkafka   2.8.0
tcmalloc     2.16 (integrated with gperftools)

Thank you for your time and assistance.

litao3rd, Apr 22 '25 05:04