Segfault whenever an `on_conf_destroy` callback is registered

Open · Quuxplusone opened this issue 2 years ago · 1 comment

$ cd examples
$ cat >consumer.c <<EOF
#include <assert.h>
#include <stdio.h>
#include "rdkafka.h"

rd_kafka_resp_err_t callback(void *opaque) {
        printf("callback called with %p\n", opaque);
        return RD_KAFKA_RESP_ERR_NO_ERROR;
}

int main(int argc, char **argv) {
        rd_kafka_conf_res_t err; /* librdkafka API error code */
        char errstr[512];        /* librdkafka API error reporting buffer */

        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        err = rd_kafka_conf_set(conf, "security.protocol", "ssl", errstr, sizeof(errstr));
        assert(err == RD_KAFKA_CONF_OK);

        err = rd_kafka_conf_set(conf, "ssl.ca.location", "/dev/null", errstr, sizeof(errstr));
        assert(err == RD_KAFKA_CONF_OK);

        rd_kafka_resp_err_t rk_err = rd_kafka_conf_interceptor_add_on_conf_destroy(conf, "testing", callback, NULL);
        assert(rk_err == RD_KAFKA_RESP_ERR_NO_ERROR);

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        assert(rk == NULL);
        fprintf(stderr, "%% Failed to create new consumer: %s\n", errstr);

        // Since rk==NULL, the conf is still owned by us, and must be cleaned up to avoid leaks.
        rd_kafka_conf_destroy(conf);

        return 0;
}
EOF
$ make consumer
gcc -g -O2 -fPIC -Wall -Wsign-compare [...] -ldl -lpthread
$ ./consumer
% Failed to create new consumer: ssl.ca.location failed: No further error information available
Segmentation fault: 11

[...]
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100094950 consumer`rd_kafka_interceptors_on_conf_destroy + 48
    frame #1: 0x000000010002ac9f consumer`rd_kafka_conf_destroy + 15
    frame #2: 0x00000001000030f9 consumer`main(argc=<unavailable>, argv=<unavailable>) at consumer.c:30:9 [opt]
    frame #3: 0x00000001004b152e dyld`start + 462

The problem seems to be that rd_kafka_interceptors_on_conf_destroy is called twice: first (incorrectly) from the failing rd_kafka_new,

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100094920 consumer`rd_kafka_interceptors_on_conf_destroy
    frame #1: 0x000000010002aa3d consumer`rd_kafka_anyconf_destroy + 29
    frame #2: 0x000000010000442b consumer`rd_kafka_destroy_final + 827
    frame #3: 0x00000001000057a6 consumer`rd_kafka_new + 3062
    frame #4: 0x00000001000030cd consumer`main(argc=<unavailable>, argv=<unavailable>) at consumer.c:25:26 [opt]
    frame #5: 0x00000001004b152e dyld`start + 462

and then a second time (correctly) from rd_kafka_conf_destroy.

It appears that rd_kafka_new sometimes returns NULL without destroying the conf, and sometimes returns NULL after destroying it (namely, on any code path that reaches goto fail). The caller therefore has no way to know whether it still owns the conf and must clean it up.

This is similar to the "should never happen" bug in #4100, but worse.

Checklist

  • [x] librdkafka version (release number or git tag): 0e4b5512831425dc704582748006aae34e3d8228
  • [x] Apache Kafka version: doesn't matter, Kafka needn't be running
  • [x] librdkafka client configuration: as shown above
  • [x] Operating system: macOS, but probably reproduces everywhere
  • [x] Critical issue

Quuxplusone avatar Jan 10 '23 19:01 Quuxplusone

@Quuxplusone Thanks for the report. This comes down to the interceptors being freed here, which leads to the double free. Removing that line means the on_conf_destroy interceptor is invoked when app_conf is destroyed, without touching already-freed memory, and the interceptors are then destroyed by rd_kafka_anyconf_destroy.

emasab avatar Feb 23 '23 13:02 emasab