Increased number of `rd_kafka_cgrp_terminated` assertion failures with 2.5.0 and shutdown stability degradation
Description
I do not yet know the reason, but after upgrading librdkafka to 2.5.0 (with no other changes) the Karafka ecosystem CI crashes more often with:
```
*** rdkafka_cgrp.c:3312:rd_kafka_cgrp_terminated: assert: !rd_kafka_assignment_in_progress(rkcg->rkcg_rk) ***
Aborted (core dumped)
```
on shutdown, and the consumer destroy also hangs. Neither happened with 2.4.0.
https://github.com/confluentinc/librdkafka/blob/6eaf89fb124c421b66b43b195879d458a3a31f86/src/rdkafka_cgrp.c#L3312
I recall a similar problem in a different issue a while ago, where the solution was to gracefully unsubscribe the consumer prior to shutdown. I wonder if that would mitigate this as well :thinking:
update: I can now trigger segfaults. I am not yet sure exactly why, but at least I can crash it on my machine.
How to reproduce
So far I have been unable to reproduce it in an isolated environment, and my stress tests on my test setup do not show this behavior. However, all of my previous reports (including fixed ones) would always have some specs failing on valid issues. I will keep investigating and will provide more details when available.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- [x] librdkafka version (release number or git tag): 2.5.0 (not happening with 2.4.0)
- [x] Apache Kafka version: confluentinc/cp-kafka:7.6.1
- [x] librdkafka client configuration: absolute defaults
- [x] Operating system: ubuntu-latest from GitHub Actions CI shared runner
- [ ] Provide logs (with `debug=..` as necessary) from librdkafka (will be provided, trying to repro with logs) - I cannot at this stage because it only crashes in CI, which runs without log collection (will work on this)
- [ ] Provide broker log excerpts - same as above. When CI crashes, the VM is shut down. I will try introducing a mode with heavy debug tracing.
- [ ] Critical issue
After adding an unsubscribe invocation prior to shutdown, it no longer seems to cause any issues (as of now).
I still don't know what the root cause is. I mitigated it by issuing an unsubscribe, waiting for it to finish (it does not always finish), and then shutting down after a short wait. With that in place the issue does not emerge.
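For reference, the mitigation is roughly equivalent to the following librdkafka shutdown sequence. This is a sketch, not the actual code (which lives in the Ruby layer); the poll loop bounds and timeouts are illustrative assumptions, and the "short wait" is a heuristic rather than a guaranteed synchronization point.

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch of the mitigation: unsubscribe first, give the revocation time
 * to settle, then close and destroy. Loop count and poll timeout are
 * illustrative assumptions, not tuned values. */
static void graceful_shutdown(rd_kafka_t *rk) {
        rd_kafka_resp_err_t err;

        /* Step 1: leave the subscription so partitions are revoked
         * gracefully instead of being torn down mid-assignment. */
        err = rd_kafka_unsubscribe(rk);
        if (err)
                fprintf(stderr, "unsubscribe failed: %s\n",
                        rd_kafka_err2str(err));

        /* Step 2: keep polling briefly so rebalance callbacks are served
         * and the revoke can complete. This bounded wait is the heuristic
         * part of the mitigation. */
        for (int i = 0; i < 50; i++) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* Step 3: regular consumer close + destroy. */
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}
```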
> update: I can now trigger segfaults. Not sure yet exactly why but at least I can crash it on my machine.
@mensfeld that's great, is it possible to gather debug logs from the run? Not sure if it's related to 2.5.0 or just something that happens with some particular timing.
The assertion suggests that after entering the `RD_KAFKA_CGRP_STATE_TERM` state one of these counters was incremented:
```c
return rk->rk_consumer.wait_commit_cnt > 0 ||
       rk->rk_consumer.assignment.wait_stop_cnt > 0 ||
       rk->rk_consumer.assignment.pending->cnt > 0 ||
       rk->rk_consumer.assignment.queried->cnt > 0 ||
       rk->rk_consumer.assignment.removed->cnt > 0;
```
Maybe from the logs it will be possible to detect what happens.
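To capture the relevant state transitions, the consumer could be run with group-focused debug contexts enabled (these are standard librdkafka `debug` contexts; exact selection here is a suggestion):

```
# librdkafka client configuration for tracing group/assignment state
debug=cgrp,consumer,assignor
log_level=7
```

That should show which of the counters above is still non-zero when the group reaches the TERM state.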
Thank you @emasab, I will dive deeper into this in the upcoming weeks. So far I have mitigated it as described above and I do not see it happening in my rather extensive test suite. When I have some time I will roll back those stability fixes and try to crash it with logs.
I didn't have time yet, but I see the same in 2.6.0 once every few weeks in production. So far I have not been able to reproduce it reliably :(