Increased number of `rd_kafka_cgrp_terminated` assertion failures with 2.5.0 and shutdown stability degradation
Description
I do not yet know the reason, but after upgrading librdkafka to 2.5.0 (with no other changes) the Karafka ecosystem CI crashes more often with:
```
*** rdkafka_cgrp.c:3312:rd_kafka_cgrp_terminated: assert: !rd_kafka_assignment_in_progress(rkcg->rkcg_rk) ***
Aborted (core dumped)
```
on shutdown, and the consumer destroy also hangs. Neither happened with 2.4.0.
https://github.com/confluentinc/librdkafka/blob/6eaf89fb124c421b66b43b195879d458a3a31f86/src/rdkafka_cgrp.c#L3312
I recall a similar problem in a different issue a while ago, where the solution was to gracefully unsubscribe the consumer prior to shutdown. I wonder if that would mitigate this as well :thinking:
update: I can now trigger segfaults. I am not yet sure exactly why, but at least I can crash it on my machine.
How to reproduce
So far I have been unable to reproduce it in an isolated environment, and my stress tests on my test setup do not show this behavior. However, all of my previous reports (including fixed ones) would always have some specs failing on valid issues. I will keep investigating and will provide more details when available.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- [x] librdkafka version (release number or git tag): 2.5.0 (not happening with 2.4.0)
- [x] Apache Kafka version: confluentinc/cp-kafka:7.6.1
- [x] librdkafka client configuration: absolute defaults
- [x] Operating system: ubuntu-latest from GitHub Actions CI shared runner
- [ ] Provide logs (with `debug=..` as necessary) from librdkafka (will be provided, trying to repro with logs) - I cannot at this stage because it only crashes in CI, which runs without log collection (will work on this)
- [ ] Provide broker log excerpts - same as above. When CI crashes, the VM is shut down. I will try introducing a mode with heavy debug tracing.
- [ ] Critical issue
After adding an unsubscribe invocation prior to shutdown, it no longer seems to cause any issues (as of now).
I still don't know what the root cause is. I mitigated it by issuing an unsubscribe, waiting for it to finish (it does not always finish), and then shutting down after a short wait. With that in place the issue does not emerge.
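For reference, the mitigation is roughly equivalent to the following librdkafka shutdown sequence. This is a sketch, not the actual code (which lives in the Ruby layer); the poll loop bounds and timeouts are illustrative assumptions, and the "short wait" is a heuristic rather than a guaranteed synchronization point.

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch of the mitigation: unsubscribe first, give the revocation time
 * to settle, then close and destroy. Loop count and poll timeout are
 * illustrative assumptions, not tuned values. */
static void graceful_shutdown(rd_kafka_t *rk) {
        rd_kafka_resp_err_t err;

        /* Step 1: leave the subscription so partitions are revoked
         * gracefully instead of being torn down mid-assignment. */
        err = rd_kafka_unsubscribe(rk);
        if (err)
                fprintf(stderr, "unsubscribe failed: %s\n",
                        rd_kafka_err2str(err));

        /* Step 2: keep polling briefly so rebalance callbacks are served
         * and the revoke can complete. This bounded wait is the heuristic
         * part of the mitigation. */
        for (int i = 0; i < 50; i++) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 100);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }

        /* Step 3: regular consumer close + destroy. */
        rd_kafka_consumer_close(rk);
        rd_kafka_destroy(rk);
}
```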
> update: I can now trigger segfaults. Not sure yet exactly why but at least I can crash it on my machine.
@mensfeld that's great, is it possible to gather debug logs from the run? Not sure if it's related to 2.5.0 or just something that happens with some particular timing.
The assertion suggests that after entering the `RD_KAFKA_CGRP_STATE_TERM` state one of these counters was incremented:
```c
return rk->rk_consumer.wait_commit_cnt > 0 ||
       rk->rk_consumer.assignment.wait_stop_cnt > 0 ||
       rk->rk_consumer.assignment.pending->cnt > 0 ||
       rk->rk_consumer.assignment.queried->cnt > 0 ||
       rk->rk_consumer.assignment.removed->cnt > 0;
```
Maybe from the logs it will be possible to detect what happens.
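To capture the relevant state transitions, the consumer could be run with group-focused debug contexts enabled (these are standard librdkafka `debug` contexts; exact selection here is a suggestion):

```
# librdkafka client configuration for tracing group/assignment state
debug=cgrp,consumer,assignor
log_level=7
```

That should show which of the counters above is still non-zero when the group reaches the TERM state.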
Thank you @emasab, I will dive deeper into this in the upcoming weeks. So far I have mitigated it as described above and I do not see it happening in my rather extensive test suite. When I have some time I will roll back those stability fixes and try to crash it with logs.
I didn't have time yet, but I see the same in 2.6.0 once every few weeks in production. So far I have not been able to reproduce it reliably :(