librdkafka icon indicating copy to clipboard operation
librdkafka copied to clipboard

Consumer group stuck in `Join -> Sync -> SyncError -> Join` loop (`v1.5.3`)

Open ArneSchulze opened this issue 3 years ago • 1 comments

Description

Hi, I need some help investigating weird behavior occurring in one of the older librdkafka versions (specifically librdkafka:1.5.3). We are using librdkafka as part of node-rdkafka (a Node.js binding to librdkafka). Unfortunately we currently can not upgrade node-rdkafka to a more recent version and thus also can not try out a more recent librdkafka version.

All of the consumer groups we are using in our deployment are sometimes experiencing weird rebalancing issues, which lead to a halt in processing for minutes up to hours if the affected service isn't scaled down to a single instance (only one consumer in the consumer group). For the vast majority of the time the service works as intended. Rebalancings are happening fast and as expected, but every once in a while our consumers get stuck in a Join -> Sync -> SyncError -> Join loop that lasts for hours up to days and can only be avoided when scaling the number of consumers in the consumer group down to one instance. Even if all consumers are taken offline at the same time for multiple minutes and then started again, they still get caught in the loop. After a couple of hours the issue usually miraculously disappears.

When investigating the issue further, enabling full debug output in the affected consumers, we took the followings snapshot of the logs (I removed potentially sensitive information but tried to keep relations, e.g. <broker-1> is always the same broker node id):

16.09+0100 OUT [thrd:main]: Group "<consumer-group-id>": "range" assignor run for 6 member(s)
16.09+0100 OUT [thrd:main]: Group "<consumer-group-id>" changed join state wait-metadata -> wait-sync (v1, state up)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 0+60427/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 60427+22021/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 82448+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 98832+16648/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 115480+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 131864+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 148248+16780/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 165028+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 181412+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 197796+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 214180+25170/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 239350+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 255734+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 272118+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 288502+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 304886+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 321270+29563/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 350833+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 367217+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 383601+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 399985+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 416369+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 432753+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 449137+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 465521+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 481905+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 498289+19768/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 518057+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 534441+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 550825+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 567209+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 583593+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial SyncGroupRequest (v3, 599977+16384/624057 bytes, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent SyncGroupRequest (v3, 624057 bytes @ 616361, CorrId 22)
16.09+0100 OUT [thrd:GroupCoordinator]: Broadcasting state change
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Received SyncGroupResponse (v3, 10 bytes, CorrId 22, rtt 14.71ms)
16.10+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>: SyncGroupRequest failed: Unknown broker error: actions Permanent
16.10+0100 OUT [thrd:main]: SyncGroup response: Unknown broker error (0 bytes of MemberState data)
16.10+0100 OUT [thrd:main]: Group "<consumer-group-id>": synchronization failed: Unknown broker error: rejoining
16.10+0100 OUT [thrd:main]: Group "<consumer-group-id>" changed join state wait-sync -> init (v1, state up)
16.10+0100 OUT [thrd:main]: Group "<consumer-group-id>": join with <topic-number> (1) subscribed topic(s)
16.10+0100 OUT [thrd:main]: consumer join: metadata for wildcard subscription is up to date (5890ms old)
16.10+0100 OUT [thrd:main]: sasl_ssl://<coordinator-1-ip>/<broker-1>: Joining group "<consumer-group-id>" with <topic-number subscribed topic(s)
16.10+0100 OUT [thrd:main]: Group "<consumer-group-id>" changed join state init -> wait-join (v1, state up)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 0+16640/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 16640+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 33024+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 49408+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 65792+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 82176+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 98560+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 114944+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 131328+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 147712+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial JoinGroupRequest (v5, 164096+16384/183400 bytes, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent JoinGroupRequest (v5, 183400 bytes @ 180480, CorrId 23)
16.10+0100 OUT [thrd:GroupCoordinator]: Broadcasting state change
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Received JoinGroupResponse (v5, 550289 bytes, CorrId 23, rtt 576.51ms)
16.10+0100 OUT [thrd:main]: JoinGroup response: GenerationId 34500, Protocol range, LeaderId 8a7298d1-2b18-43ad-6bc3-8b59-b6b00c8e-2cfc-4cc4-9a2a-add88251021f (me), my MemberId 8a7298d1-2b18-43ad-6bc3-8b59-b6b00c8e-2cfc-4cc4-9a2a-add88251021f, 6 members in group: (no error)
16.10+0100 OUT [thrd:main]: Elected leader for group "<consumer-group-id>" with 6 member(s)
16.10+0100 OUT [thrd:main]: Group "<consumer-group-id>": resetting group leader info: JoinGroup response clean-up
16.10+0100 OUT [thrd:main]: Group "<consumer-group-id>" changed join state wait-join -> wait-metadata (v1, state up)
16.10+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>: Request metadata for <topic-number> topic(s): partition assignor
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial MetadataRequest (v4, 0+60670/91633 bytes, CorrId 24)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent partial MetadataRequest (v4, 60670+22143/91633 bytes, CorrId 24)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Sent MetadataRequest (v4, 91633 bytes @ 82813, CorrId 24)
16.10+0100 OUT [thrd:GroupCoordinator]: GroupCoordinator/<broker-1>: Received MetadataResponse (v4, 662227 bytes, CorrId 24, rtt 60.58ms)
16.10+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>: ===== Received metadata (for <topic-number> requested topics): partition assignor =====
16.10+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>: ClusterId: <cluster-id>, ControllerId: 103
16.10+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>: <broker-number> brokers, <topic-number> topics
16.10+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>:   Broker #x/<broker-number>: <broker-1-ip> NodeId <broker-1>
16.11+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>:   Broker #y/<broker-number>: <broker-2-ip> NodeId <broker-2>
16.11+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>:   Broker #z/<broker-number>: <broker-3-ip> NodeId <broker-3>
16.16+0100 OUT [thrd:main]: GroupCoordinator/<broker-1>:   Topic #a/<topic-number>: <topic-name> with <topic-partition-number> partitions

The whole log output is basically the above snippet repeated conutless of times (with additional metadata and assignor output).

I also raised the issue with our kafka service provider before posting this issue, since (from the log output) it seems to me like the server is responding incorrectly. They responded that they couldn't find anything unusual with the kafka cluster and instead suggested to raise the issue with the library we are using. Because the issue itself does not seem to touch the node-rdkafka code, I skipped posting the issue there and directly came here.

Do you have any idea what might be causing this? Or can you point me in any direction where to investigate further?

We are using static consumer groups with SASL and the regex subscription feature.

How to reproduce

Unfortunately I haven't found out how to reliably trigger the issue yet, but it keeps coming up once in a while.
I would really like to try if the issue still exists on the latest released version of librdkafka, but unfortunately we currently can not upgrade the version we are using.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • [x] librdkafka version (release number or git tag): v1.5.3
  • [x] Apache Kafka version: 2.5.1
  • [x] librdkafka client configuration: <fetch.wait.max.ms=100,enable.auto.commit=false,queued.max.messages.kbytes=61440,group.instance.id=...,sasl.mechanism=...,session.timeout.ms=60000,topic.metadata.refresh.interval.ms=30000>
  • [x] Operating system: Debian "bullseye" x64
  • [x] Provide logs (with debug=.. as necessary) from librdkafka
  • [ ] Provide broker log excerpts (Unfortunately I can't)
  • [ ] Critical issue

ArneSchulze avatar Jun 01 '22 07:06 ArneSchulze

librdkafka v1.5.3 is quite old and there have been a large number of fixes since then. Additionally, we only provide free support for the latest version and thus ask you to reproduce this on v1.8.2 (or master branch).

edenhill avatar Jun 01 '22 09:06 edenhill