Kafka consumer gets stuck after exceeding max.poll.interval.ms

Open im-abeer opened this issue 5 years ago • 38 comments

Description

When the consumer does not receive a message for 5 minutes (the default value of max.poll.interval.ms, 300000ms), it comes to a halt without exiting the program. The consumer process hangs and does not consume any more messages.

The following error message gets logged

MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 255ms (adjust max.poll.interval.ms for long-running message processing): leaving group

I see that ErrMaxPollExceeded is defined here, but I am unable to find where it is raised.

If such an error is raised, why does the program not exit?

Checklist

Please provide the following information:

  • [X] confluent-kafka-go and librdkafka version: confluent-kafka-go master, librdkafka 1.0.0
  • [X] Apache Kafka broker version: v1.1.0
  • [X] Client configuration: { "bootstrap.servers": "my.kafka.host", "group.id": "my.group.id", "auto.offset.reset": "earliest", "enable.auto.commit": false }
  • [ ] Operating system:
  • [ ] Provide client logs (with 'debug': '..' as necessary)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

im-abeer avatar Jun 24 '19 09:06 im-abeer

Unless you are using the channel consumer (which you shouldn't use), you need to call Poll() or ReadMessage() at least every max.poll.interval.ms-1.
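
For illustration, a minimal Poll()-based loop that satisfies this could look like the sketch below (bootstrap servers, group id and topic are placeholders, not values from this issue):

package main

import (
        "log"

        "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
        c, err := kafka.NewConsumer(&kafka.ConfigMap{
                "bootstrap.servers": "localhost:9092", // placeholder
                "group.id":          "example-group",  // placeholder
                "auto.offset.reset": "earliest",
        })
        if err != nil {
                log.Fatal(err)
        }
        defer c.Close()

        if err := c.SubscribeTopics([]string{"example-topic"}, nil); err != nil {
                log.Fatal(err)
        }

        for {
                // Poll with a short timeout so the loop comes back around (and
                // the poll interval is refreshed) far more often than
                // max.poll.interval.ms (300000ms by default).
                ev := c.Poll(100)
                switch e := ev.(type) {
                case *kafka.Message:
                        log.Printf("message on %v: %s", e.TopicPartition, e.Value)
                case kafka.Error:
                        log.Printf("consumer error: %v", e)
                case nil:
                        // No event within the poll timeout; keep looping.
                }
        }
}

ReadMessage() is a thin wrapper around Poll() that returns only messages and errors, so the same rule applies to it.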

edenhill avatar Jun 28 '19 07:06 edenhill

See the "max.poll.interval.ms is enforced" chapter here: https://github.com/edenhill/librdkafka/releases/v1.0.0

edenhill avatar Jun 28 '19 07:06 edenhill

Hello @edenhill, I'm running into a similar issue as the original poster. I'm using a -1 timeout but calling ReadMessage() in an infinite loop, e.g.:

for {
        msg, err := consumer.ReadMessage(-1)
        if err != nil {
                // <do some logging>
                continue
        }
        _ = msg // <process the message>
}

My producer stopped writing messages for a few minutes and I logged this:

Application maximum poll interval (300000ms) exceeded by 164ms

Subsequently my producer was back up but the consumer seemed to be hanging on ReadMessage(-1) indefinitely.

According to the doc you linked to:

requires the application to call rd_kafka_consumer_poll()/rd_kafka_poll() at least every max.poll.interval.ms. Failure to do so will make the consumer automatically leave the group […] and not rejoin the group until the application has called ..poll() again

I’d expect that my consumer did indeed leave the group, but the subsequent call to ReadMessage() should have made the consumer rejoin the group and continue to see new messages.


Is this a configuration issue? Would using a shorter timeout for ReadMessage() resolve this?

Or is this a manifestation of https://github.com/edenhill/librdkafka/issues/2266?

librdkafka version: 1.0.0
confluent-kafka-go version: v1.0.0
Client configuration: { "auto.offset.reset": "earliest", "enable.auto.commit": false }

keyan avatar Aug 09 '19 02:08 keyan

This looks like https://github.com/edenhill/librdkafka/pull/2365/commits/80e9b1ee0c0e477c02f18b130c4302240711a88d which is fixed in librdkafka 1.1.0.

edenhill avatar Aug 12 '19 08:08 edenhill

I am having this issue too. How can I fix it?

csrgxtu avatar Jun 13 '20 08:06 csrgxtu

I am having this issue with librdkafka 1.5.0, exactly as keyan said. Can anyone help?

szalapski avatar Sep 07 '20 01:09 szalapski

Having this issue as well with v1.4.2. Any hints?

rogaha avatar Oct 31 '20 03:10 rogaha

my code:

for {
        select {
        case <-ctx.Done():
                err = kc.Close()
                if err != nil {
                        r.Logger.Error(err)
                        panic(err)
                }
                r.Logger.Info("Done")
                return nil
        default:
                // Note: ReadMessage errors (including timeouts and
                // ErrMaxPollExceeded) are silently dropped here.
                msg, err := kc.ReadMessage(100 * time.Millisecond)
                if err == nil {
                        if err := p.Perform(ctx, msg.Value); err != nil {
                                r.Logger.Error(err)
                        }
                }
        }
}

rogaha avatar Oct 31 '20 03:10 rogaha

To clarify, are you all seeing that your consumer won't rejoin on the subsequent ReadMessage() call? Just seeing the poll interval exceeded message is not abnormal or unexpected.

FWIW, after upgrading to the v1.1.0 client and also changing from a -1 to a sane large timeout, I stopped seeing rejoining issues. But leaving the consumer group still happens, as expected.

keyan avatar Oct 31 '20 14:10 keyan

Using v1.5.2. I'm also calling ReadMessage(-1) in an infinite loop and not seeing the consumer rejoin after it leaves the group. I worked around it by setting the timeout to less than max.poll.interval.ms instead of -1, but I'm wondering why it's not rejoining as expected.
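
For reference, a sketch of that workaround, assuming a consumer c created as in the earlier example and the standard log/time imports (the 10-second timeout is an arbitrary choice; it just needs to stay well below max.poll.interval.ms):

for {
        msg, err := c.ReadMessage(10 * time.Second)
        if err != nil {
                // A timeout only means no message arrived within 10s; anything
                // else is worth logging.
                if kErr, ok := err.(kafka.Error); ok && kErr.Code() == kafka.ErrTimedOut {
                        continue
                }
                log.Printf("read error: %v", err)
                continue
        }
        log.Printf("message on %v: %s", msg.TopicPartition, msg.Value)
}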

jgao54 avatar Dec 15 '20 02:12 jgao54

I'm using v1.5.2 and the issue persists. The worker is "dead" after receiving the "leave group" error:

%4|1614171022.957|MAXPOLL|rdkafka#consumer-10| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 123ms (adjust max.poll.interval.ms for long-running message processing): leaving group
%4|1614171022.960|MAXPOLL|rdkafka#consumer-8| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 481ms (adjust max.poll.interval.ms for long-running message processing): leaving group
%4|1614171022.980|MAXPOLL|rdkafka#consumer-2| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 423ms (adjust max.poll.interval.ms for long-running message processing): leaving group
%4|1614171023.957|MAXPOLL|rdkafka#consumer-3| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 500ms (adjust max.poll.interval.ms for long-running message processing): leaving group
%4|1614171024.447|MAXPOLL|rdkafka#consumer-7| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 84ms (adjust max.poll.interval.ms for long-running message processing): leaving group
%4|1614171025.955|MAXPOLL|rdkafka#consumer-5| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 195ms (adjust max.poll.interval.ms for long-running message processing): leaving group
%4|1614171025.961|MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 55ms (adjust max.poll.interval.ms for long-running message processing): leaving group

Restarting the worker fixes the problem. It seems to be a queue management issue.

rogaha avatar Feb 24 '21 12:02 rogaha

What means of consuming messages are you using? Channel consumer? ReadMessage? Poll?

edenhill avatar Feb 24 '21 13:02 edenhill

ReadMessage

rogaha avatar Feb 24 '21 17:02 rogaha

And you are calling ReadMessage() more often than max.poll.interval.ms (300000ms)?

edenhill avatar Feb 24 '21 17:02 edenhill

yes:

for {
        select {
        case <-ctx.Done():
                err = kc.Close()
                if err != nil {
                        r.Logger.Error(err)
                        panic(err)
                }
                r.Logger.Info("Done")
                return nil
        default:
                msg, err := kc.ReadMessage(100 * time.Millisecond)

rogaha avatar Feb 24 '21 18:02 rogaha

@edenhill Why is this issue closed? I'm seeing the same thing as everyone else, also calling ReadMessage quickly in an infinite loop.

I personally see this issue associated with a broker rebalance. On another thread, prior to the stuck-consumer symptom everyone else reports, I see this:

msg="failed to commit message with error: Broker: Group rebalance in progress" consumer=default

gabrielhartmann avatar Jun 03 '21 21:06 gabrielhartmann

I am having the same issue as well

AdbReverse avatar Jun 07 '21 07:06 AdbReverse

There is something weird in the code. Is it possible that we are constantly getting an ev=nil return value from Poll()? Then the function would be stuck in an infinite loop.

AdbReverse avatar Jun 07 '21 07:06 AdbReverse

Same issue here. If for some reason Poll() has not been called recently, the app receives ErrMaxPollExceeded, as expected. However, once the app resumes calling Poll(), no event is ever returned. The app needs to call consumer.Assign() again to resume processing events.

@edenhill this seems like a bug? It's contrary to what the "max.poll.interval.ms is enforced" chapter here suggests: https://github.com/edenhill/librdkafka/releases/v1.0.0

Failure to do so will make the consumer automatically leave the group, causing a group rebalance, and not rejoin the group until the application has called ..poll() again, triggering yet another group rebalance.

anatolebeuzon avatar Jul 13 '21 09:07 anatolebeuzon

I am having this issue too. How can I fix it?

%4|1627976105.775|MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 29ms (adjust max.poll.interval.ms for long-running message processing): leaving group

GOODLIFEZW avatar Aug 03 '21 07:08 GOODLIFEZW

+1

askie avatar Oct 20 '21 07:10 askie

+1

guotie avatar Dec 01 '21 01:12 guotie

confluent-kafka-go: 1.5.2

guotie avatar Dec 01 '21 01:12 guotie

+1

bry00 avatar Dec 15 '21 12:12 bry00

Please try to reproduce this on the latest release (1.8.2) with debug=cgrp enabled. Thanks

edenhill avatar Dec 15 '21 12:12 edenhill

I got this same issue on 1.8.2. Since it happened in production, I don't have debug=cgrp enabled there. This is what I got last night with 1.8.2:

%4|1639508773.124|MAXPOLL|rdkafka#consumer-4| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 196ms (adjust max.poll.interval.ms for long-running message processing): leaving group

kashifgrocerkey avatar Dec 15 '21 16:12 kashifgrocerkey

Encountering this too... it randomly appears even with a poll of 0.1 ms:

2021-12-27 04:53:05,259 - some error in message: <cimpl.Message object at 0x000001B7B16D9140> code: -147 errorstr: Application maximum poll interval (300000ms) exceeded by 307ms

SoniCoder avatar Dec 27 '21 04:12 SoniCoder

I encounter this error often as well, on librdkafka 1.7.0. I just hope I can detect the error and restart the K8s pod.
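
One way to detect it, as a sketch: this assumes the condition surfaces through ReadMessage() as a kafka.Error with code kafka.ErrMaxPollExceeded (the -147 code shown a few comments above is that error), that a consumer c and the usual log/time imports exist, that handle() is a hypothetical application handler, and that exiting so Kubernetes restarts the pod is an acceptable recovery policy:

for {
        msg, err := c.ReadMessage(10 * time.Second)
        if err != nil {
                kErr, ok := err.(kafka.Error)
                switch {
                case ok && kErr.Code() == kafka.ErrTimedOut:
                        continue // no message within the timeout; keep polling
                case ok && kErr.Code() == kafka.ErrMaxPollExceeded:
                        // The consumer has left the group; exit and let
                        // Kubernetes restart the pod.
                        log.Fatalf("max poll interval exceeded: %v", kErr)
                default:
                        log.Printf("consumer error: %v", err)
                        continue
                }
        }
        handle(msg) // hypothetical application handler
}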

namdiag avatar Jan 23 '22 15:01 namdiag

Have also found this issue happening in production on 1.8.2. It is usually identified when lag starts randomly spiking and pods need to be restarted. What is strange is that in dev/qa environments, which see less traffic and definitely have the potential for longer gaps between messages, I never see this particular error. Only in high-throughput environments...

Also, when it does get stuck, it seems like there are no active members, so no clients are assigned to partitions. Curious if this is the same behavior others are seeing?

afreeland avatar Mar 21 '22 17:03 afreeland

It would be great to get a reproduction with debug=cgrp enabled so we can figure out what is going on. Thanks
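
For anyone trying to capture that, the debug property is set in the consumer configuration; a sketch (only the "debug" key is the point here, the other values are placeholders):

c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers": "localhost:9092", // placeholder
        "group.id":          "example-group",  // placeholder
        "debug":             "cgrp",           // consumer-group debug logging from librdkafka
})
if err != nil {
        log.Fatal(err)
}
defer c.Close()
// ... subscribe and poll as usual; the cgrp debug lines go to stderr by default.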

edenhill avatar Mar 29 '22 13:03 edenhill