confluent-kafka-python

Consumer lag increasing issue on `2.1.0` and `2.1.1` versions

Open erhosen opened this issue 1 year ago • 19 comments

Hey, I ran into some weird behaviour and would really appreciate any hints as to why it might be happening!

Description

After updating confluent-kafka from 2.0.2 to 2.1.0, I noticed an increase in consumer lag on production.

[Screenshot: consumer lag graph, 2023-05-11 17:02]

It looked like our consumers were doing their job, but not committing offsets.

Consumer's code:

while True:
    messages = consumer.consume(num_messages=100, timeout=60)
    if messages:
        process_messages(messages)
        try:
            consumer.commit()
        except Exception:
            logger.exception("Failed to commit consumer offset")

The docstring of the commit() function says

The message and offsets parameters are mutually exclusive. If neither is set, the current partition assignment's offsets are used instead. Use this method to commit offsets if you have 'enable.auto.commit' set to False.

My code relies on this behaviour. Has something changed recently?

I didn't manage to reproduce this behaviour locally :shrug:
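
(For reference, a minimal variant of the loop above that also logs the per-message error field and commits synchronously; commit() defaults to asynchronous=True, so the try/except in the snippet above would not see most commit failures at the call site. This is only a sketch, not our production code, and it assumes the same consumer, logger, and process_messages as above.)

from confluent_kafka import KafkaException

while True:
    messages = consumer.consume(num_messages=100, timeout=60)
    if not messages:
        continue
    # Log per-message errors instead of passing error messages to business logic.
    for m in messages:
        if m.error() is not None:
            logger.error("Consume error: %s", m.error())
    ok = [m for m in messages if m.error() is None]
    if not ok:
        continue
    process_messages(ok)
    try:
        consumer.commit(asynchronous=False)  # synchronous: commit errors raise here
    except KafkaException:
        logger.exception("Failed to commit consumer offset")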

Checklist

Please provide the following information:

  • [x] confluent-kafka-python and librdkafka version: (2.1.0 / 2.1.0) and (2.1.1 / 2.1.1)
  • [x] Apache Kafka broker version: 3.3
  • [x] Client configuration:
{
   "enable.auto.commit": False,
   "bootstrap.servers": ...,
   "group.id": ...,
   "client.id": ...,
   "auto.offset.reset": "earliest",
   "security.protocol": "plaintext",
   "ssl.ca.location": ...,
   "ssl.certificate.location": ...,
   "ssl.key.location": ...
}
  • [x] Operating system: debian:bullseye-slim
  • [ ] Provide client logs (with 'debug': '..' as necessary)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

erhosen avatar May 11 '23 16:05 erhosen

Did you get any error? Did downgrading the client fix the problem or it got fixed automatically after some time?

pranavrth avatar May 15 '23 07:05 pranavrth

Did you get any error?

Are you asking about exceptions raised by the client library? If so, no, there were no errors at all (or we didn't log them). I'll check one more time, and if I find anything I'll post it here.

Did downgrading the client fix the problem or it got fixed automatically after some time?

Yes, downgrading the client fixed the issue. We did it twice, when we deployed 2.1.0 and 2.1.1; in both cases we downgraded to 2.0.2 and the lag disappeared.

erhosen avatar May 15 '23 09:05 erhosen

One more detail: we have multiple consumer groups, but only one of them had the problem. [screenshot]

erhosen avatar May 15 '23 11:05 erhosen

Are you asking about exceptions raised by the client library? If so, no, there were no errors at all (or we didn't log them). I'll check one more time, and if I find anything I'll post it here.

I was trying to understand whether there is some error and you are not receiving the messages themselves (I don't know what the process_messages function does). In that case, the message's error field would be populated. Can you check that as well?

Yes, downgrading the client fixed the issue. We did it twice, when we deployed 2.1.0 and 2.1.1; in both cases we downgraded to 2.0.2 and the lag disappeared.

Got it

One more detail: we have multiple consumer groups, but only one of them had the problem.

Are they all reading from the same topic?

pranavrth avatar May 17 '23 08:05 pranavrth

(I don't know what the process_messages function does)

It's business logic: we parse the message and update DB records. The consumer code (posted above) is itself wrapped in a try/except block, and there were no exceptions 🤷🏻

In that case, the message's error field would be populated.

Indeed, we don't look at the error field. Maybe I can log it and deploy one more time. Good thought, thanks!

Are they all reading from the same topic?

No, different topics.

erhosen avatar May 19 '23 14:05 erhosen

Indeed, we don't look at the error field. Maybe I can log it and deploy one more time. Good thought, thanks!

I think this should give us some hint.

Could you also provide debug logs by enabling them with "debug": "all" in the client configuration?
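
For example, a minimal sketch of wiring that debug output into Python logging (bootstrap.servers, group.id and the topic are placeholders; the remaining settings are the ones from the checklist above):

import logging

from confluent_kafka import Consumer

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("librdkafka")

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder
        "group.id": "my-group",                 # placeholder
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
        "debug": "all",  # verbose librdkafka debug output; narrow it if too noisy
    },
    logger=logger,  # librdkafka log lines are routed through this Python logger
)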

pranavrth avatar May 23 '23 06:05 pranavrth

Same thing here. I'm experiencing the consumer lag issue with client version 2.1.1, and the message error field is not set. This works fine with version 2.0.2. Note that I am not using a genuine Kafka cluster, but Azure Event Hubs for Apache Kafka.

ochedru avatar May 26 '23 15:05 ochedru

Can you provide steps to reproduce locally?

Can you provide debug logs by enabling them with "debug": "all" in the client configuration?

pranavrth avatar May 30 '23 08:05 pranavrth

I will try this when possible.

This may be of interest: I forgot to mention that our consumers continuously use pause and resume on partitions to process messages in batches with an arbitrary number of worker threads, while keeping message ordering per partition (our method is inspired by this article). It looks like there are some consumer fixes on this subject in 2.1.0: https://github.com/confluentinc/librdkafka/releases/tag/v2.1.0...
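
(Not our actual code, but a rough sketch of that pause/resume batching pattern, simplified so that it waits for all workers before resuming; the config values and process_partition_batch are made-up placeholders.)

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

from confluent_kafka import Consumer, TopicPartition

# Placeholder configuration; enable.auto.commit=False matches the thread above.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "batch-workers",            # placeholder
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])            # placeholder topic

executor = ThreadPoolExecutor(max_workers=8)

def process_partition_batch(batch):
    # Business logic; processes one partition's messages in order.
    for msg in batch:
        pass

while True:
    msgs = consumer.consume(num_messages=500, timeout=1.0)
    if not msgs:
        continue
    # Group messages by partition so each worker preserves per-partition ordering.
    by_partition = defaultdict(list)
    for m in msgs:
        by_partition[(m.topic(), m.partition())].append(m)
    parts = [TopicPartition(t, p) for (t, p) in by_partition]
    consumer.pause(parts)   # stop fetching these partitions while workers run
    futures = [executor.submit(process_partition_batch, batch)
               for batch in by_partition.values()]
    for f in futures:
        f.result()          # re-raises any worker exception
    consumer.commit(asynchronous=False)
    consumer.resume(parts)  # resume fetching; reads continue from the next offsets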

ochedru avatar May 31 '23 14:05 ochedru

Can you provide steps to reproduce locally?

Can you provide debug logs by enabling them with "debug": "all" in the client configuration?

I would still like some way to reproduce this locally, or else your logs.

pranavrth avatar Jun 19 '23 08:06 pranavrth

Same thing here.

With consumer.consume(300000, 60) the consumer works fine for ~10 minutes; with consumer.consume(100000, 60) it works fine for ~2 hours.

After "bug" I got messages in logs like:

cimpl.KafkaException: KafkaError{code=_NO_OFFSET,val=-168,str="Commit failed: Local: No offset stored"}
DEBUG	COMMIT [rdkafka#consumer-1] [thrd:main]: OffsetCommit for 2 partition(s) in join-state steady: manual: returned: Local: No offset stored

config:

'auto.offset.reset': 'earliest',
'session.timeout.ms': 60000,
'max.poll.interval.ms': 300000,
'enable.auto.commit': False,
'isolation.level': 'read_committed',

confluent-kafka-python: 2.1.1 / 2.2.0, Kafka: 3.3.2
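
The _NO_OFFSET error means no offsets had been stored since the previous commit. A minimal sketch of separating that case from other commit failures (assuming the same manual-commit consumer as above):

from confluent_kafka import KafkaError, KafkaException

# `consumer` is the same manually-committing Consumer as in the config above.
try:
    consumer.commit(asynchronous=False)
except KafkaException as e:
    if e.args[0].code() == KafkaError._NO_OFFSET:
        pass  # no offsets stored since the last commit; nothing to commit
    else:
        raise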

silentsokolov avatar Jul 25 '23 10:07 silentsokolov

Apologies, this is unscientific as I don't have a lot of details, but we have moved from confluent-kafka-python < 2.1.0 to 2.1.0 and 2.2.0, and we are seeing single partitions stall.

For example: in a topic with six partitions and two consumers in the same consumer group, one of the partitions will seemingly 'randomly' stall and may not recover for > 12 hours (we normally intervene, so we have not tried leaving it longer). We are observing this behaviour anywhere the number of consumers is less than the number of partitions. (It's possible it also happens when there are as many consumers as partitions; however, we have a time-out feature that automatically restarts the consumer, and this may be obscuring the problem.)

We are using Python 3.10, confluent-kafka-python 2.1.0, and Confluent Cloud.

An example of how this shows up in our monitoring: [screenshot] Each line is a consumer group on the same topic.

In all cases where the lag is increasing, it is due to a single partition in the consumer stalling (I did not check if it was the same one on all consumers). Restarting the consumer or adding a new consumer to the group (I assume triggering a rebalance) solves the stall. This is how all the stalls in the chart above were solved.

NB we are not doing anything 'fancy' with our consumers. These are almost 1:1 from the example documentation, and we are not doing any pausing, manual assignment etc. which might interfere with the consumer group's management.

NickVig avatar Sep 08 '23 20:09 NickVig

The error disappeared with confluent-kafka==2.3.0 🤷🏻

erhosen avatar Nov 16 '23 10:11 erhosen

The error disappeared with confluent-kafka==2.3.0 🤷🏻

Hey bro, so it is safe to use confluent-kafka==2.3.0, right?

dullberry avatar Nov 20 '23 06:11 dullberry

@erhosen would it be okay to close this issue if it was resolved with the latest version?

nhaq-confluent avatar Feb 12 '24 23:02 nhaq-confluent

Yes, thank you :)

erhosen avatar Feb 13 '24 09:02 erhosen

Hey bro, so it is safe to use confluent-kafka==2.3.0, right?

@dullberry you should definitely try it out!

erhosen avatar Feb 13 '24 09:02 erhosen

Still having the problem on 2.3.0

mcfriend99 avatar Feb 29 '24 19:02 mcfriend99

Same problem here: one of the partitions randomly stalls. We have 4 partitions and 4 consumers, which all look alive and well.

romainbossut avatar Mar 06 '24 17:03 romainbossut

@mcfriend99 @romainbossut Please try moving to client v2.4.0, as we have fixed many issues there. If that doesn't solve the problem, please raise a new issue, since the original issue no longer occurs with v2.3.0 according to the author.

Closing the issue.

pranavrth avatar May 21 '24 14:05 pranavrth