confluent-kafka-python

Consumer lag increasing issue on `2.1.0` and `2.1.1` versions

Open erhosen opened this issue 1 year ago • 19 comments

Hey, I ran into some weird behaviour and would really appreciate any hints as to why it might be happening!

Description

After updating confluent-kafka from 2.0.2 to 2.1.0, I noticed an increase in consumer lag on production.

[Screenshot: consumer lag graph, 2023-05-11 17:02]

It looked like our consumers were doing their job, but not committing offsets.

Consumer's code:

while True:
    messages = consumer.consume(num_messages=100, timeout=60)
    if messages:
        process_messages(messages)
        try:
            consumer.commit()
        except Exception:
            logger.exception("Failed to commit consumer offset")

The docstring of the commit() function says

The message and offsets parameters are mutually exclusive. If neither is set, the current partition assignment's offsets are used instead. Use this method to commit offsets if you have 'enable.auto.commit' set to False.

My code relies on this behaviour. Has something changed recently?

I didn't manage to reproduce this behaviour locally :shrug:
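
(For reference, a minimal variant of the loop above that also logs the per-message error field and commits synchronously; commit() defaults to asynchronous=True, so the try/except in the snippet above would not see most commit failures at the call site. This is only a sketch, not our production code, and it assumes the same consumer, logger, and process_messages as above.)

from confluent_kafka import KafkaException

while True:
    messages = consumer.consume(num_messages=100, timeout=60)
    if not messages:
        continue
    # Log per-message errors instead of passing error messages to business logic.
    for m in messages:
        if m.error() is not None:
            logger.error("Consume error: %s", m.error())
    ok = [m for m in messages if m.error() is None]
    if not ok:
        continue
    process_messages(ok)
    try:
        consumer.commit(asynchronous=False)  # synchronous: commit errors raise here
    except KafkaException:
        logger.exception("Failed to commit consumer offset")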

Checklist

Please provide the following information:

  • [x] confluent-kafka-python and librdkafka version: (2.1.0 / 2.1.0) and (2.1.1 / 2.1.1)
  • [x] Apache Kafka broker version: 3.3
  • [x] Client configuration:
{
   "enable.auto.commit": False,
   "bootstrap.servers": ...,
   "group.id": ...,
   "client.id": ...,
   "auto.offset.reset": "earliest",
   "security.protocol": "plaintext",
   "ssl.ca.location": ...,
   "ssl.certificate.location": ...,
   "ssl.key.location": ...
}
  • [x] Operating system: debian:bullseye-slim
  • [ ] Provide client logs (with 'debug': '..' as necessary)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

erhosen avatar May 11 '23 16:05 erhosen

Did you get any error? Did downgrading the client fix the problem or it got fixed automatically after some time?

pranavrth avatar May 15 '23 07:05 pranavrth

Did you get any error?

Are you asking about exceptions raised by the client library? If so, no, there were no errors at all (or we didn't log them). I'll check one more time, and if I find anything I'll post it here.

Did downgrading the client fix the problem or it got fixed automatically after some time?

Yes, downgrading the client fixed the issue. We did it twice, when we deployed 2.1.0 and 2.1.1; in both cases we downgraded to 2.0.2 and the lag disappeared.

erhosen avatar May 15 '23 09:05 erhosen

One more detail: we have multiple consumer groups, but only one of them had the problem. [screenshot]

erhosen avatar May 15 '23 11:05 erhosen

Are you asking about exceptions raised by the client library? If so, no, there were no errors at all (or we didn't log them). I'll check one more time, and if I find anything I'll post it here.

I was trying to understand whether there is some error and you are not receiving the messages themselves (I don't know what the process_messages function does). In that case, the message's error field would be populated. Can you check that as well?

Yes, downgrading the client fixed the issue. We did it twice, when we deployed 2.1.0 and 2.1.1; in both cases we downgraded to 2.0.2 and the lag disappeared.

Got it

One more detail: we have multiple consumer groups, but only one of them had the problem.

Are they all reading from the same topic?

pranavrth avatar May 17 '23 08:05 pranavrth

(I don't know what the process_messages function does)

It's business logic: we parse the message and update DB records. The consumer code (posted above) is itself wrapped in a try/except block, and there were no exceptions 🤷🏻

In that case, the message's error field would be populated.

Indeed, we don't look at the error field. Maybe I can log it and deploy one more time. Good thought, thanks!

Are they all reading from the same topic?

No, different topics.

erhosen avatar May 19 '23 14:05 erhosen

Indeed, we don't look at the error field. Maybe I can log it and deploy one more time. Good thought, thanks!

I think this should give us some hint.

Could you also provide debug logs by enabling them with "debug": "all" in the client configuration?
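
For example, a minimal sketch of wiring that debug output into Python logging (bootstrap.servers, group.id and the topic are placeholders; the remaining settings are the ones from the checklist above):

import logging

from confluent_kafka import Consumer

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("librdkafka")

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder
        "group.id": "my-group",                 # placeholder
        "enable.auto.commit": False,
        "auto.offset.reset": "earliest",
        "debug": "all",  # verbose librdkafka debug output; narrow it if too noisy
    },
    logger=logger,  # librdkafka log lines are routed through this Python logger
)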

pranavrth avatar May 23 '23 06:05 pranavrth

Same thing here. I'm experiencing the consumer lag issue with client version 2.1.1, and the message error field is not set. This works fine with version 2.0.2. Note that I am not using a genuine Kafka cluster, but Azure Event Hubs for Apache Kafka.

ochedru avatar May 26 '23 15:05 ochedru

Can you provide steps to reproduce locally?

Can you provide debug logs by enabling them with "debug": "all" in the client configuration?

pranavrth avatar May 30 '23 08:05 pranavrth

I will try this when possible.

This may be of interest: I forgot to mention that our consumers continuously use pause and resume on partitions to process messages in batches with an arbitrary number of worker threads, while keeping message ordering per partition (our method is inspired by this article). It looks like there are some consumer fixes on this subject in 2.1.0: https://github.com/confluentinc/librdkafka/releases/tag/v2.1.0...
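
(Not our actual code, but a rough sketch of that pause/resume batching pattern, simplified so that it waits for all workers before resuming; the config values and process_partition_batch are made-up placeholders.)

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

from confluent_kafka import Consumer, TopicPartition

# Placeholder configuration; enable.auto.commit=False matches the thread above.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "batch-workers",            # placeholder
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])            # placeholder topic

executor = ThreadPoolExecutor(max_workers=8)

def process_partition_batch(batch):
    # Business logic; processes one partition's messages in order.
    for msg in batch:
        pass

while True:
    msgs = consumer.consume(num_messages=500, timeout=1.0)
    if not msgs:
        continue
    # Group messages by partition so each worker preserves per-partition ordering.
    by_partition = defaultdict(list)
    for m in msgs:
        by_partition[(m.topic(), m.partition())].append(m)
    parts = [TopicPartition(t, p) for (t, p) in by_partition]
    consumer.pause(parts)   # stop fetching these partitions while workers run
    futures = [executor.submit(process_partition_batch, batch)
               for batch in by_partition.values()]
    for f in futures:
        f.result()          # re-raises any worker exception
    consumer.commit(asynchronous=False)
    consumer.resume(parts)  # resume fetching; reads continue from the next offsets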

ochedru avatar May 31 '23 14:05 ochedru

Can you provide steps to reproduce locally?

Can you provide debug logs by enabling them with "debug": "all" in the client configuration?

I would still like some way to reproduce this locally, or else your logs.

pranavrth avatar Jun 19 '23 08:06 pranavrth

Same thing here.

With consumer.consume(300000, 60) the consumer works fine for ~10 minutes; with consumer.consume(100000, 60) it works fine for ~2 hours.

After "bug" I got messages in logs like:

cimpl.KafkaException: KafkaError{code=_NO_OFFSET,val=-168,str="Commit failed: Local: No offset stored"}
DEBUG	COMMIT [rdkafka#consumer-1] [thrd:main]: OffsetCommit for 2 partition(s) in join-state steady: manual: returned: Local: No offset stored

config:

'auto.offset.reset': 'earliest',
'session.timeout.ms': 60000,
'max.poll.interval.ms': 300000,
'enable.auto.commit': False,
'isolation.level': 'read_committed',

confluent-kafka-python: 2.1.1 / 2.2.0, Kafka: 3.3.2
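
The _NO_OFFSET error means no offsets had been stored since the previous commit. A minimal sketch of separating that case from other commit failures (assuming the same manual-commit consumer as above):

from confluent_kafka import KafkaError, KafkaException

# `consumer` is the same manually-committing Consumer as in the config above.
try:
    consumer.commit(asynchronous=False)
except KafkaException as e:
    if e.args[0].code() == KafkaError._NO_OFFSET:
        pass  # no offsets stored since the last commit; nothing to commit
    else:
        raise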

silentsokolov avatar Jul 25 '23 10:07 silentsokolov

Apologies, this is unscientific as I don't have a lot of details, but we have moved from confluent-kafka-python < 2.1.0 to 2.1.0 and 2.2.0, and we are seeing single partitions stall.

For example: in a topic with six partitions and two consumers in the same consumer group, one of the partitions will seemingly 'randomly' stall and may not recover for > 12 hours (we normally intervene, so we have not tried leaving it longer). We are observing this behaviour anywhere the number of consumers is less than the number of partitions. (It's possible it also happens when there are as many consumers as partitions; however, we have a time-out feature that automatically restarts the consumer, and this may be obscuring the problem.)

We are using Python 3.10, confluent-kafka-python 2.1.0, and Confluent Cloud.

An example of how this shows up in our monitoring: [screenshot] Each line is a consumer group on the same topic.

In all cases where the lag is increasing, it is due to a single partition in the consumer stalling (I did not check if it was the same one on all consumers). Restarting the consumer or adding a new consumer to the group (I assume triggering a rebalance) solves the stall. This is how all the stalls in the chart above were solved.

NB we are not doing anything 'fancy' with our consumers. These are almost 1:1 from the example documentation, and we are not doing any pausing, manual assignment etc. which might interfere with the consumer group's management.

NickVig avatar Sep 08 '23 20:09 NickVig

The error disappeared with confluent-kafka==2.3.0 🤷🏻

erhosen avatar Nov 16 '23 10:11 erhosen

The error disappeared with confluent-kafka==2.3.0 🤷🏻

Hey bro, so it is safe to use confluent-kafka==2.3.0, right?

dullberry avatar Nov 20 '23 06:11 dullberry

@erhosen would it be okay to close this issue if it was resolved with the latest version?

nhaq-confluent avatar Feb 12 '24 23:02 nhaq-confluent

Yes, thank you :)

erhosen avatar Feb 13 '24 09:02 erhosen

Hey bro, so it is safe to use confluent-kafka==2.3.0, right?

@dullberry you should definitely try it out!

erhosen avatar Feb 13 '24 09:02 erhosen

Still having the problem on 2.3.0

mcfriend99 avatar Feb 29 '24 19:02 mcfriend99

Same problem here: one of the partitions randomly stalls. We have 4 partitions and 4 consumers, which all look alive and well.

romainbossut avatar Mar 06 '24 17:03 romainbossut

@mcfriend99 @romainbossut Please try moving to client v2.4.0, as we have fixed many issues there. If that doesn't solve the problem, please raise a new issue, since the original issue no longer occurs with v2.3.0 according to the author.

Closing the issue.

pranavrth avatar May 21 '24 14:05 pranavrth