confluent-kafka-python
Consumer lag increasing issue on `2.1.0` and `2.1.1` versions
Hey, I ran into some weird behaviour and would appreciate any hints about why it might have happened!
Description
After updating confluent-kafka from 2.0.2 to 2.1.0, I noticed an increase in consumer lag in production.
It looked like our consumers were doing their job, but not committing offsets.
Consumer's code:
while True:
    messages = consumer.consume(num_messages=100, timeout=60)
    if messages:
        process_messages(messages)
    # Commit the current assignment's offsets (enable.auto.commit is False).
    try:
        consumer.commit()
    except Exception:
        logger.exception("Failed to commit consumer offset")
The docstring of the commit() function says:
The message and offsets parameters are mutually exclusive. If neither is set, the current partition assignment's offsets are used instead. Use this method to commit offsets if you have 'enable.auto.commit' set to False.
My code relies on this behaviour. Has something changed recently?
I didn't manage to reproduce this behaviour locally :shrug:
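For reference, a minimal sketch of how I understand the mutually exclusive ways commit() can be invoked (the broker address, group id, topic and offset values below are placeholders, not our real config):

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "example-group",            # placeholder
    "enable.auto.commit": False,
})
consumer.subscribe(["example-topic"])       # placeholder topic

msgs = consumer.consume(num_messages=100, timeout=5)
if msgs:
    # 1) No arguments: commit the offsets stored for the current
    #    partition assignment (the behaviour the loop above relies on).
    consumer.commit(asynchronous=False)

    # 2) Commit based on a single consumed message:
    # consumer.commit(message=msgs[-1], asynchronous=False)

    # 3) Commit explicit offsets:
    # consumer.commit(offsets=[TopicPartition("example-topic", 0, 42)],
    #                 asynchronous=False)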
Checklist
Please provide the following information:
- [x] confluent-kafka-python and librdkafka version: (2.1.0/2.1.0) and (2.1.1/2.1.1)
- [x] Apache Kafka broker version: 3.3
- [x] Client configuration:
{
    "enable.auto.commit": False,
    "bootstrap.servers": ...,
    "group.id": ...,
    "client.id": ...,
    "auto.offset.reset": "earliest",
    "security.protocol": "plaintext",
    "ssl.ca.location": ...,
    "ssl.certificate.location": ...,
    "ssl.key.location": ...
}
- [x] Operating system: debian:bullseye-slim
- [ ] Provide client logs (with 'debug': '..' as necessary)
- [ ] Provide broker log excerpts
- [ ] Critical issue
Did you get any error? Did downgrading the client fix the problem, or did it get fixed automatically after some time?
Did you get any error?
You are asking about raises from the client library? If so, no, there were no errors at all, or we didn't log them. I'll check one more time, and if I find anything I'll post it here.
Did downgrading the client fix the problem, or did it get fixed automatically after some time?
Yes, downgrading the client fixed the issue. We did it twice, when we deployed 2.1.0 and 2.1.1 - in both cases we downgraded to 2.0.2 and the lag disappeared.
One more detail - we have multiple consumer_groups but had a problem with only one of them.
You are asking about raises from the client library? If so, no, there were no errors at all, or we didn't log them. I'll check one more time, and if I find anything I'll post it here.
I was trying to understand whether there is some error and you are not getting the messages themselves (I don't know what the process_messages function does). In that case, the message's error field would be populated. Can you check that as well?
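Something like this minimal sketch could work for that check, reusing the messages, process_messages and logger names from the snippet above:

# Split the batch into real messages and error events before processing.
good, errored = [], []
for msg in messages:
    if msg.error() is not None:
        errored.append(msg)
    else:
        good.append(msg)

for msg in errored:
    # The KafkaError returned by msg.error() has code(), name() and str().
    logger.error("Consume error on %s [%s]: %s",
                 msg.topic(), msg.partition(), msg.error())

if good:
    process_messages(good)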
Yes, downgrading the client fixed the issue. We did it twice, when we deployed 2.1.0 and 2.1.1 - in both cases we downgraded to 2.0.2 and the lag disappeared.
Got it
One more detail - we have multiple consumer_groups but had a problem with only one of them.
Are they all reading from the same topic?
(I don't know what the process_messages function does)
It's business logic: we parse the messages and update DB records. The consumer code (posted above) is itself wrapped in a try/except block, and there were no exceptions 🤷🏻
In that case, the message's error field would be populated
Indeed, we don't look at the error field. Maybe I can log it and deploy one more time. Good thought, thanks!
Are they all reading from the same topic?
No, different topics.
Indeed, we don't look at the error field. Maybe I can log it and deploy one more time. Good thought, thanks!
I think this should give us some hint.
Can you also provide debug logs by enabling them with "debug": "all" in the client configuration?
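Something along these lines should be enough; a minimal sketch assuming placeholder broker/group values, with the client's logs forwarded to a standard Python logger:

import logging

from confluent_kafka import Consumer

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("kafka.debug")

consumer = Consumer(
    {
        "bootstrap.servers": "localhost:9092",  # placeholder
        "group.id": "example-group",            # placeholder
        "enable.auto.commit": False,
        # Verbose client logs; narrow to e.g. "cgrp,fetch,topic"
        # if "all" is too noisy.
        "debug": "all",
    },
    logger=logger,  # forward librdkafka log lines to Python logging
)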
Same thing here. I experience the consumer lag issue with client version 2.1.1, and the message error is not set.
This works fine in version 2.0.2.
Note I am not using a genuine Kafka cluster, but instead Azure Event Hubs for Apache Kafka.
Can you provide steps to reproduce locally?
Can you provide debug logs to us by enabling them with "debug": "all" in the client configuration?
I will try this when possible.
This may be of interest: I forgot to mention that our consumers continuously use pause and resume on partitions to process messages in batches with an arbitrary number of worker threads, while keeping message ordering per partition (our method is inspired by this article).
Looks like there are some consumer fixes on this subject in 2.1.0: https://github.com/confluentinc/librdkafka/releases/tag/v2.1.0...
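Not our production code, but a rough sketch of the pause/resume batching pattern I mean (the topic name, batch size, worker count and the worker body are placeholders):

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

from confluent_kafka import Consumer, TopicPartition


def handle_partition_batch(msgs):
    # Placeholder for per-partition business logic; ordering within a
    # partition is preserved because one worker handles one partition.
    for msg in msgs:
        pass


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "example-group",            # placeholder
    "enable.auto.commit": False,
})
consumer.subscribe(["example-topic"])       # placeholder topic
pool = ThreadPoolExecutor(max_workers=8)

while True:
    batch = [m for m in consumer.consume(num_messages=500, timeout=1.0)
             if m.error() is None]
    if not batch:
        continue

    # Group the batch by partition so each partition gets exactly one worker.
    by_partition = defaultdict(list)
    for m in batch:
        by_partition[(m.topic(), m.partition())].append(m)

    # Pause the partitions being processed so nothing more is fetched for
    # them until the workers finish.
    paused = [TopicPartition(t, p) for t, p in by_partition]
    consumer.pause(paused)

    futures = [pool.submit(handle_partition_batch, msgs)
               for msgs in by_partition.values()]
    for f in futures:
        f.result()

    consumer.resume(paused)
    consumer.commit(asynchronous=False)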
Can you provide steps to reproduce locally?
Can you provide debug logs to us by enabling them with "debug": "all" in the client configuration?
Still, I would like to have some way to reproduce it locally, or your logs.
Same thing here.
With consumer.consume(300000, 60) the consumer works fine for ~10 minutes; with consumer.consume(100000, 60) it works fine for ~2 hours.
After the "bug" I got messages in the logs like:
cimpl.KafkaException: KafkaError{code=_NO_OFFSET,val=-168,str="Commit failed: Local: No offset stored"}
DEBUG COMMIT [rdkafka#consumer-1] [thrd:main]: OffsetCommit for 2 partition(s) in join-state steady: manual: returned: Local: No offset stored
config:
'auto.offset.reset': 'earliest',
'session.timeout.ms': 60000,
'max.poll.interval.ms': 300000,
'enable.auto.commit': False,
'isolation.level': 'read_committed',
confluent-kafka-python: 2.1.1 / 2.2.0
Kafka: 3.3.2
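For context, a minimal sketch of how one could tell this specific error code apart from other commit failures when logging (consumer and logger stand for the objects in our own code):

from confluent_kafka import KafkaError, KafkaException

try:
    consumer.commit(asynchronous=False)
except KafkaException as exc:
    err = exc.args[0]  # the underlying KafkaError
    if err.code() == KafkaError._NO_OFFSET:
        # Nothing was stored since the last commit; logging it makes a
        # partition whose offsets stop being stored visible.
        logger.warning("commit(): no offsets stored to commit")
    else:
        logger.exception("Failed to commit consumer offsets")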
Apologies, this is unscientific as I don't have a lot of details, but we have moved from confluent-kafka-python < 2.1.0 to 2.1.0 and 2.2.0. We are seeing single partitions stalling.
For example: in a topic with six partitions and two consumers in the same consumer group, one of the partitions will seemingly 'randomly' stall and may not restart itself for more than 12 hours (we normally intervene, so we have not tried it for longer). We are observing this behaviour anywhere the number of consumers is less than the number of partitions. (It's possible it is also happening when there is the same number of consumers as partitions; however, we have a time-out feature that will automatically restart the consumer, and this may be obfuscating the problem.)
We are using Python 3.10, confluent-kafka-python 2.1.0, and Confluent Cloud.
An example of how this is showing in our monitoring is here:
Each line is a consumer group on the same topic.
In all cases where the lag is increasing, it is due to a single partition in the consumer stalling (I did not check if it was the same one on all consumers). Restarting the consumer or adding a new consumer to the group (I assume triggering a rebalance) solves the stall. This is how all the stalls in the chart above were solved.
NB we are not doing anything 'fancy' with our consumers. These are almost 1:1 from the example documentation, and we are not doing any pausing, manual assignment etc. which might interfere with the consumer group's management.
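For reference, this is roughly how one could check per-partition lag from inside the client rather than from external monitoring (a sketch only; the 5-second watermark timeout is arbitrary):

from confluent_kafka import Consumer


def log_partition_lag(consumer: Consumer, logger) -> None:
    # Log the lag of every currently assigned partition.
    for tp in consumer.assignment():
        low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
        # position() returns the offset of the next message to be fetched.
        pos = consumer.position([tp])[0].offset
        if pos < 0:  # OFFSET_INVALID before anything was fetched
            pos = low
        logger.info("%s [%d] lag=%d", tp.topic, tp.partition, high - pos)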
The error disappeared with confluent-kafka==2.3.0 🤷🏻
The error disappeared with confluent-kafka==2.3.0 🤷🏻
Hey bro, so it is safe to use confluent-kafka==2.3.0, right?
@erhosen would it be okay to close this issue if it was resolved with the latest version?
Yes, thank you :)
Hey bro, so it is safe to use confluent-kafka==2.3.0, right?
@dullberry you should definitely try it out!
Still having the problem on 2.3.0
Same problem here: one of the partitions randomly stalls. We have 4 partitions and 4 consumers, which all look alive and well.
@mcfriend99 @romainbossut Please try moving to client v2.4.0, as we have fixed many issues. If this doesn't solve it, please raise a new issue, since the original problem is not happening with v2.3.0 according to the author of this issue.
Closing the issue.