confluent-kafka-dotnet
Metadata refresh blocking consumer
Description
When doing some evaluation of moving to KRaft mode, we discovered an issue when a consumer attempts to refresh topic metadata more frequently than roughly every 15 seconds. When the refresh happens, the consumer blocks for ~10 seconds, then consumes all of the missed messages and immediately blocks again for another ~10 seconds. This continues indefinitely if the refresh loop is too tight. If the refresh loop is a little less tight, the same behavior can be seen, but the consumer will periodically break out of the blocking loop and consume as normal until eventually dropping back into the blocking loop again.
We first noticed this in one of our applications where we were subscribing to topics via regex. When running in ZooKeeper mode, the consumer's metadata.max.age.ms setting was not being honored, so we had a background task using the AdminClient to refresh the metadata manually every second (a sketch of that refresh loop is included after the console app below). In ZooKeeper mode this worked as expected, but after switching to KRaft mode the behavior described above appeared. To rule out any outside influences from our application, I wrote the following simple console app and observed the same behavior. I have also dropped into the Docker container and run the console consumer with the metadata.max.age.ms setting and did not observe the behavior seen in the .NET client.
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "foo",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    MetadataMaxAgeMs = 10000,
};

using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
{
    consumer.Subscribe("test-topic-1");
    while (true)
    {
        var consumeResult = consumer.Consume();
        Console.WriteLine($"{consumeResult.Topic}: {consumeResult.Message.Value}");
    }

    // Unreachable while the loop above runs forever.
    consumer.Close();
}
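The background metadata refresh mentioned above was roughly the following (a minimal sketch of the idea, not our exact code; the one-second interval is what we used, the logging is illustrative):

using System;
using System.Threading.Tasks;
using Confluent.Kafka;

var adminConfig = new AdminClientConfig { BootstrapServers = "localhost:9092" };

using var adminClient = new AdminClientBuilder(adminConfig).Build();

// Force a metadata request against the cluster once per second,
// since metadata.max.age.ms was not being honored in our setup.
var refreshTask = Task.Run(async () =>
{
    while (true)
    {
        var metadata = adminClient.GetMetadata(TimeSpan.FromSeconds(5));
        Console.WriteLine($"Refreshed metadata; {metadata.Topics.Count} topics visible.");
        await Task.Delay(TimeSpan.FromSeconds(1));
    }
});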
We are running the Confluent Kafka image in Docker, using the update_run.sh script to override the ZooKeeper checks and run in KRaft mode. Docker Compose file:
services:
  kafka1:
    image: confluentinc/cp-kafka:7.0.0
    build:
      context: .
      dockerfile: Dockerfile
    hostname: kafka1
    ports:
      - 9092:9092
      - 29092:29092
      - 19092:19092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_NODE_ID: 1
      KAFKA_CLUSTER_ID: "vlzHauhfTMq2S7cLtnwe6A"
      KAFKA_PROCESS_ROLES: 'broker,controller'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@kafka1:19092'
      KAFKA_LISTENERS: 'PLAINTEXT_HOST://0.0.0.0:29092,CONTROLLER://0.0.0.0:19092,PLAINTEXT://0.0.0.0:9092'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:29092,PLAINTEXT_HOST://kafka1:9092,CONTROLLER://kafka1:19092
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_DEFAULT_REPLICATION_FACTOR: 1
      KAFKA_MIN_INSYNC_REPLICAS: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
      KAFKA_LOG4J_ROOT_LOGLEVEL: 'ALL'
      # KAFKA_LOG4J_LOGGERS: 'kafka.controller=ALL,kafka.producer.async.DefaultEventHandler=ALL,state.change.logger=ALL'
      KAFKA_JMX_PORT: 9102
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_LOG_DIRS: '/tmp/kraft-combined-logs'
      KAFKA_HEAP_OPTS: '-Xmx1G -Xms1G'
    volumes:
      # - kafka1-data:/tmp/kraft-combined-logs
      - ./update_run.sh:/tmp/update_run.sh
    command: "bash -c 'if [ ! -f /tmp/update_run.sh ]; then echo \"ERROR: Did you forget the update_run.sh file that came with this docker-compose.yml file?\" && exit 1 ; else /tmp/update_run.sh && /etc/confluent/docker/run ; fi'"
We are using client v1.8.2.
The behavior is observed both on Windows and in Docker on Linux.
I have not seen anything suspicious in the broker logs, but if there is something specific to search for, I can look for it.
How to reproduce
- Run Kafka 3.0 in KRaft mode
- Start a producer writing 1 message per second (a minimal sketch is shown after this list)
- Start a consumer using the code above; at around the 10-second mark the behavior should start
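For the producer step, something along these lines works (a minimal sketch; the topic name matches the consumer code above, and the one-message-per-second rate is the only requirement):

using System;
using System.Threading;
using Confluent.Kafka;

var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" };

using var producer = new ProducerBuilder<Null, string>(producerConfig).Build();

var i = 0;
while (true)
{
    // One message per second to the topic the consumer subscribes to.
    producer.Produce("test-topic-1", new Message<Null, string> { Value = $"message {i++}" });
    producer.Flush(TimeSpan.FromSeconds(5));
    Thread.Sleep(TimeSpan.FromSeconds(1));
}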
Checklist
Please provide the following information:
- [x] A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
- [x] Confluent.Kafka nuget version.
- [x] Apache Kafka version.
- [x] Client configuration.
- [x] Operating system.
- [ ] Provide logs (with "debug" : "..." as necessary in configuration).
- [ ] Provide broker log excerpts.
- [ ] Critical issue.
Can you reproduce this on the latest client version (1.8.2) with Debug: "cgrp,topic,metadata,broker,fetch" set and provide us the logs? Thanks
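For reference, that is just the Debug property added to the consumer configuration from the report above (a sketch; everything else stays the same):

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "foo",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    MetadataMaxAgeMs = 10000,
    // Enables librdkafka debug logging for these contexts.
    Debug = "cgrp,topic,metadata,broker,fetch",
};

The debug output goes to stderr by default, or to a handler registered with SetLogHandler on the ConsumerBuilder.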
@edenhill Just wanted to check back in on this and make sure that the logs I provided were the right ones. Otherwise, are there any updates on this issue? Thanks!
Just triaging this. Logs look right. Apparent issue with metadata.max.age.ms noted.
KRaft has been marked as production-ready in AK 3.3; what is your experience now?
Hi @r-matthew-cline, thanks for the report. I tried reproducing this with the latest client version (1.9.3) and Kafka with KRaft, but I don't see any blocking of the consumer, even with a low MetadataMaxAgeMs.
Can you please try reproducing again with the latest version and verify whether it's working for you now?
@r-matthew-cline It seems there hasn't been a reply for a while. If this issue has resolved itself with the latest versions of Kafka or our clients, I can close it.