confluent-kafka-dotnet icon indicating copy to clipboard operation
confluent-kafka-dotnet copied to clipboard

Metadata refresh blocking consumer

Open r-matthew-cline opened this issue 3 years ago • 6 comments

Description

When doing some evaluation of moving to Kraft mode, we discovered an issue when a consumer attempts to refresh topic metadata at a rate greater than every ~15 seconds. When the refresh happens, the consumer will block for ~10 seconds, then consume all of the missed messages and immediately block again for another ~10 seconds. This continues indefinitely if the refresh loop is too tight. If the refresh loop is a little less tight, the same behavior can be seen, but the consumer will periodically break out of the blocking loop and continue consuming as normal until eventually dropping back into the blocking loop again.

We first noticed this in one of our applications where we subscribing to topics via regex. When running in zookeeper mode, the consumers max.metadata.age.ms setting was not being obeyed, so we had a background task using the AdminClient to refresh the metadata manually every second. In zookeeper mode this was working as expected, but after switching to Kraft mode the aforementioned behavior was observed. To limit any outside influences from our application, I wrote the following simple console app and observed the same behavior. I have also dropped into the docker container and run the console consumer using the max.metadata.age.ms setting and not observed the behavior seen in the .Net client.

using System.Collections.Generic;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "foo",
    AutoOffsetReset = AutoOffsetReset.Earliest,
    MetadataMaxAgeMs = 10000,
};

using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
{
    consumer.Subscribe("test-topic-1");

    while(true)
    {
        var consumeResult = consumer.Consume();

        Console.WriteLine(string.Format("{0}: {1}", consumeResult.Topic, consumeResult.Message.Value));
    }

    consumer.Close();
}

We are running the Confluent Kafka image in Docker using the update_run.sh script to override the Zookeeper checks and work in Kraft mode. Docker Compose file:

services:

  kafka1:
    image: confluentinc/cp-kafka:7.0.0
    build:
      context: .
      dockerfile: Dockerfile
    hostname: kafka1
    ports:
      - 9092:9092
      - 29092:29092
      - 19092:19092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_NODE_ID: 1
      KAFKA_CLUSTER_ID: "vlzHauhfTMq2S7cLtnwe6A"
      KAFKA_PROCESS_ROLES: 'broker,controller'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@kafka1:19092'
      KAFKA_LISTENERS: 'PLAINTEXT_HOST://0.0.0.0:29092,CONTROLLER://0.0.0.0:19092,PLAINTEXT://0.0.0.0:9092'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:29092,PLAINTEXT_HOST://kafka1:9092,CONTROLLER://kakfa1:19092
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_DEFAULT_REPLICATION_FACTOR: 1
      KAFKA_MIN_INSYNC_REPLICAS: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
      KAFKA_LOG4J_ROOT_LOGLEVEL: 'ALL'
      # KAFKA_LOG4J_LOGGERS: 'kafka.controller=ALL,kafka.producer.async.DefaultEventHandler=ALL,state.change.logger=ALL'
      KAFKA_JMX_PORT: 9102
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_LOG_DIRS: '/tmp/kraft-combined-logs'
      KAFKA_HEAP_OPTS: '-Xmx1G -Xms1G'
    volumes:
      # - kafka1-data:/tmp/kraft-combined-logs
      - ./update_run.sh:/tmp/update_run.sh
    command: "bash -c 'if [ ! -f /tmp/update_run.sh ]; then echo \"ERROR: Did you forget the update_run.sh file that came with this docker-compose.yml file?\" && exit 1 ; else /tmp/update_run.sh && /etc/confluent/docker/run ; fi'"

We are using client v1.8.2

Behavior is observed on both Windows and Docker in Linux.

I have not seen anything suspicious in the logs for the broker, but if there is something specific to search for, I can.

How to reproduce

  1. Run Kafka 3.0 in Kraft mode
  2. Start producer writing 1 message per second
  3. Start consumer using the code above, at ~10 second mark the behavior should start happening

Checklist

Please provide the following information:

  • [x] A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • [x] Confluent.Kafka nuget version.
  • [x] Apache Kafka version.
  • [x] Client configuration.
  • [x] Operating system.
  • [ ] Provide logs (with "debug" : "..." as necessary in configuration).
  • [ ] Provide broker log excerpts.
  • [ ] Critical issue.

r-matthew-cline avatar Dec 15 '21 14:12 r-matthew-cline

Can you reproduce this on the latest client version (1.8.2) with Debug: "cgrp,topic,metadata,broker,fetch" set and provide us the logs? Thanks

edenhill avatar Dec 15 '21 14:12 edenhill

@edenhill Just wanted to check back in on this and make sure that the logs I provided were the right ones. Otherwise are there any updates on this issue? Thanks!

r-matthew-cline avatar Jan 12 '22 14:01 r-matthew-cline

just triaging this. logs look right. apparent issue with max.metadata.age.ms noted.

KRaft has been marked as production ready in AK 3.3, what is your experience now?

mhowlett avatar Oct 07 '22 18:10 mhowlett

Hi @r-matthew-cline, Thanks for the report. I tried reproducing this with the latest client version (1.9.3), and Kafka with KRaft but I don't see any blocking of the consumer with even low MetadataMaxAgeMs.

Can you please try reproducing again with the latest version and verify it it's working for you now?

anchitj avatar Nov 09 '22 09:11 anchitj

@r-matthew-cline it seems there hasn't been a reply for a while. If this issue has solved itself with the latest versions of Kafka or our clients, I can close

nhaq-confluent avatar Mar 12 '24 11:03 nhaq-confluent