A down Kafka broker in the cluster causes KnowStreaming to keep logging the exception: Timed out waiting for a node assignment. Call: metadata

Open shirenchuang opened this issue 2 years ago • 5 comments

  • [x] I have searched the existing issues and this is not a duplicate.

Are you willing to claim this bug?

「 Y 」

Environment

  • KnowStreaming version : 3.0
  • Operating System version : MacOS
  • Java version : 11

Steps to reproduce

  1. Shut down one of the Kafka brokers in a cluster managed by KnowStreaming

  2. After a while, Timeout exception logs start being printed continuously

(screenshot: repeated Timeout exception log entries)

Expected result

Actual result

There should not be this many log entries, nor should they be printed this frequently.

At a first look, this exception is thrown while the client is issuing the Metadata request.

In fact, the Metadata request should not be sent to the broker that is down at all.

We need to find the root cause and fix it.


If there is an exception, please attach the exception trace:


2022-10-19 18:28:06.110 ERROR 44637 --- [1-11-thread-188] c.x.k.s.k.c.s.p.i.PartitionServiceImpl   : class=PartitionServiceImpl||method=getPartitionOffsetFromKafkaAdminClient||clusterPhyId=1||topicName=T4_3P_1R||errMsg=exception!

java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=metadata, deadlineMs=1666175286109, tries=50, nextAllowedTryMs=1666175286210) timed out at 1666175286110 after 50 attempt(s)
	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
	at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionServiceImpl.getPartitionOffsetFromKafkaAdminClient(PartitionServiceImpl.java:309)
	at com.xiaojukeji.know.streaming.km.core.service.version.impl.VersionControlServiceImpl.doHandler(VersionControlServiceImpl.java:90)
	at com.xiaojukeji.know.streaming.km.core.service.version.BaseVersionControlService.doVCHandler(BaseVersionControlService.java:62)
	at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionServiceImpl.getPartitionOffsetFromKafka(PartitionServiceImpl.java:222)
	at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionMetricServiceImpl.getOffsetRelevantMetrics(PartitionMetricServiceImpl.java:180)
	at com.xiaojukeji.know.streaming.km.core.service.version.impl.VersionControlServiceImpl.doHandler(VersionControlServiceImpl.java:90)
	at com.xiaojukeji.know.streaming.km.core.service.version.BaseVersionControlService.doVCHandler(BaseVersionControlService.java:62)
	at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionMetricServiceImpl.collectPartitionsMetricsFromKafka(PartitionMetricServiceImpl.java:146)
	at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionMetricServiceImpl.collectPartitionsMetricsFromKafkaWithCache(PartitionMetricServiceImpl.java:85)
	at com.xiaojukeji.know.streaming.km.collector.metric.PartitionMetricCollector.collectMetrics(PartitionMetricCollector.java:94)
	at com.xiaojukeji.know.streaming.km.collector.metric.PartitionMetricCollector.lambda$collectMetrics$0(PartitionMetricCollector.java:60)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=metadata, deadlineMs=1666175286109, tries=50, nextAllowedTryMs=1666175286210) timed out at 1666175286110 after 50 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: metadata
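For context on where the wrapped TimeoutException surfaces: judging by the stack trace, the collection task ends in a blocking get() on the KafkaFuture returned by AdminClient#listOffsets. A minimal sketch of that call pattern (the bootstrap address is a placeholder and the exact code in PartitionServiceImpl may differ):

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class ListOffsetsTimeoutRepro {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        // Placeholder bootstrap address; point it at a cluster with one broker shut down.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Same shape of call as the offset collection in the trace:
            // listOffsets(...) followed by a blocking get() on the returned KafkaFuture.
            TopicPartition tp = new TopicPartition("T4_3P_1R", 0);
            ListOffsetsResult result = admin.listOffsets(Map.of(tp, OffsetSpec.latest()));

            // If the partition currently has no leader, the admin client's internal
            // metadata call keeps retrying until the API timeout, and get() then throws
            // an ExecutionException wrapping org.apache.kafka.common.errors.TimeoutException.
            long latest = result.all().get().get(tp).offset();
            System.out.println("latest offset = " + latest);
        } catch (ExecutionException e) {
            System.err.println("offset lookup failed: " + e.getCause());
        }
    }
}
```

Run against the reproduction setup above (one broker down, a partition of T4_3P_1R without a leader), this should fail with the same "Timed out waiting for a node assignment. Call: metadata" cause shown in the trace.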

shirenchuang avatar Oct 19 '22 10:10 shirenchuang

Code path identified while investigating the problem:

When KafkaAdminClient issues a request, it first sends a Metadata request to fetch the cluster metadata.

When a broker in the cluster is down and the client fetches the metadata of a given topic, if any partition of that topic has Leader = -1, an exception is thrown upward; once the exception is thrown, the call is re-enqueued (i.e. the request is retried).

As a result, Metadata requests keep being sent over and over again until the call finally times out.

(screenshots: the KafkaAdminClient code paths referred to above)
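Not a root-cause fix, but worth noting while investigating: the retry loop described above is bounded by the admin client's API timeout (60 seconds by default), so the flood per collection round can at least be shortened. A sketch using standard AdminClient settings; the timeout values are arbitrary examples, not KnowStreaming defaults:

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsOptions;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class BoundedOffsetLookup {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Upper bound for a whole API call, including all internal metadata retries.
        props.put(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, "15000");
        // Upper bound for each individual request attempt.
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicPartition tp = new TopicPartition("T4_3P_1R", 0); // topic name from the log above
            // The per-call option overrides default.api.timeout.ms for this request only,
            // so a leaderless partition makes the call give up after ~5s of metadata retries
            // instead of retrying for the full default timeout.
            ListOffsetsOptions options = new ListOffsetsOptions().timeoutMs(5_000);
            long latest = admin.listOffsets(Map.of(tp, OffsetSpec.latest()), options)
                    .all().get().get(tp).offset();
            System.out.println("latest offset = " + latest);
        } catch (ExecutionException e) {
            // With a dead broker and a leaderless partition this is the TimeoutException
            // ("Timed out waiting for a node assignment. Call: metadata") seen in this issue.
            System.err.println("listOffsets gave up after the bounded timeout: " + e.getCause());
        }
    }
}
```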

shirenchuang avatar Oct 20 '22 12:10 shirenchuang

This happens whenever any partition of the queried topic has Leader = -1.

It is especially easy to hit with single-replica topics.
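One possible mitigation on the caller side (a sketch of the idea, not the project's actual fix): describe the topic first and skip partitions that currently have no leader, so the offset query never enters the metadata retry loop for them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.TopicPartitionInfo;

public class SkipLeaderlessPartitions {

    /** Returns the latest offsets only for partitions that currently have a leader. */
    static Map<TopicPartition, Long> latestOffsets(AdminClient admin, String topic)
            throws ExecutionException, InterruptedException {
        TopicDescription description =
                admin.describeTopics(List.of(topic)).all().get().get(topic);

        Map<TopicPartition, OffsetSpec> query = new HashMap<>();
        for (TopicPartitionInfo p : description.partitions()) {
            // Leader = -1 on the broker side shows up here as a null or empty leader node.
            if (p.leader() == null || p.leader().isEmpty()) {
                continue; // skip it instead of letting the metadata call retry until timeout
            }
            query.put(new TopicPartition(topic, p.partition()), OffsetSpec.latest());
        }

        Map<TopicPartition, Long> offsets = new HashMap<>();
        if (!query.isEmpty()) {
            admin.listOffsets(query).all().get()
                    .forEach((tp, info) -> offsets.put(tp, info.offset()));
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            System.out.println(latestOffsets(admin, "T4_3P_1R"));
        }
    }
}
```

Whether leaderless partitions should be skipped silently or reported explicitly by the metric collector is a separate design question; the point here is only that checking the leader up front avoids the repeated Metadata retries.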

shirenchuang avatar Oct 20 '22 12:10 shirenchuang

Regarding the Kafka client swallowing the actual exception, I have filed an improvement proposal with Kafka: https://issues.apache.org/jira/browse/KAFKA-14328

shirenchuang avatar Oct 21 '22 03:10 shirenchuang

Regarding the exception thrown by the Kafka client's listOffsets query, I have filed an improvement proposal with Kafka: https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14329

shirenchuang avatar Oct 24 '22 08:10 shirenchuang

Has this bug been fixed yet?

PXNPXN avatar Mar 29 '23 07:03 PXNPXN