KnowStreaming
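When a Kafka cluster contains a broker that is down, KnowStreaming keeps logging the exception "Timed out waiting for a node assignment. Call: metadata"
- [x] I have searched the issues for related problems and found no duplicates.
Would you like to claim this bug?
「 Y 」
Environment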
- KnowStreaming version : 3.0
- Operating System version : MacOS
- Java version : 11
Steps to reproduce
- Shut down one of the brokers in a Kafka cluster managed by KnowStreaming.
- After a short while, Timeout exceptions are logged continuously.

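Expected result
Actual result
There should not be this many log entries, nor should they appear this frequently.
At first glance, the exception is thrown while requesting Metadata.
In practice, a Metadata request should not be sent to the broker that is down.
The root cause needs to be found and fixed.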
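Independently of the root cause, the log volume itself could be limited on the caller side. The following is only a minimal sketch of per-key log throttling, not KnowStreaming's actual code: the class `ErrorLogThrottle`, its method `shouldLog`, and the choice of key (for example the cluster id plus topic name) are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: allows at most one log entry per key within a time window.
public class ErrorLogThrottle {
    private final long minIntervalMs;
    private final Map<String, Long> lastLoggedAt = new ConcurrentHashMap<>();

    public ErrorLogThrottle(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /** Returns true if an error with this key should be logged at time nowMs. */
    public boolean shouldLog(String key, long nowMs) {
        boolean[] allowed = {false};
        // compute() runs atomically per key on a ConcurrentHashMap.
        lastLoggedAt.compute(key, (k, last) -> {
            if (last == null || nowMs - last >= minIntervalMs) {
                allowed[0] = true;
                return nowMs; // remember when this key was last logged
            }
            return last; // still inside the window: keep the old timestamp
        });
        return allowed[0];
    }

    public static void main(String[] args) {
        ErrorLogThrottle throttle = new ErrorLogThrottle(60_000L);
        String key = "clusterPhyId=1||topicName=T4_3P_1R"; // assumed key format
        System.out.println(throttle.shouldLog(key, 0L));      // true
        System.out.println(throttle.shouldLog(key, 1_000L));  // false
        System.out.println(throttle.shouldLog(key, 60_000L)); // true
    }
}
```

This only hides repeated entries; it does not stop the repeated Metadata requests themselves.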
If there is an exception, please attach the exception trace:
2022-10-19 18:28:06.110 ERROR 44637 --- [1-11-thread-188] c.x.k.s.k.c.s.p.i.PartitionServiceImpl : class=PartitionServiceImpl||method=getPartitionOffsetFromKafkaAdminClient||clusterPhyId=1||topicName=T4_3P_1R||errMsg=exception!
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=metadata, deadlineMs=1666175286109, tries=50, nextAllowedTryMs=1666175286210) timed out at 1666175286110 after 50 attempt(s)
at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionServiceImpl.getPartitionOffsetFromKafkaAdminClient(PartitionServiceImpl.java:309)
at com.xiaojukeji.know.streaming.km.core.service.version.impl.VersionControlServiceImpl.doHandler(VersionControlServiceImpl.java:90)
at com.xiaojukeji.know.streaming.km.core.service.version.BaseVersionControlService.doVCHandler(BaseVersionControlService.java:62)
at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionServiceImpl.getPartitionOffsetFromKafka(PartitionServiceImpl.java:222)
at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionMetricServiceImpl.getOffsetRelevantMetrics(PartitionMetricServiceImpl.java:180)
at com.xiaojukeji.know.streaming.km.core.service.version.impl.VersionControlServiceImpl.doHandler(VersionControlServiceImpl.java:90)
at com.xiaojukeji.know.streaming.km.core.service.version.BaseVersionControlService.doVCHandler(BaseVersionControlService.java:62)
at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionMetricServiceImpl.collectPartitionsMetricsFromKafka(PartitionMetricServiceImpl.java:146)
at com.xiaojukeji.know.streaming.km.core.service.partition.impl.PartitionMetricServiceImpl.collectPartitionsMetricsFromKafkaWithCache(PartitionMetricServiceImpl.java:85)
at com.xiaojukeji.know.streaming.km.collector.metric.PartitionMetricCollector.collectMetrics(PartitionMetricCollector.java:94)
at com.xiaojukeji.know.streaming.km.collector.metric.PartitionMetricCollector.lambda$collectMetrics$0(PartitionMetricCollector.java:60)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=metadata, deadlineMs=1666175286109, tries=50, nextAllowedTryMs=1666175286210) timed out at 1666175286110 after 50 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: metadata
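Code path identified during troubleshooting:
Before issuing a request, KafkaAdminClient first sends a Metadata request to fetch metadata.
When the Kafka cluster contains a broker that is down and the client fetches metadata for a given topic, it throws an exception if any partition of that topic has Leader = -1.
After the exception is thrown, the call is re-enqueued (the request is sent again),
so Metadata requests keep being sent until the call times out.
This happens whenever any partition of the queried topic has Leader = -1.
It is more likely to occur with single-replica topics.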
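One possible caller-side workaround is to skip leaderless partitions before querying their offsets. The following is only a sketch under stated assumptions, not the fix adopted by KnowStreaming or Kafka: the class `LeaderlessPartitionFilter` and method `partitionsWithLeader` are hypothetical, and the map of partition id to leader id is assumed to have been obtained elsewhere (for example from previously cached topic metadata).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keeps only partitions that currently have a leader.
public class LeaderlessPartitionFilter {
    // A leader id of -1 marks a partition without a leader, as described above.
    static final int NO_LEADER = -1;

    /** Returns the partition ids (in input order) whose leader id is not -1. */
    public static List<Integer> partitionsWithLeader(Map<Integer, Integer> leaderByPartition) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : leaderByPartition.entrySet()) {
            if (entry.getValue() != NO_LEADER) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example: partition 1 lost its leader because its only replica is down.
        Map<Integer, Integer> leaders = new LinkedHashMap<>();
        leaders.put(0, 1);
        leaders.put(1, NO_LEADER);
        leaders.put(2, 3);
        System.out.println(partitionsWithLeader(leaders)); // prints [0, 2]
    }
}
```

Only the partitions returned here would then be passed to the offset query, so a single leaderless partition would no longer cause the whole call to retry until it times out.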
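Regarding the Kafka client swallowing the actual exception, I filed an improvement suggestion with Kafka: https://issues.apache.org/jira/browse/KAFKA-14328
Regarding the Kafka client throwing an exception on listOffset queries, I filed an improvement suggestion with Kafka: https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14329
Has this bug been fixed?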