starrocks-kubernetes-operator icon indicating copy to clipboard operation
starrocks-kubernetes-operator copied to clipboard

Flink starrocks stream JOB found restarting due to Connection fe timed out

Open zbingwen opened this issue 1 year ago • 16 comments

Describe the bug

Flink starrocks stream JOB found restarting due to Connection fe timed out

Please complete the following information

  • Operator Version: [ v1.8.6]
  • starrocks 3.1.3

log:

2023-11-23 20:14:34
java.io.IOException: Could not perform checkpoint 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126.
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1256)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierTracker.processBarrier(CheckpointBarrierTracker.java:107)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118)
	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
	at java.lang.Thread.run(Thread.java:750)
	Suppressed: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
		at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
		at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
		at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.close(StarRocksDynamicSinkFunctionV2.java:285)
		at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
		at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:115)
		at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163)
		at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125)
		at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1043)
		at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
		at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
		at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
		at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:951)
		at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$0(Task.java:934)
		at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
		at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:934)
		... 3 more
	Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
		at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:349)
		at com.starrocks.data.load.stream.DefaultStreamLoader.lambda$send$3(DefaultStreamLoader.java:172)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		... 1 more
	Caused by: com.starrocks.streamload.shade.org.apache.http.conn.HttpHostConnectException: Connect to starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local:8030 [starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local/10.132.225.119] failed: Connection timed out (Connection timed out)
		at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
		at com.starrocks.streamload.shade.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
		at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
		at com.starrocks.streamload.shade.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
		at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
		at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
		at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:304)
		... 7 more
	Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
		at java.net.PlainSocketImpl.socketConnect(Native Method)
		at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
		at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
		at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
		at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
		at java.net.Socket.connect(Socket.java:607)
		at com.starrocks.streamload.shade.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
		at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
		... 17 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126. Failure reason: Checkpoint was declined.
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:336)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:228)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:213)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192)
	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:350)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$14(StreamTask.java:1299)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1287)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1244)
	... 15 more
Caused by: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
	at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
	at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
	at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.snapshotState(StarRocksDynamicSinkFunctionV2.java:298)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:88)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:222)
	... 26 more
Caused by: [CIRCULAR REFERENCE: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265]

zbingwen avatar Nov 23 '23 12:11 zbingwen

I found below information from fe-leader:

-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,045 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,467 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,589 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,466 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,467 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,589 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,468 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,590 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,468 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,590 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,019 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,020 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,469 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,591 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,003 WARN (heartbeat-mgr-pool-1|129) [HeartbeatMgr$BackendHeartbeatHandler.call():315] backend heartbeat got exception, addr: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local:9050
org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
        at org.apache.thrift.transport.TSocket.open(TSocket.java:226) ~[libthrift-0.13.0.jar:0.13.0]
        at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:144) ~[starrocks-fe.jar:?]
        at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:129) ~[starrocks-fe.jar:?]
        at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
        at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
        at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:265) ~[starrocks-fe.jar:?]
        at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:251) ~[starrocks-fe.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:229) ~[?:?]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
        at java.net.Socket.connect(Socket.java:609) ~[?:?]
        at org.apache.thrift.transport.TSocket.open(TSocket.java:221) ~[libthrift-0.13.0.jar:0.13.0]
        ... 13 more
2023-11-22 11:06:23,004 WARN (heartbeat mgr|13) [HeartbeatMgr.runAfterCatalogReady():164] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,469 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,591 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local

zbingwen avatar Nov 23 '23 12:11 zbingwen

@kevincai Please help me , thank you in advance.

zbingwen avatar Nov 23 '23 12:11 zbingwen

Please see https://github.com/StarRocks/starrocks-kubernetes-operator/blob/main/doc/load_data_using_stream_load_howto.md

yandongxiao avatar Nov 24 '23 01:11 yandongxiao

@yandongxiao
都部署在同一个K8S集群,不存在内外网问题。 FLINK JOB 大部分都正常,是时不时出现该异常情况。 、

zbingwen avatar Nov 24 '23 10:11 zbingwen

This is a DNS Resolution issue, can you please give the network & DNS configuration in k8s?

yandongxiao avatar Nov 28 '23 02:11 yandongxiao

If you are using coreDNS, https://github.com/coredns/coredns/issues/4297 and https://github.com/coredns/coredns/issues/4357 issues may help you.

  1. one user said: I have upgraded our cluster to coredns:v1.7.0 and increased cpu resource 0.1 -> 0.5 cpu. Last 10 days we had 0 dns issue. I installed some tools to find sender. For now I will close this issue.
  2. deploying node local dns cache to see if it helps.

Hope these message may help you. And If you have solved the problem, please write a resolution.

yandongxiao avatar Nov 28 '23 02:11 yandongxiao

@yandongxiao coredns: 1.9.3.10 k8s: v1.22.15-aliyun.1 we use aliyun.

I add local /etc/hosts file ,but report this same error.

zbingwen avatar Nov 29 '23 13:11 zbingwen

我将fe 三个节点高可用模式,缩容为一个fe节点,该问题就消失了。 @yandongxiao

zbingwen avatar Nov 29 '23 13:11 zbingwen

Two conclusions can be drawn from the FE Leader's logs:

  1. Multiple threads are reporting SystemInfoService.getBackendWithBePort():556 at the same time. It is suspected that the JVM DNS Cache has expired, but it also failed when trying to get the domain name from the DNS Server.

  2. The context of org.apache.thrift.transport.TTransportException is that when the FE sends HeartBeat information to the BE, the domain name set by the Thrift Client directly calls the DNS Server to get the domain name, but it fails.

Whether /etc/hosts takes effect or not? And scale down FE replicas to 1 may not the root cause.

yandongxiao avatar Nov 30 '23 06:11 yandongxiao

Has the BE's Pod ever experienced a restart ?

yandongxiao avatar Nov 30 '23 06:11 yandongxiao

BE‘s pod is not restart . starrocks cluster health is ok by below command.

SHOW PROC '/frontends'; SHOW PROC '/backends';

zbingwen avatar Nov 30 '23 07:11 zbingwen

@yandongxiao I use fe headless svc , it is working ,not report restarting .

zbingwen avatar Dec 01 '23 02:12 zbingwen

@yandongxiao starrocks stream loading need session keep?

zbingwen avatar Dec 01 '23 02:12 zbingwen

No

yandongxiao avatar Dec 01 '23 02:12 yandongxiao

Caused by: com.starrocks.streamload.shade.org.apache.http.conn.HttpHostConnectException: Connect to starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local:8030 [starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local/10.132.225.119] failed: Connection timed out (Connection timed out)

Can you reduce the frequency of Stream Load and increase the amount of data for each Stream Load?

yandongxiao avatar Dec 01 '23 02:12 yandongxiao

@yandongxiao Yes , we can reduce the frequency of Stream Load and increase the amount of data for each Stream Load. but still report this.

zbingwen avatar Dec 01 '23 07:12 zbingwen