starrocks-kubernetes-operator
starrocks-kubernetes-operator copied to clipboard
Flink starrocks stream JOB found restarting due to Connection fe timed out
Describe the bug
Flink starrocks stream JOB found restarting due to Connection fe timed out
Please complete the following information
- Operator Version: [ v1.8.6]
- starrocks 3.1.3
log:
2023-11-23 20:14:34
java.io.IOException: Could not perform checkpoint 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126.
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1256)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierTracker.processBarrier(CheckpointBarrierTracker.java:107)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:118)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:750)
Suppressed: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.close(StarRocksDynamicSinkFunctionV2.java:285)
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:115)
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125)
at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1043)
at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:951)
at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$0(Task.java:934)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:934)
... 3 more
Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:349)
at com.starrocks.data.load.stream.DefaultStreamLoader.lambda$send$3(DefaultStreamLoader.java:172)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: com.starrocks.streamload.shade.org.apache.http.conn.HttpHostConnectException: Connect to starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local:8030 [starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local/10.132.225.119] failed: Connection timed out (Connection timed out)
at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at com.starrocks.streamload.shade.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at com.starrocks.streamload.shade.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at com.starrocks.streamload.shade.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at com.starrocks.streamload.shade.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
at com.starrocks.data.load.stream.DefaultStreamLoader.send(DefaultStreamLoader.java:304)
... 7 more
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at com.starrocks.streamload.shade.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
at com.starrocks.streamload.shade.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 17 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 2172 for operator Sink: sink_2_starrocks_job[3] (3/3)#126. Failure reason: Checkpoint was declined.
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:336)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:228)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:213)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:350)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$14(StreamTask.java:1299)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1287)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1244)
... 15 more
Caused by: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.snapshotState(StarRocksDynamicSinkFunctionV2.java:298)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:88)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:222)
... 26 more
Caused by: [CIRCULAR REFERENCE: com.starrocks.data.load.stream.exception.StreamLoadFailException: Stream load failed because of unknown exception, db: ydata, table: table_test, label: sink_2_starrocks_job_svc_-e51f7db0-acb9-4fb7-8ae8-78fe9d780265]
I found below information from fe-leader:
-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,045 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,467 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:18,589 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,466 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,467 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:19,589 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,468 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:20,590 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,468 WARN (thrift-server-pool-6|181) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:21,590 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,019 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,020 WARN (thrift-server-pool-150|484) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,469 WARN (thrift-server-pool-5|180) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:22,591 WARN (thrift-server-pool-151|485) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,003 WARN (heartbeat-mgr-pool-1|129) [HeartbeatMgr$BackendHeartbeatHandler.call():315] backend heartbeat got exception, addr: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local:9050
org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
at org.apache.thrift.transport.TSocket.open(TSocket.java:226) ~[libthrift-0.13.0.jar:0.13.0]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:144) ~[starrocks-fe.jar:?]
at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:129) ~[starrocks-fe.jar:?]
at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:265) ~[starrocks-fe.jar:?]
at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:251) ~[starrocks-fe.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:229) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.thrift.transport.TSocket.open(TSocket.java:221) ~[libthrift-0.13.0.jar:0.13.0]
... 13 more
2023-11-22 11:06:23,004 WARN (heartbeat mgr|13) [HeartbeatMgr.runAfterCatalogReady():164] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.UnknownHostException: starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,469 WARN (thrift-server-pool-4|179) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
2023-11-22 11:06:23,591 WARN (thrift-server-pool-152|486) [SystemInfoService.getBackendWithBePort():556] failed to get right ip by fqdn starrocks-cluster-be-0.starrocks-cluster-be-search.staging-ydata-v30.svc.cluster.local
@kevincai Please help me , thank you in advance.
Please see https://github.com/StarRocks/starrocks-kubernetes-operator/blob/main/doc/load_data_using_stream_load_howto.md
@yandongxiao
都部署在同一个K8S集群,不存在内外网问题。 FLINK JOB 大部分都正常,是时不时出现该异常情况。 、
This is a DNS Resolution issue, can you please give the network & DNS configuration in k8s?
If you are using coreDNS, https://github.com/coredns/coredns/issues/4297 and https://github.com/coredns/coredns/issues/4357 issues may help you.
- one user said: I have upgraded our cluster to coredns:v1.7.0 and increased cpu resource 0.1 -> 0.5 cpu. Last 10 days we had 0 dns issue. I installed some tools to find sender. For now I will close this issue.
- deploying node local dns cache to see if it helps.
Hope these message may help you. And If you have solved the problem, please write a resolution.
@yandongxiao coredns: 1.9.3.10 k8s: v1.22.15-aliyun.1 we use aliyun.
I add local /etc/hosts file ,but report this same error.
我将fe 三个节点高可用模式,缩容为一个fe节点,该问题就消失了。 @yandongxiao
Two conclusions can be drawn from the FE Leader's logs:
-
Multiple threads are reporting SystemInfoService.getBackendWithBePort():556 at the same time. It is suspected that the JVM DNS Cache has expired, but it also failed when trying to get the domain name from the DNS Server.
-
The context of org.apache.thrift.transport.TTransportException is that when the FE sends HeartBeat information to the BE, the domain name set by the Thrift Client directly calls the DNS Server to get the domain name, but it fails.
Whether /etc/hosts takes effect or not? And scale down FE replicas to 1 may not the root cause.
Has the BE's Pod ever experienced a restart ?
BE‘s pod is not restart . starrocks cluster health is ok by below command.
SHOW PROC '/frontends'; SHOW PROC '/backends';
@yandongxiao I use fe headless svc , it is working ,not report restarting .
@yandongxiao starrocks stream loading need session keep?
No
Caused by: com.starrocks.streamload.shade.org.apache.http.conn.HttpHostConnectException: Connect to starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local:8030 [starrocks-cluster-fe-2.starrocks-cluster-fe-search.staging-ydata-v30.svc.cluster.local/10.132.225.119] failed: Connection timed out (Connection timed out)
Can you reduce the frequency of Stream Load and increase the amount of data for each Stream Load?
@yandongxiao Yes , we can reduce the frequency of Stream Load and increase the amount of data for each Stream Load. but still report this.