
When loading a savepoint file, FileSystemNotFoundException occurs

Open JODYQU opened this issue 2 years ago • 2 comments

Describe the bug

In the kafka-hudi example, the Hudi data files are created successfully and the Flink checkpoint files are also written to HDFS. However, after stopping the job and rerunning it, the client first tries to load a savepoint from the configured checkpoint path, and that load fails with the exception: java.nio.file.FileSystemNotFoundException: Provider "hdfs" not installed.
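
For context, `java.nio.file.Paths.get(URI)` only resolves URI schemes for which a NIO `FileSystemProvider` is installed, and a stock JVM ships no provider for `hdfs`, which is exactly the message reported here. Below is a minimal, standalone sketch that reproduces the same exception; nothing in it is BitSail-specific, the class name is invented for illustration, and the URI is copied from the checkpoint work dir shown in the log further down.

```java
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Paths;

public class HdfsNioPathDemo {
    public static void main(String[] args) {
        URI checkpointUri =
            URI.create("hdfs://nameservice/bitsail/flink-1.11/checkpoints/BitSail_kafka_hudi_test1");
        try {
            // java.nio looks up an installed FileSystemProvider for the URI scheme;
            // only "file" (and "jar") are available by default, so "hdfs" is rejected.
            Paths.get(checkpointUri);
        } catch (FileSystemNotFoundException e) {
            System.out.println(e.getMessage()); // prints: Provider "hdfs" not installed
        }
    }
}
```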

To Reproduce

1) Run a job and wait until a checkpoint file has been created. 2) Rerun the same job; loading the savepoint file fails with the exception above.

We can reproduce the bug as follows (use screenshots if necessary):

  1. The job conf is:

     {
       "job": {
         "common": {
           "job_type": "STREAMING",
           "enable_dynamic_loader": false,
           "job_id": -1,
           "sync_ddl": false,
           "instance_id": 3124,
           "user_name": "test",
           "job_name": "BitSail_kafka_hudi_test1",
           "dry_run": true
         },
         "reader": {
           "format_type": "json",
           "connector": {
             "connector": {
               "auto": { "offset": { "reset": "earliest" } },
               "topic": "testTopic",
               "startup-mode": "earliest-offset",
               "bootstrap": { "servers": "PLAINTEXT://10.7.160.32:9092" },
               "group": { "id": "shuketest2" }
             }
           },
           "count_mode_record_threshold": 2000,
           "columns": [
             { "upperCaseName": "TEXT_TYPE", "name": "text_type", "type": "string" },
             { "upperCaseName": "VARCHAR_TYPE", "name": "varchar_type", "type": "string" },
             { "upperCaseName": "BIGINT_TYPE", "name": "bigint_type", "type": "long" }
           ],
           "metric_tag_name": "kafka",
           "class": "com.bytedance.bitsail.connector.legacy.kafka.source.KafkaSourceFunctionDAGBuilder",
           "enable_count_mode": true
         },
         "writer": {
           "hoodie": {
             "bucket": {
               "index": {
                 "num": { "buckets": "4" },
                 "hash": { "field": "text_type" }
               }
             },
             "datasource": { "write": { "recordkey": { "field": "text_type" } } },
             "table": { "name": "test_hudi" }
           },
           "path": "hdfs://nameservice/example_data/test_hudi",
           "columns": [
             { "upperCaseName": "TEXT_TYPE", "name": "text_type", "type": "string" },
             { "upperCaseName": "VARCHAR_TYPE", "name": "varchar_type", "type": "string" },
             { "upperCaseName": "BIGINT_TYPE", "name": "bigint_type", "type": "bigint" }
           ],
           "index": { "type": "BUCKET" },
           "class": "com.bytedance.bitsail.connector.legacy.hudi.dag.HudiSinkFunctionDAGBuilder",
           "write": {
             "operation": "upsert",
             "precombine": { "field": "varchar_type" }
           },
           "table": { "type": "MERGE_ON_READ" },
           "source_schema": "[{"name":"text_type","type":"string"},{"name":"varchar_type","type":"string"},{"name":"bigint_type","type":"long"}]",
           "sink_schema": "[{"name":"text_type","type":"string"},{"name":"varchar_type","type":"string"},{"name":"bigint_type","type":"bigint"}]"
         }
       }
     }

  2. The test data is: messages from a Kafka topic that already contains data.

  3. BitSail run command:

     bash ./bin/bitsail run --engine flink --conf ./examples/Kafka_Hudi_Example.json --execution-mode run -p 1=1 --deployment-mode yarn-per-job --queue default

  4. The error can be seen in the client output below:

[hdfs@shuke3 bitsail]$ bash ./bin/bitsail run --engine flink --conf ./examples/Kafka_Hudi_Example.json --execution-mode run -p 1=1 --deployment-mode yarn-per-job --queue default
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/bigdata/bitsail/libs/clients/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.5.0-152/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
19:00:45.602 [main] INFO com.bytedance.bitsail.client.entry.Entry - Load engine flink from classpath.
19:00:45.670 [main] INFO com.bytedance.bitsail.component.format.security.kerberos.security.KerberosAbstractSecurityModule - Kerberos principal=null, keytabPath=null, krb5confPath=null
19:00:45.670 [main] INFO com.bytedance.bitsail.component.format.security.kerberos.security.KerberosAbstractSecurityModule - Kerberos security module initialized.
19:00:45.671 [main] INFO com.bytedance.bitsail.client.entry.security.SecurityContextFactory - Module HadoopSecurityModule start login.
19:00:45.671 [main] INFO com.bytedance.bitsail.component.format.security.kerberos.security.HadoopSecurityModule - Hadoop module disabled.
19:00:45.679 [main] INFO com.bytedance.bitsail.client.entry.Entry - Final engine: flink.
19:00:45.684 [main] INFO com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner - Find flink dir = /bigdata/bitsail/embedded/flink in System configuration.
19:00:45.685 [main] INFO com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner - Load flink library from path: /bigdata/bitsail/embedded/flink/./lib.
19:00:45.698 [main] INFO com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner - Load flink runtime library /bigdata/bitsail/embedded/flink/./lib/flink-dist_2.11-1.11.6.jar to classpath.
19:00:45.712 [main] INFO com.bytedance.bitsail.entry.flink.savepoint.FlinkRunnerSavepointLoader - Flink checkpoint work dir = hdfs://nameservice/bitsail/flink-1.11/checkpoints/BitSail_kafka_hudi_test1.
19:00:46.038 [main] DEBUG org.apache.hadoop.util.Shell - setsid exited with exit code 0
19:00:46.096 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of successful kerberos logins and latency (milliseconds)])
19:00:46.101 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of failed kerberos logins and latency (milliseconds)])
19:00:46.102 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[GetGroups])
19:00:46.102 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field private org.apache.hadoop.metrics2.lib.MutableGaugeLong org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Renewal failures since startup])
19:00:46.102 [main] DEBUG org.apache.hadoop.metrics2.lib.MutableMetricsFactory - field private org.apache.hadoop.metrics2.lib.MutableGaugeInt org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Renewal failures since last successful login])
19:00:46.103 [main] DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl - UgiMetrics, User and group related metrics
19:00:46.117 [main] DEBUG org.apache.hadoop.security.SecurityUtil - Setting hadoop.security.token.service.use_ip to true
19:00:46.130 [main] DEBUG org.apache.hadoop.security.Groups - Creating new Groups object
19:00:46.132 [main] DEBUG org.apache.hadoop.util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
19:00:46.132 [main] DEBUG org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
19:00:46.133 [main] DEBUG org.apache.hadoop.security.JniBasedUnixGroupsMapping - Using JniBasedUnixGroupsMapping for Group resolution
19:00:46.133 [main] DEBUG org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback - Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMapping
19:00:46.241 [main] DEBUG org.apache.hadoop.security.Groups - Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=300000; warningDeltaMs=5000
19:00:46.247 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login
19:00:46.248 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - hadoop login commit
19:00:46.257 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - using local user:UnixPrincipal: hdfs
19:00:46.257 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - Using user: "UnixPrincipal: hdfs" with name hdfs
19:00:46.258 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - User entry: "hdfs"
19:00:46.258 [main] DEBUG org.apache.hadoop.security.UserGroupInformation - UGI loginUser:hdfs (auth:SIMPLE)
19:00:46.299 [main] DEBUG org.apache.hadoop.fs.FileSystem - Loading filesystems
19:00:46.312 [main] DEBUG org.apache.hadoop.fs.FileSystem - file:// = class org.apache.hadoop.fs.LocalFileSystem from /usr/hdp/3.1.5.0-152/hadoop/hadoop-common-3.1.1.3.1.5.0-152.jar
19:00:46.317 [main] DEBUG org.apache.hadoop.fs.FileSystem - viewfs:// = class org.apache.hadoop.fs.viewfs.ViewFileSystem from /usr/hdp/3.1.5.0-152/hadoop/hadoop-common-3.1.1.3.1.5.0-152.jar
19:00:46.319 [main] DEBUG org.apache.hadoop.fs.FileSystem - har:// = class org.apache.hadoop.fs.HarFileSystem from /usr/hdp/3.1.5.0-152/hadoop/hadoop-common-3.1.1.3.1.5.0-152.jar
19:00:46.321 [main] DEBUG org.apache.hadoop.fs.FileSystem - http:// = class org.apache.hadoop.fs.http.HttpFileSystem from /usr/hdp/3.1.5.0-152/hadoop/hadoop-common-3.1.1.3.1.5.0-152.jar
19:00:46.322 [main] DEBUG org.apache.hadoop.fs.FileSystem - https:// = class org.apache.hadoop.fs.http.HttpsFileSystem from /usr/hdp/3.1.5.0-152/hadoop/hadoop-common-3.1.1.3.1.5.0-152.jar
19:00:46.331 [main] DEBUG org.apache.hadoop.fs.FileSystem - hdfs:// = class org.apache.hadoop.hdfs.DistributedFileSystem from /usr/hdp/3.1.5.0-152/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar
19:00:46.348 [main] DEBUG org.apache.hadoop.fs.FileSystem - webhdfs:// = class org.apache.hadoop.hdfs.web.WebHdfsFileSystem from /usr/hdp/3.1.5.0-152/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar
19:00:46.350 [main] DEBUG org.apache.hadoop.fs.FileSystem - swebhdfs:// = class org.apache.hadoop.hdfs.web.SWebHdfsFileSystem from /usr/hdp/3.1.5.0-152/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar
19:00:46.500 [main] DEBUG org.apache.hadoop.fs.FileSystem - gs:// = class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem from /usr/hdp/3.1.5.0-152/hadoop-mapreduce/gcs-connector-hadoop3-1.9.17.3.1.5.0-152-shaded.jar
19:00:46.504 [main] DEBUG org.apache.hadoop.fs.FileSystem - s3n:// = class org.apache.hadoop.fs.s3native.NativeS3FileSystem from /usr/hdp/3.1.5.0-152/hadoop-mapreduce/hadoop-aws-3.1.1.3.1.5.0-152.jar
19:00:46.506 [main] DEBUG org.apache.hadoop.fs.FileSystem - Looking for FS supporting hdfs
19:00:46.507 [main] DEBUG org.apache.hadoop.fs.FileSystem - looking for configuration option fs.hdfs.impl
19:00:46.532 [main] DEBUG org.apache.hadoop.fs.FileSystem - Looking in service filesystems for implementation class
19:00:46.533 [main] DEBUG org.apache.hadoop.fs.FileSystem - FS for hdfs is class org.apache.hadoop.hdfs.DistributedFileSystem
19:00:46.575 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.client.use.legacy.blockreader.local = false
19:00:46.575 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.client.read.shortcircuit = true
19:00:46.576 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.client.domain.socket.data.traffic = false
19:00:46.576 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
19:00:46.596 [main] DEBUG org.apache.hadoop.hdfs.DFSClient - Sets dfs.client.block.write.replace-datanode-on-failure.min-replication to 0
19:00:46.621 [main] DEBUG org.apache.hadoop.hdfs.HAUtilClient - No HA service delegation token found for logical URI hdfs://nameservice
19:00:46.621 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.client.use.legacy.blockreader.local = false
19:00:46.622 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.client.read.shortcircuit = true
19:00:46.622 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.client.domain.socket.data.traffic = false
19:00:46.622 [main] DEBUG org.apache.hadoop.hdfs.client.impl.DfsClientConf - dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
19:00:46.638 [main] DEBUG org.apache.hadoop.io.retry.RetryUtils - multipleLinearRandomRetry = null
19:00:46.664 [main] DEBUG org.apache.hadoop.ipc.Server - rpcKind=RPC_PROTOCOL_BUFFER, rpcRequestWrapperClass=class org.apache.hadoop.ipc.ProtobufRpcEngine$RpcProtobufRequest, rpcInvoker=org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker@7d7758be
19:00:46.683 [main] DEBUG org.apache.hadoop.ipc.Client - getting client out of cache: org.apache.hadoop.ipc.Client@21d03963
19:00:47.422 [client DomainSocketWatcher] DEBUG org.apache.hadoop.net.unix.DomainSocketWatcher - org.apache.hadoop.net.unix.DomainSocketWatcher$2@1dd4f2c5: starting with interruptCheckPeriodMs = 60000
19:00:47.431 [main] DEBUG org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature is enabled.
19:00:47.440 [main] DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil - DataTransferProtocol not using SaslPropertiesResolver, no QOP found in configuration for dfs.data.transfer.protection
19:00:47.447 [main] INFO com.bytedance.bitsail.entry.flink.savepoint.FlinkRunnerSavepointLoader - Starting load checkpoint path for work dir = hdfs://nameservice/bitsail/flink-1.11/checkpoints/BitSail_kafka_hudi_test1.
19:00:47.481 [main] DEBUG org.apache.hadoop.ipc.Client - The ping interval is 60000 ms.
19:00:47.482 [main] DEBUG org.apache.hadoop.ipc.Client - Connecting to shuke1.test/10.7.160.32:8020
19:00:47.503 [IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs: starting, having connections 1
19:00:47.507 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs sending #0 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
19:00:47.513 [IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs got value #0
19:00:47.520 [main] DEBUG org.apache.hadoop.io.retry.RetryInvocationHandler - org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1951)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1427)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3100)
     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1154)
     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:966)
     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over shuke1.test/10.7.160.32:8020. Trying to failover immediately.
org.apache.hadoop.ipc.RemoteException: Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
     at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1951)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1427)
     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3100)
     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1154)
     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:966)
     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
     at org.apache.hadoop.ipc.Client.call(Client.java:1444)
     at org.apache.hadoop.ipc.Client.call(Client.java:1354)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
     at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:498)
     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
     at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
     at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
     at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)
     at com.bytedance.bitsail.entry.flink.savepoint.FlinkRunnerSavepointLoader.loadLatestCheckpointPath(FlinkRunnerSavepointLoader.java:120)
     at com.bytedance.bitsail.entry.flink.savepoint.FlinkRunnerSavepointLoader.loadSavepointPath(FlinkRunnerSavepointLoader.java:93)
     at com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner.getRunProcBuilder(FlinkEngineRunner.java:125)
     at com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner.getProcBuilder(FlinkEngineRunner.java:104)
     at com.bytedance.bitsail.client.entry.Entry.buildProcessBuilder(Entry.java:136)
     at com.bytedance.bitsail.client.entry.Entry.runCommand(Entry.java:118)
     at com.bytedance.bitsail.client.entry.security.SecurityContextFactory.doAs(SecurityContextFactory.java:71)
     at com.bytedance.bitsail.client.entry.Entry.main(Entry.java:90)
    

19:00:47.521 [main] DEBUG org.apache.hadoop.io.retry.RetryUtils - multipleLinearRandomRetry = null
19:00:47.522 [main] DEBUG org.apache.hadoop.ipc.Client - getting client out of cache: org.apache.hadoop.ipc.Client@21d03963
19:00:47.522 [main] DEBUG org.apache.hadoop.ipc.Client - The ping interval is 60000 ms.
19:00:47.522 [main] DEBUG org.apache.hadoop.ipc.Client - Connecting to shuke2.test/10.7.160.33:8020
19:00:47.523 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs: starting, having connections 2
19:00:47.523 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs sending #0 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
19:00:47.524 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs got value #0
19:00:47.524 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getFileInfo took 2ms
19:00:47.568 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs sending #1 org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing
19:00:47.569 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs got value #1
19:00:47.569 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getListing took 1ms
19:00:47.590 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs sending #2 org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing
19:00:47.590 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs got value #2
19:00:47.591 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getListing took 2ms
19:00:47.592 [IPC Parameter Sending Thread #0] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs sending #3 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
19:00:47.593 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs got value #3
19:00:47.594 [main] DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: getFileInfo took 2ms
19:00:47.604 [main] ERROR com.bytedance.bitsail.client.entry.Entry - Exception occurred when run command .
java.nio.file.FileSystemNotFoundException: Provider "hdfs" not installed
     at java.nio.file.Paths.get(Paths.java:147)
     at com.bytedance.bitsail.entry.flink.savepoint.FlinkRunnerSavepointLoader.loadLatestCheckpointPath(FlinkRunnerSavepointLoader.java:152)
     at com.bytedance.bitsail.entry.flink.savepoint.FlinkRunnerSavepointLoader.loadSavepointPath(FlinkRunnerSavepointLoader.java:93)
     at com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner.getRunProcBuilder(FlinkEngineRunner.java:125)
     at com.bytedance.bitsail.entry.flink.engine.FlinkEngineRunner.getProcBuilder(FlinkEngineRunner.java:104)
     at com.bytedance.bitsail.client.entry.Entry.buildProcessBuilder(Entry.java:136)
     at com.bytedance.bitsail.client.entry.Entry.runCommand(Entry.java:118)
     at com.bytedance.bitsail.client.entry.security.SecurityContextFactory.doAs(SecurityContextFactory.java:71)
     at com.bytedance.bitsail.client.entry.Entry.main(Entry.java:90)
19:00:47.608 [shutdown-hook-0] DEBUG org.apache.hadoop.ipc.Client - stopping client from cache: org.apache.hadoop.ipc.Client@21d03963
19:00:47.609 [shutdown-hook-0] DEBUG org.apache.hadoop.ipc.Client - stopping client from cache: org.apache.hadoop.ipc.Client@21d03963
19:00:47.609 [shutdown-hook-0] DEBUG org.apache.hadoop.ipc.Client - removing client from cache: org.apache.hadoop.ipc.Client@21d03963
19:00:47.609 [shutdown-hook-0] DEBUG org.apache.hadoop.ipc.Client - stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@21d03963
19:00:47.609 [shutdown-hook-0] DEBUG org.apache.hadoop.ipc.Client - Stopping client
19:00:47.610 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs: closed
19:00:47.610 [IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs: closed
19:00:47.610 [IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke1.test/10.7.160.32:8020 from hdfs: stopped, remaining connections 0
19:00:47.610 [IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (1605851606) connection to shuke2.test/10.7.160.33:8020 from hdfs: stopped, remaining connections 0
19:00:47.613 [Thread-5] DEBUG org.apache.hadoop.util.ShutdownHookManager - Completed shutdown in 0.006 seconds; Timeouts: 0
19:00:47.640 [Thread-5] DEBUG org.apache.hadoop.util.ShutdownHookManager - ShutdownHookManger completed shutdown.
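
The frames above show the failure happening in java.nio (`Paths.get`) immediately after the Hadoop `FileSystem` calls (including `FileSystem.exists` at `FlinkRunnerSavepointLoader.java:120`) had already succeeded against HDFS. For comparison only, the sketch below walks the same checkpoint work dir purely with Hadoop's `org.apache.hadoop.fs` API, which resolves `hdfs://` through Hadoop's own filesystem registry rather than NIO providers. This is a hypothetical illustration assuming the usual Flink layout `<work-dir>/<job-id>/chk-<n>`; it is not BitSail's actual fix, and `LatestCheckpointProbe` is an invented name.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestCheckpointProbe {
    public static void main(String[] args) throws Exception {
        // Checkpoint work dir taken from the log above.
        Path workDir = new Path("hdfs://nameservice/bitsail/flink-1.11/checkpoints/BitSail_kafka_hudi_test1");

        // Hadoop resolves the "hdfs" scheme via fs.hdfs.impl / its ServiceLoader registry,
        // so no java.nio FileSystemProvider is involved here.
        FileSystem fs = workDir.getFileSystem(new Configuration());

        Path latest = null;
        long latestMtime = Long.MIN_VALUE;
        // Assumed layout: <work-dir>/<flink-job-id>/chk-<n> (standard Flink checkpoint dirs).
        for (FileStatus jobDir : fs.listStatus(workDir)) {
            if (!jobDir.isDirectory()) {
                continue;
            }
            for (FileStatus chk : fs.listStatus(jobDir.getPath())) {
                if (chk.isDirectory()
                        && chk.getPath().getName().startsWith("chk-")
                        && chk.getModificationTime() > latestMtime) {
                    latestMtime = chk.getModificationTime();
                    latest = chk.getPath();
                }
            }
        }
        System.out.println("Latest checkpoint dir: " + latest);
    }
}
```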

(Note that the above sentences are not necessary for each bug report issue.)

Expected behavior

A clear and concise description of what you expected to happen.

Screenshots

If applicable, add screenshots to help explain your problem.

Build Environment

Describe the environment where the test project was built.

  • OS [e.g. Debian 11.0]
  • JDK [e.g. 1.8.0_301]
  • Maven [e.g. 3.6.3]
  • ...

Execution Environment

Describe the environment where the test project ran.

  • OS: CentOS 7.9
  • JDK: 1.8.0_310
  • Deployment: HDP YARN
  • ...

Additional context

Add any other context about the problem here.

JODYQU avatar Jan 14 '23 12:01 JODYQU

Please take a look at this issue @garyli1019

lichang-bd avatar Feb 08 '23 06:02 lichang-bd

@JODYQU hello, thanks for reporting this issue. Do you have other Flink jobs in your YARN cluster? If so, are their checkpoints running OK? This looks like an error between Flink and HDFS.

garyli1019 avatar Feb 08 '23 08:02 garyli1019