Connector fails due to WAL lease-related errors
We were ingesting around 100 tables to HDFS and all of a sudden started getting the following two errors. Both of them relate to the WAL file lease. We use Confluent Platform 5.2.0.
Any tips on how to resolve this?
[2019-12-27 02:00:47,087] ERROR Failed creating a WAL Writer: Failed to APPEND_FILE /user/datalake/my_service/_incr_files_wal/SERVICE__MYSQL__SERVICE.SERVICE.SPRING_SESSION/0/log for DFSClient_NONMAPREDUCE_-887488635_610 on 100.74.107.48 because the file is under construction but no leases found.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3121)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2905)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:3212)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:3181)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:767)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:433)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
(io.confluent.connect.hdfs.wal.WALFile)
[2019-12-27 02:00:47,087] INFO Cannot acquire lease on WAL hdfs://nameservice1//user/datalake/my_service/_incr_files_wal/SERVICE__MYSQL__SERVICE.SERVICE.SPRING_SESSION/0/log (io.confluent.connect.hdfs.wal.FSWAL)
[2019-12-27 02:00:47,171] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
org.apache.kafka.connect.errors.DataException: Error creating writer for log file hdfs://nameservice1//user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/log
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:92)
at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:106)
at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:635)
at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:259)
at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:324)
at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:395)
at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:132)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:868)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at io.confluent.connect.hdfs.wal.WALFile$Reader.init(WALFile.java:580)
at io.confluent.connect.hdfs.wal.WALFile$Reader.initialize(WALFile.java:547)
at io.confluent.connect.hdfs.wal.WALFile$Reader.<init>(WALFile.java:463)
at io.confluent.connect.hdfs.wal.WALFile$Writer.<init>(WALFile.java:164)
at io.confluent.connect.hdfs.wal.WALFile.createWriter(WALFile.java:78)
at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:74)
... 17 more
... because the file is under construction but no leases
This error is coming from the Hadoop client. I think you may be running into one of these HDFS bugs:
- https://issues.apache.org/jira/browse/HDFS-10992
- https://issues.apache.org/jira/browse/HDFS-10763
The status on HDFS-10763 indicates it is fixed in 2.7.4. Do you mind trying to update the pom.xml to use hadoop.version 2.7.4 and doing a local build to see if the error goes away? You will need some upstream repositories to do the build locally: https://github.com/confluentinc/kafka-connect-hdfs#development and https://github.com/confluentinc/kafka-connect-hdfs/wiki/FAQ
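Not part of the suggestion above, but if you do try the rebuild, one quick sanity check is to confirm which Hadoop client the rebuilt connector actually puts on the worker's classpath. A minimal sketch (the class name is just for illustration):

```java
import org.apache.hadoop.util.VersionInfo;

// Prints the Hadoop client version that is actually on the classpath,
// e.g. to confirm a rebuilt connector really bundles 2.7.4.
public class PrintHadoopVersion {
  public static void main(String[] args) {
    System.out.println("Hadoop client version: " + VersionInfo.getVersion()
        + " (built " + VersionInfo.getDate() + ")");
  }
}
```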
Hi @ncliang, we hit the same issue in our environment. After following your advice and rebuilding this connector with hadoop.version 2.7.4, the error still exists.
Hi, I've found this issue happens numerous times on our Hadoop cluster. We run the confluentinc/cp-kafka-connect:5.3.0 image in Rancher. If a connector to HDFS is not deleted before the image dies, this error occurs.
The WAL lease error you have is caused by HDFS assuming there is already an open writer to that log file. For example,
hdfs://nameservice1//user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/log
already has an open lease on it; if you run ls on that directory, you will probably find a log.1 alongside it.
The only way I've managed to fix this is by deleting the original log file as whichever user owns the directory. If there is another way to release the WAL lease, I would assume that would work as well.
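For anyone needing to do that cleanup programmatically, here is a minimal sketch against the HDFS Java API. It assumes the WAL path and nameservice from the logs above and that it runs as the user who owns the directory; recoverLease is specific to DistributedFileSystem, and whether lease recovery alone is enough (versus deleting the log file) may vary.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ReleaseWalLease {
  public static void main(String[] args) throws Exception {
    // WAL directory taken from the log lines above; substitute your own topic/partition.
    Path walDir = new Path("/user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/"
        + "MAMAWARI__MYSQL__personaldb.personaldb.family/0");
    Configuration conf = new Configuration(); // expects core-site.xml / hdfs-site.xml on the classpath
    FileSystem fs = FileSystem.get(URI.create("hdfs://nameservice1"), conf);

    // Equivalent of running ls on the WAL directory: you will often see a log.1 next to log.
    for (FileStatus status : fs.listStatus(walDir)) {
      System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
    }

    Path log = new Path(walDir, "log");
    if (fs instanceof DistributedFileSystem) {
      // Ask the NameNode to recover the dangling lease left behind by the dead writer.
      boolean closed = ((DistributedFileSystem) fs).recoverLease(log);
      System.out.println("Lease recovered and file closed: " + closed);
    }

    // If recovering the lease is not enough, delete the WAL file itself;
    // the connector recreates it the next time the topic partition is recovered.
    // fs.delete(log, false);
  }
}
```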
We've had issues similar to this one ever since we started using the HDFS sink connector a few years ago.
I have the same issue! Any solutions?
ERROR Exception on topic partition ****-0: (io.confluent.connect.hdfs3.TopicPartitionWriter)
org.apache.kafka.connect.errors.DataException: Error creating writer for log file /****/0/log
at io.confluent.connect.hdfs3.wal.FSWAL.acquireLease(FSWAL.java:82)
at io.confluent.connect.hdfs3.wal.FSWAL.append(FSWAL.java:47)
at io.confluent.connect.hdfs3.TopicPartitionWriter.beginAppend(TopicPartitionWriter.java:733)
at io.confluent.connect.hdfs3.TopicPartitionWriter.appendToWAL(TopicPartitionWriter.java:724)
at io.confluent.connect.hdfs3.TopicPartitionWriter.write(TopicPartitionWriter.java:384)
at io.confluent.connect.hdfs3.DataWriter.write(DataWriter.java:359)
at io.confluent.connect.hdfs3.Hdfs3SinkTask.put(Hdfs3SinkTask.java:108)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:539)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Mkdirs failed to create /*** (exists=false, cwd=file:/)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
at io.confluent.connect.hdfs3.wal.WALFile$Writer.<init>(WALFile.java:164)
at io.confluent.connect.hdfs3.wal.WALFile.createWriter(WALFile.java:67)
at io.confluent.connect.hdfs3.wal.FSWAL.acquireLease(FSWAL.java:64)
@MahsaSeifikar Looking at the logs you included, I don't think it is the same issue. Rather, it looks like you have misconfigured the connector logs.dir option.
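For context on that diagnosis: the cwd=file:/ in the cause above means the log path is being resolved against the local filesystem rather than HDFS, which is what happens when the configured path is not fully qualified and no HDFS default filesystem takes effect. A small sketch of that resolution behavior, with illustrative paths rather than the connector's actual config handling:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathResolutionDemo {
  public static void main(String[] args) throws Exception {
    // With the default fs.defaultFS (file:///), an unqualified path resolves to the
    // local ChecksumFileSystem -- the same filesystem that shows up in the
    // "Mkdirs failed ... cwd=file:/" cause above.
    Configuration conf = new Configuration();
    System.out.println(new Path("/some/logs/dir").getFileSystem(conf).getUri()); // file:///

    // Pointing fs.defaultFS at the cluster (assuming its client configs are on the
    // classpath so the nameservice resolves) makes the same path resolve to HDFS.
    conf.set("fs.defaultFS", "hdfs://nameservice1");
    System.out.println(new Path("/some/logs/dir").getFileSystem(conf).getUri()); // hdfs://nameservice1
  }
}
```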
@beregon87 yup I solved that. Thanks