
Connector fails due to WAL lease related errors

Open ashikaumanga opened this issue 5 years ago • 7 comments

We were ingesting around 100 tables to HDFS and all of a sudden started getting the following two errors. Both of them are related to the WAL file lease. We use Confluent Platform 5.2.0.

Any tips on how to resolve this?

[2019-12-27 02:00:47,087] ERROR Failed creating a WAL Writer: Failed to APPEND_FILE /user/datalake/my_service/_incr_files_wal/SERVICE__MYSQL__SERVICE.SERVICE.SPRING_SESSION/0/log for DFSClient_NONMAPREDUCE_-887488635_610 on 100.74.107.48 because the file is under construction but no leases found.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3121)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2905)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:3212)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:3181)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:767)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:433)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
 (io.confluent.connect.hdfs.wal.WALFile)
[2019-12-27 02:00:47,087] INFO Cannot acquire lease on WAL hdfs://nameservice1//user/datalake/my_service/_incr_files_wal/SERVICE__MYSQL__SERVICE.SERVICE.SPRING_SESSION/0/log (io.confluent.connect.hdfs.wal.FSWAL)
[2019-12-27 02:00:47,171] ERROR Recovery failed at state RECOVERY_PARTITION_PAUSED (io.confluent.connect.hdfs.TopicPartitionWriter)
org.apache.kafka.connect.errors.DataException: Error creating writer for log file hdfs://nameservice1//user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/log
	at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:92)
	at io.confluent.connect.hdfs.wal.FSWAL.apply(FSWAL.java:106)
	at io.confluent.connect.hdfs.TopicPartitionWriter.applyWAL(TopicPartitionWriter.java:635)
	at io.confluent.connect.hdfs.TopicPartitionWriter.recover(TopicPartitionWriter.java:259)
	at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:324)
	at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:395)
	at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:132)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:868)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
	at java.io.DataInputStream.readFully(DataInputStream.java:195)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at io.confluent.connect.hdfs.wal.WALFile$Reader.init(WALFile.java:580)
	at io.confluent.connect.hdfs.wal.WALFile$Reader.initialize(WALFile.java:547)
	at io.confluent.connect.hdfs.wal.WALFile$Reader.<init>(WALFile.java:463)
	at io.confluent.connect.hdfs.wal.WALFile$Writer.<init>(WALFile.java:164)
	at io.confluent.connect.hdfs.wal.WALFile.createWriter(WALFile.java:78)
	at io.confluent.connect.hdfs.wal.FSWAL.acquireLease(FSWAL.java:74)
	... 17 more

ashikaumanga avatar Dec 27 '19 04:12 ashikaumanga

... because the file is under construction but no leases

This error is coming from the Hadoop client. I think you may be running into one of these HDFS bugs:

  • https://issues.apache.org/jira/browse/HDFS-10992
  • https://issues.apache.org/jira/browse/HDFS-10763

The status on HDFS-10763 indicates it is fixed in 2.7.4. Do you mind trying to update the pom.xml to use hadoop.version 2.7.4 and doing a local build to see if the error goes away? You will need some upstream repositories to do the build locally: https://github.com/confluentinc/kafka-connect-hdfs#development https://github.com/confluentinc/kafka-connect-hdfs/wiki/FAQ
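
If it helps, here is a rough sketch of that local build (untested; it assumes the pom exposes hadoop.version as a Maven property that can be overridden from the command line, otherwise edit the property in pom.xml directly):

git clone https://github.com/confluentinc/kafka-connect-hdfs.git
cd kafka-connect-hdfs
# Build against Hadoop client 2.7.4; this needs the upstream repositories from the links above.
mvn clean package -DskipTests -Dhadoop.version=2.7.4
# Then drop the rebuilt artifacts from target/ into your Connect plugin path and restart the workers.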

ncliang avatar Dec 30 '19 21:12 ncliang

Hi @ncliang, we hit the same issue in our environment. After following your advice and rebuilding this connector with hadoop.version 2.7.4, the error still exists.

alex-lx avatar Mar 05 '20 01:03 alex-lx

Hi, I've seen this issue happen numerous times on our Hadoop cluster. We run the confluentinc/cp-kafka-connect:5.3.0 image in Rancher. If a connector to HDFS is not deleted before the image dies, this error occurs.

The WAL lease error you have is caused by HDFS assuming there is already an open writer to that log file. For example, hdfs://nameservice1//user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/log already has an open lease on it; if you were to run the ls command on the directory, you would probably find a log.1 alongside it.

The only way I've managed to fix this is by deleting the original log file as whatever user the directory belongs to. If there is another way to kill the WAL lease, then I would assume that would work also.
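
Roughly, what I do looks like this (just a sketch; the path is the one from the error above, and the commands have to be run as the user that owns the WAL directory):

# List the WAL directory; you will likely see both "log" and "log.1" in it.
hdfs dfs -ls /user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/
# Delete the stale log file; the connector should recreate it on the next recovery.
hdfs dfs -rm /user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/log
# Possibly an alternative (not something I have verified here): ask the NameNode
# to recover the dangling lease instead of deleting the file.
hdfs debug recoverLease -path /user/datalake/spdb__MAMAWARI__MYSQL__personaldb/_incr_files_wal/MAMAWARI__MYSQL__personaldb.personaldb.family/0/log -retries 3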

gregjm94 avatar May 01 '20 11:05 gregjm94

We've had issues similar to this one ever since we started using the hdfs sink connector a few years ago.

beregon87 avatar May 05 '20 11:05 beregon87

I have the same issue! Any solutions?

ERROR Exception on topic partition ****-0:  (io.confluent.connect.hdfs3.TopicPartitionWriter)
org.apache.kafka.connect.errors.DataException: Error creating writer for log file /****/0/log
	at io.confluent.connect.hdfs3.wal.FSWAL.acquireLease(FSWAL.java:82)
	at io.confluent.connect.hdfs3.wal.FSWAL.append(FSWAL.java:47)
	at io.confluent.connect.hdfs3.TopicPartitionWriter.beginAppend(TopicPartitionWriter.java:733)
	at io.confluent.connect.hdfs3.TopicPartitionWriter.appendToWAL(TopicPartitionWriter.java:724)
	at io.confluent.connect.hdfs3.TopicPartitionWriter.write(TopicPartitionWriter.java:384)
	at io.confluent.connect.hdfs3.DataWriter.write(DataWriter.java:359)
	at io.confluent.connect.hdfs3.Hdfs3SinkTask.put(Hdfs3SinkTask.java:108)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:539)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Mkdirs failed to create /*** (exists=false, cwd=file:/)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at io.confluent.connect.hdfs3.wal.WALFile$Writer.<init>(WALFile.java:164)
	at io.confluent.connect.hdfs3.wal.WALFile.createWriter(WALFile.java:67)
	at io.confluent.connect.hdfs3.wal.FSWAL.acquireLease(FSWAL.java:64)

MahsaSeifikar avatar Jul 14 '20 12:07 MahsaSeifikar

@MahsaSeifikar Looking at the logs you included, I don't think it is the same issue. Rather, it looks like you have misconfigured the connector logs.dir option.
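
The "Mkdirs failed ... cwd=file:/" part suggests the WAL path is being resolved against the local filesystem. It is worth double-checking the connector's storage settings, for example like this (a sketch; the worker address and connector name are placeholders):

# Inspect the running connector's configuration via the Connect REST API.
curl -s http://localhost:8083/connectors/my-hdfs3-sink/config
# Check that the HDFS URL (hdfs.url / store.url, depending on the connector version)
# points at your cluster, e.g. hdfs://nameservice1, and that logs.dir resolves to a
# directory on HDFS rather than a local path.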

beregon87 avatar Jul 15 '20 17:07 beregon87

@beregon87 Yup, I solved it. Thanks.

MahsaSeifikar avatar Jul 21 '20 07:07 MahsaSeifikar