[SPARK-43861][CORE] Do not delete inprogress log

Open bluzy opened this issue 1 year ago • 8 comments

What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/38983 was merged, the Spark History Server deletes in-progress logs created before the expiration date even if the application is still running. This PR adds a check on the log's completed flag so that such logs are not deleted.

Why are the changes needed?

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

bluzy avatar Apr 12 '24 07:04 bluzy

Thank you for making a PR, @bluzy. Could you elaborate on the problem, specifically the in-progress logs created before the expiration date? What is your configuration for spark.history.fs.cleaner.maxAge?

Spark History Server deletes inprogress logs created before expiration date even if it's still running.

@dongjoon-hyun spark.history.fs.cleaner.maxAge is 14d, and my Spark streaming application was started 14 days ago. The Spark History Server deleted the streaming app's log file based on its modification date, and then an error like the following occurs:

java.io.FileNotFoundException: File does not exist: /user/pdd-dev/log/spark/application_1711009858797_276181.inprogress (inode 230295719) Holder DFSClient_NONMAPREDUCE_-582659247_1 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2890)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:605)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:172)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2769)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
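The situation described above can be sketched as an age-only expiry check. This is a minimal illustration, not Spark's actual implementation: `EventLog`, `is_expired`, and the sample paths are hypothetical names, and the only inputs assumed are a last-modified timestamp and the `spark.history.fs.cleaner.maxAge` value (14 days here, as in the report).

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class EventLog:
    """Hypothetical stand-in for an event log entry seen by the cleaner."""
    path: str
    last_modified_ms: int


def is_expired(log: EventLog, now_ms: int, max_age_ms: int) -> bool:
    # Age-only rule: the file name, and therefore the .inprogress state,
    # is never consulted when deciding whether the log is too old.
    return log.last_modified_ms < now_ms - max_age_ms


now = 100 * DAY_MS          # arbitrary "current time" for the example
max_age = 14 * DAY_MS       # assumed maxAge of 14d, as in the report

# A long-running streaming app whose log was last modified 15 days ago.
streaming = EventLog("application_X.inprogress", now - 15 * DAY_MS)
# A finished batch job from two days ago.
recent = EventLog("application_Y", now - 2 * DAY_MS)

print(is_expired(streaming, now, max_age))
print(is_expired(recent, now, max_age))
```

Under this rule the streaming application's log qualifies for deletion purely because of its age, which matches the failure reported above once the writer tries to append to the removed file.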

bluzy avatar Apr 16 '24 11:04 bluzy

+CC @HeartSaVioR

mridulm avatar Apr 16 '24 15:04 mridulm

Thanks, @bluzy . I understand your situation. Why don't you increase spark.history.fs.cleaner.maxAge?

Although I understand it's controversial, spark.history.fs.cleaner.maxAge has been defined as follows since Apache Spark 1.4.0:

  • https://spark.apache.org/docs/latest/monitoring.html

First, they're deleted if they're older than spark.history.fs.cleaner.maxAge.
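For reference, raising the limit would look roughly like the following fragment in the History Server's spark-defaults.conf. The property names are the documented cleaner settings; the values (30d, 1d) are illustrative assumptions, not recommendations from this thread.

```properties
# Enable periodic cleanup of event logs in the History Server.
spark.history.fs.cleaner.enabled   true
# How often the cleaner runs (illustrative value).
spark.history.fs.cleaner.interval  1d
# Logs older than this are deleted (illustrative value, raised from 14d).
spark.history.fs.cleaner.maxAge    30d
```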

dongjoon-hyun avatar Apr 16 '24 15:04 dongjoon-hyun

Note that when the driver crashes, the event file remains with the .inprogress suffix. Not deleting these files would fill up the event directory and eventually fail all jobs (depending on the fs limits for number of files per dir).

mridulm avatar Apr 17 '24 01:04 mridulm

Yes, Mridul's comment is correct. I believe the AS-IS behavior is robust and safe, and that it is the intended behavior rather than a bug.

WDYT, @bluzy ?

dongjoon-hyun avatar Apr 17 '24 03:04 dongjoon-hyun

@dongjoon-hyun @mridulm I think incorrect inprogress files would be deleted on the cleaner's schedule, wouldn't they? My concern is that many Spark streaming applications can live indefinitely until they need to be upgraded. It's difficult to set a spark.history.fs.cleaner.maxAge value while running streaming jobs.

bluzy avatar Apr 17 '24 05:04 bluzy

PTAL @dongjoon-hyun @mridulm

bluzy avatar May 02 '24 08:05 bluzy

reader.completed checks for the IN_PROGRESS suffix, which will be present here, @bluzy, so with this PR those files will not be cleaned up. (Some users/deployments have out-of-band workflows that move the file to remove the IN_PROGRESS suffix, but they are not part of Apache Spark itself.)
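The objection can be illustrated with a small sketch. It assumes, as stated in the comment, that completion is inferred only from the `.inprogress` suffix; `completed`, `should_delete`, and the file names are hypothetical and are not Spark APIs.

```python
IN_PROGRESS = ".inprogress"


def completed(path: str) -> bool:
    # Assumption from the discussion: "completed" means the file name
    # no longer carries the in-progress suffix.
    return not path.endswith(IN_PROGRESS)


def should_delete(path: str, expired: bool, skip_in_progress: bool) -> bool:
    # skip_in_progress=True models the guard proposed in this PR.
    if skip_in_progress and not completed(path):
        return False
    return expired


# Both logs are past maxAge. One belongs to a live streaming app, the
# other was left behind by a crashed driver; their names look the same.
live_streaming = "application_live.inprogress"
crashed_driver = "application_crashed.inprogress"

for guard in (False, True):
    print(guard,
          should_delete(live_streaming, True, guard),
          should_delete(crashed_driver, True, guard))
```

With the guard off, both files are deleted (the reported problem for the live application). With the guard on, both are kept, so the leftover file from the crashed driver is never cleaned up, because the suffix alone cannot tell the two cases apart.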

mridulm avatar May 07 '24 04:05 mridulm