[SPARK-43861][CORE] Do not delete inprogress log
What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/38983 was merged, the Spark History Server deletes in-progress event logs that are older than the expiration date even if the application is still running. This PR adds a condition that checks the log's completed flag before deleting.
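Conceptually, the change adds a "completed" check to the age-based cleanup decision. A minimal Scala sketch of the idea follows; `EventLog` and `CleanerRule` are hypothetical stand-ins for the history provider's internal bookkeeping, not Spark's actual API:

```scala
// Hypothetical model of what the History Server's cleaner knows per log.
case class EventLog(path: String, lastModified: Long, completed: Boolean)

object CleanerRule {
  // Before this PR: being older than maxAge alone made a log deletable.
  // With this PR: an in-progress (not completed) log is kept even if expired.
  def shouldDelete(log: EventLog, now: Long, maxAgeMs: Long): Boolean = {
    val expired = now - log.lastModified > maxAgeMs
    expired && log.completed
  }
}
```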
Why are the changes needed?
To keep the Spark History Server from deleting the event logs of applications that are still running.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test
Was this patch authored or co-authored using generative AI tooling?
Thank you for making a PR, @bluzy. Could you elaborate on the problem, specifically the "inprogress logs created before expiration date"? What is your configuration for spark.history.fs.cleaner.maxAge? You wrote:

> Spark History Server deletes inprogress logs created before expiration date even if it's still running.
@dongjoon-hyun
spark.history.fs.cleaner.maxAge is 14d, and my Spark Streaming application was started 14 days ago.
The Spark History Server deleted the streaming app's log file based on its modification time, and then an error like this occurred:
```
java.io.FileNotFoundException: File does not exist: /user/pdd-dev/log/spark/application_1711009858797_276181.inprogress (inode 230295719) Holder DFSClient_NONMAPREDUCE_-582659247_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2890)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:605)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:172)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2769)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
```
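For reference, these are the History Server cleaner settings involved in this report. The maxAge value is the one stated above; the enabled and interval values are assumptions at their documented behavior (the cleaner must be explicitly enabled, and the check interval defaults to 1d):

```
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    14d
```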
+CC @HeartSaVioR
Thanks, @bluzy. I understand your situation. Why don't you increase spark.history.fs.cleaner.maxAge?
Although I understand it's controversial, spark.history.fs.cleaner.maxAge has been defined as follows since Apache Spark 1.4.0:
- https://spark.apache.org/docs/latest/monitoring.html
First, they're deleted if they're older than spark.history.fs.cleaner.maxAge.
Note that when the driver crashes, the event file remains with the .inprogress suffix.
Not deleting these files would result in filling up the event directory, and eventually failing all jobs (depending on the filesystem's limits on the number of files per directory).
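To make this trade-off concrete, reusing the hypothetical EventLog/CleanerRule sketch from the description above (not Spark's code): a log left behind by a crashed driver stays in-progress forever, so an age-AND-completed rule never reclaims it.

```scala
object CrashedDriverExample {
  // A crashed driver leaves its event log permanently in-progress
  // (completed = false), so the age-AND-completed rule never deletes it,
  // and such files accumulate in the event directory.
  val crashed = EventLog("application_1711009858797_276181.inprogress",
    lastModified = 0L, completed = false)

  // False no matter how old the file gets:
  val deletable = CleanerRule.shouldDelete(
    crashed, now = Long.MaxValue, maxAgeMs = 14L * 24 * 60 * 60 * 1000)
}
```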
Yes, Mridul's comment is correct. I believe the AS-IS behavior is robust, safe, and intended, rather than a bug.
WDYT, @bluzy?
@dongjoon-hyun @mridulm
I think an incorrect in-progress file would be deleted on the cleaner's schedule, wouldn't it?
My concern is that many Spark Streaming applications can live forever until they need to be upgraded. It's difficult to set a spark.history.fs.cleaner.maxAge value while a streaming job is running.
PTAL @dongjoon-hyun @mridulm
reader.completed is checking for the IN_PROGRESS suffix, which will be the case here, @bluzy: so with this PR, it will not clean up those files.
(Some users/deployments have out-of-band workflows that rename the file to remove the IN_PROGRESS suffix, but those are not part of Apache Spark itself.)
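For context on reader.completed: in-progress detection is suffix-based, so an event log counts as completed once the .inprogress suffix is gone (after Spark renames the file on a clean stop, or after such an out-of-band workflow). A minimal illustration; the helper below is hypothetical, though the suffix matches Spark's convention:

```scala
object CompletedCheck {
  // Spark writes a running application's event log with this suffix and
  // renames the file when the application stops cleanly.
  val InProgressSuffix = ".inprogress"

  def isCompleted(path: String): Boolean = !path.endsWith(InProgressSuffix)
}

// CompletedCheck.isCompleted("application_123.inprogress")  // false
// CompletedCheck.isCompleted("application_123")             // true
```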