[Bug] A more elegant way to delete files is needed

Open rickyma opened this issue 1 year ago • 4 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

We need a more elegant way to delete files, rather than always deleting them from the local disk first and then from HDFS.

Affects Version(s)

master

Uniffle Server Log Output

[2024-06-07 21:33:45.064] [checkResource-0] [WARN] ShuffleTaskManager.preAllocatedBufferCheck - Remove expired preAllocatedBuffer[id=8311808] that required by app: application_1703049085550_12962744_1717766212505
[2024-06-07 21:33:45.064] [expiredAppCleaner-0] [INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[application_1703049085550_12962744_1717766212505] according to rss.server.app.expired.withoutHeartbeat
[2024-06-07 21:33:45.065] [clearResourceThread] [INFO] ShuffleTaskManager.removeResources - Start remove resource for appId[application_1703049085550_12962744_1717766212505]
[2024-06-07 21:33:45.268] [clearResourceThread] [INFO] HybridStorageManager.removeResources - Start to remove resource of AppPurgeEvent{appId='application_1703049085550_12962744_1717766212505', user='aaa', shuffleIds=[0]}
[2024-06-07 21:33:45.269] [clearResourceThread] [INFO] LocalStorageManager.cleanupStorageSelectionCache - Cleaning the storage selection cache costs: 1(ms) for event: AppPurgeEvent{appId='application_1703049085550_12962744_1717766212505', user='aaa', shuffleIds=[0]}
[2024-06-07 21:33:45.269] [clearResourceThread] [INFO] LocalStorage.removeResources - Start to remove resource of application_1703049085550_12962744_1717766212505/0
[2024-06-07 21:33:45.269] [clearResourceThread] [INFO] LocalStorage.removeResources - Finish remove resource of application_1703049085550_12962744_1717766212505/0, disk size is 0 and 0 shuffle metadata
[2024-06-07 21:33:54.505] [clearResourceThread] [INFO] LocalFileDeleteHandler.delete - Delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with /data1/rssdata/application_1703049085550_12962744_1717766212505 cost 9236 ms
[2024-06-07 21:33:54.505] [clearResourceThread] [INFO] HadoopShuffleDeleteHandler.delete - Try delete shuffle data in Hadoop FS for appId[application_1703049085550_12962744_1717766212505] of user[aaa] with hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505
[2024-06-07 21:33:54.600] [clearResourceThread] [WARN] HadoopShuffleDeleteHandler.delete - Can't delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with 1 times
java.io.FileNotFoundException: File hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:120)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1060)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:101)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:61)
        at org.apache.uniffle.server.storage.HadoopStorageManager.removeResources(HadoopStorageManager.java:125)
        at org.apache.uniffle.server.storage.HybridStorageManager.removeResources(HybridStorageManager.java:162)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:775)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at java.lang.Thread.run(Thread.java:750)
[2024-06-07 21:33:55.636] [clearResourceThread] [WARN] HadoopShuffleDeleteHandler.delete - Can't delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with 2 times
java.io.FileNotFoundException: File hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:120)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1060)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:101)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:61)
        at org.apache.uniffle.server.storage.HadoopStorageManager.removeResources(HadoopStorageManager.java:125)
        at org.apache.uniffle.server.storage.HybridStorageManager.removeResources(HybridStorageManager.java:162)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:775)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at java.lang.Thread.run(Thread.java:750)
[2024-06-07 21:33:56.672] [clearResourceThread] [WARN] HadoopShuffleDeleteHandler.delete - Can't delete shuffle data for appId[application_1703049085550_12962744_1717766212505] with 3 times
java.io.FileNotFoundException: File hdfs://xxx/rss/online/application_1703049085550_12962744_1717766212505 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:120)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
        at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1060)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:101)
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleDeleteHandler.delete(HadoopShuffleDeleteHandler.java:61)
        at org.apache.uniffle.server.storage.HadoopStorageManager.removeResources(HadoopStorageManager.java:125)
        at org.apache.uniffle.server.storage.HybridStorageManager.removeResources(HybridStorageManager.java:162)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:775)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at java.lang.Thread.run(Thread.java:750)

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

rickyma, Jun 07 '24 13:06

Can you elaborate on the issue a bit more, please? What is the current behaviour, and what is not elegant about it?

EnricoMi, Jun 10 '24 07:06

When cleaning up expired resources, regardless of whether the data is on HDFS or on a local disk, we always do the following in HybridStorageManager.removeResources, which is why an exception is thrown:

public void removeResources(PurgeEvent event) {
  LOG.info("Start to remove resource of {}", event);
  warmStorageManager.removeResources(event);
  coldStorageManager.removeResources(event);
}

rickyma, Jun 11 '24 02:06

So you are saying we should not attempt to delete from a storage if the data is not stored there? That means we need to keep track of where the data resides.

EnricoMi, Jun 12 '24 05:06

Yeah, it's better this way, so we can get rid of a lot of meaningless WARN logs.

rickyma, Jun 12 '24 06:06
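
The idea discussed in this thread (only delete from a storage that actually holds data for the app) could be sketched roughly as follows. This is a minimal illustration, not Uniffle's actual code: `StorageTier`, `TierPurger`, `TrackedPurge` and `recordWrite` are hypothetical names, and the sketch assumes the write path can report which tier each app wrote to.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// All names below are hypothetical and only illustrate the idea from the thread.
enum StorageTier { WARM, COLD }

/** Stand-in for a storage manager that can purge one application's data. */
interface TierPurger {
  void purge(String appId);
}

/** Remembers which tiers an app has written to and purges only those tiers. */
class TrackedPurge {
  private final Map<String, Set<StorageTier>> tiersByApp = new ConcurrentHashMap<>();
  private final Map<StorageTier, TierPurger> purgers;

  TrackedPurge(Map<StorageTier, TierPurger> purgers) {
    this.purgers = purgers;
  }

  /** Assumed to be called on the write path whenever data for appId lands on a tier. */
  void recordWrite(String appId, StorageTier tier) {
    tiersByApp.computeIfAbsent(appId, k -> ConcurrentHashMap.newKeySet()).add(tier);
  }

  /** Purges the app from the tiers it used and returns the tiers that were purged. */
  Set<StorageTier> removeResources(String appId) {
    Set<StorageTier> purged = EnumSet.noneOf(StorageTier.class);
    Set<StorageTier> used = tiersByApp.remove(appId);
    if (used == null) {
      return purged; // nothing was recorded for this app, so no storage is touched
    }
    for (StorageTier tier : used) {
      TierPurger purger = purgers.get(tier);
      if (purger != null) {
        purger.purge(appId);
        purged.add(tier);
      }
    }
    return purged;
  }
}
```

With this shape, an app that only ever wrote to the warm (local) tier would never trigger a cold (HDFS) delete, so the repeated FileNotFoundException warnings shown in the log would not be produced for it. The trade-off is the extra bookkeeping EnricoMi mentions: the record of where data resides has to be kept consistent with the write path.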