
`datasets/downloads` cleanup tool

Open stas00 opened this issue 1 year ago • 0 comments

Feature request

Splitting this off from https://github.com/huggingface/huggingface_hub/issues/1997: currently `huggingface-cli delete-cache` doesn't clean up the temp files created by `datasets`.

E.g. I discovered millions of files under the `datasets/downloads` cache and had to run:

sudo find /data/huggingface/datasets/downloads -type f -mtime +3 -exec rm {} \+
sudo find /data/huggingface/datasets/downloads -type d -empty -delete

Could this cleanup be integrated into `huggingface-cli`, or could a separate tool be provided, to keep the folders tidy so they don't consume inodes and disk space?
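To make the request concrete, here is a minimal Python sketch of what such a cleanup could look like. It roughly mirrors the two `find` commands above and is not an existing `huggingface-cli` or `datasets` API: the function name `clean_downloads` and its parameters are hypothetical, the age check is only an approximation of `find -mtime +3`, and the root directory itself is never removed.

```python
import os
import stat
import time
from pathlib import Path


def clean_downloads(root, max_age_days=3, dry_run=True):
    """Hypothetical helper: delete old regular files under `root`,
    then prune sub-directories that end up empty.

    Returns (files, dirs): the files selected for deletion and the
    directories actually removed (always empty in dry-run mode).
    """
    root = Path(root)
    cutoff = time.time() - max_age_days * 86400
    files, dirs = [], []

    # Walk bottom-up so a directory is visited after its children.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()
            except FileNotFoundError:
                continue  # removed by another process in the meantime
            # Like `find -type f`: regular files only, symlinks are skipped.
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                files.append(path)
                if not dry_run:
                    path.unlink()

        current = Path(dirpath)
        # Like `find -type d -empty -delete`, except the root is kept.
        if not dry_run and current != root and not any(current.iterdir()):
            current.rmdir()
            dirs.append(current)

    return files, dirs
```

With the default `dry_run=True` nothing is deleted and the returned list only shows which files would go, e.g. `clean_downloads("/data/huggingface/datasets/downloads")`. Whether those files are actually safe to delete is exactly the open question in this issue.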

E.g. there were tens of thousands of `.lock` files, and I don't know why they never get removed. IMHO a lock file should be temporary: it should exist for the duration of the operation requiring the lock and not remain after the operation has finished.
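For anyone who wants to check how many leftover lock files they have, a small sketch along these lines could help. The function name `summarize_lock_files` and the 24-hour threshold are assumptions for illustration. Age alone does not prove a lock is unused, so the result is only a hint.

```python
import stat
import time
from pathlib import Path


def summarize_lock_files(root, stale_after_hours=24.0):
    """Hypothetical helper: count `*.lock` files under `root` and how many
    of them were last modified more than `stale_after_hours` ago."""
    now = time.time()
    total = 0
    stale = 0
    for path in Path(root).rglob("*.lock"):
        try:
            st = path.lstat()
        except FileNotFoundError:
            continue  # the lock was released while we were scanning
        if not stat.S_ISREG(st.st_mode):
            continue  # ignore directories or symlinks named *.lock
        total += 1
        if now - st.st_mtime > stale_after_hours * 3600:
            stale += 1
    return {"total": total, "stale": stale}
```

Running it on the downloads directory gives a quick read on whether lock files are accumulating over time rather than being cleaned up after each operation.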

Also, I think one should be able to nuke `datasets/downloads` without hurting the cache, but I think some datasets rely on files extracted under this dir, or at least they did in the past. That makes it very difficult to manage, since one has no idea what is safe to delete and what is not.

Thank you

@Wauplin (requested to be tagged)

stas00, Jan 24 '24 18:01