huggingface_hub icon indicating copy to clipboard operation
huggingface_hub copied to clipboard

`huggingface-cli delete-cache --disable-tui` improvements

Open stas00 opened this issue 1 year ago • 13 comments

I tried huggingface-cli delete-cache --disable-tui for the first time. Great intention, very problematic usage when one had thousands of hub objects to cleanup. Once I understood its quirks I was able to hack around those problems.

  1. Confusing UX
huggingface-cli delete-cache --disable-tui
TUI is disabled. In order to select which revisions you want to delete, please edit
the following file using the text editor of your choice. Instructions for manual
editing are located at the beginning of the file. Edit the file, save it and confirm
to continue.
File to edit: /tmp/tmpundr7lky.txt
0 revisions selected counting for 0.0. Continue ? (y/N)

The doc says to hit y but the program exits if y is hit and all careful manual editing is lost and the user has to start from scratch (ouch!)

I suspect this is a bug or a problem in the workflow.

  1. Please don't use a true temp file, use a file that won't get deleted and a user can re-use it should they hit the wrong button - see Issue 1 above as an example.

  2. sorting to have main last consistently would help. e.g. the first few entries had a consistent-main-last listing as in:

# Model mistralai/Mistral-7B-v0.1 (29.5G, used 19 hours ago)
    5e9c98b96d071dce59368012254c55b0ec6f8658 # Refs: (detached) # modified 3 months ago
    17688039ea7b7001c702797eed2ab7716a0cc3c2 # Refs: (detached) # modified 6 weeks ago
#    26bca36bde8333b5d7f72e9ed20ccda6a618af24 # Refs: main # modified 6 weeks ago

so I started to manually uncomment lines thinking main is always last, w/o paying close attention, but luckily I caught this was inconsistent as then I run into:

# Model mistralai/Mixtral-8x7B-v0.1 (93.4G, used 10 hours ago)
#    985aa055896a8f943d4a9f2572e6ea1341823841 # Refs: main # modified 5 weeks ago
    58301445dc1378584211722b7ebf8743ec4e192b # Refs: (detached) # modified 5 weeks ago

and many other variations. For those who need to edit hundreds of these, it'd be great to have main first or last - probably actually first would be the easiest.

  1. give a user a way to delete all non-main revisions in one go

I tried to do it manually and it was super slow and I was concerned my edits will get lost again if I hit Y instead of the confusing N (see Issue 1)

at the end I resorted to this hack:

cp /tmp/tmpedbz00ox.txt cache.txt
perl -pi -e 's|^#(.*detached.*)|$1|' cache.txt
cat cache.txt >>  /tmp/tmpundr7lky.txt

and hit N, Y, Y

so I wiped out hundreds of old revisions in a second w/o manual editing.

This is usually what users want - keep the main, get rid of old revs - would it be possible to create such option?

Thank you!

stas00 avatar Jan 23 '24 01:01 stas00

Thanks for the great feedback @stas00! I haven't looked into this for a long time but all suggestions seems to make sense. Will keep you updated when I start working on that :)

Wauplin avatar Jan 23 '24 14:01 Wauplin

Super! The foundation is awesome, @Wauplin - just needs a bit of polish on top.

The other problem this tool isn't take care of is cleaning downloads/ - I checked and we had 2M of files there! I had to do:

sudo find /data/huggingface/datasets/downloads -type f -mtime +3 -exec rm {} \+
sudo find /data/huggingface/datasets/downloads -type d -empty -delete

I'm not sure if huggingface-cli could take care of datasets as well, since they come from the hub. Please let me know if I should open a separate issue, since it's related but not the same.

stas00 avatar Jan 23 '24 19:01 stas00

@stas00 datasets cache is another topic! At the moment datasets doesn't use the default cache shared between huggingface_hub, transformers, diffusers, etc. We will have to fix that first before providing a CLI to clean this cache too (cc @lhoestq for visibility).

Wauplin avatar Jan 24 '24 08:01 Wauplin

Do you want me to open an Issue here or on the datasets repo?

stas00 avatar Jan 24 '24 18:01 stas00

Better for the datasets repo but please tag me. Thanks!

Wauplin avatar Jan 24 '24 18:01 Wauplin

Done: https://github.com/huggingface/datasets/issues/6614

stas00 avatar Jan 24 '24 18:01 stas00