[Feedback welcome] CLI to upload arbitrary huge folder

Open Wauplin opened this issue 10 months ago • 18 comments

What for?

Upload arbitrarily large folders in a single command line!

⚠️ This tool is still experimental and is meant for power users. Expect some rough edges in the process. Feedback and bug reports would be very much appreciated ❤️

How to use it?

Install

pip install git+https://github.com/huggingface/huggingface_hub@large-upload-cli

Upload folder

huggingface-cli large-upload <repo-id> <local-path>

Every minute a report is printed to the terminal with the current status. Apart from that, progress bars and errors are still displayed.

Large upload status:
  Progress:
    104/104 hashed files (22.5G/22.5G)
    0/42 preuploaded LFS files (0.0/22.5G) (+4 files with unknown upload mode yet)
    58/104 committed files (24.9M/22.5G)
    (0 gitignored files)
  Jobs:
    sha256: 0 workers (0 items in queue)
    get_upload_mode: 0 workers (4 items in queue)
    preupload_lfs: 6 workers (36 items in queue)
    commit: 0 workers (0 items in queue)
  Elapsed time: 0:00:00
  Current time: 2024-04-26 16:24:25

Run huggingface-cli large-upload --help to see all options.

What does it do?

This CLI is intended to upload arbitrarily large folders in a single command (a minimal sketch of the worker/queue structure follows the list):

  • the process is split into 4 steps: hash, get upload mode, LFS upload, commit
  • retry on error at each step
  • multi-threaded: workers are managed with queues
  • resumable: if the process is interrupted, you can re-run it; only partially uploaded files are lost
  • files are hashed only once
  • starts uploading files while other files are still being hashed
  • commits at most 50 files at a time
  • prevents concurrent commits
  • avoids hitting rate limits as much as possible
  • avoids small commits
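
To make the step/queue structure concrete, here is a minimal sketch of such a pipeline. All names and the retry policy here are illustrative assumptions, not the actual implementation from the PR:

import queue
import threading

STEPS = ["sha256", "get_upload_mode", "preupload_lfs", "commit"]
queues = {name: queue.Queue() for name in STEPS}

def process(step: str, item: str) -> None:
    """Hypothetical per-step handler (hash, query upload mode, upload, commit)."""
    print(f"{step}: {item}")

def worker(step: str) -> None:
    while True:
        item = queues[step].get()
        if item is None:  # sentinel: no more work for this step
            queues[step].task_done()
            break
        try:
            process(step, item)
        except Exception:
            queues[step].put(item)  # naive retry by re-enqueueing; a real tool would cap retries
        finally:
            queues[step].task_done()

# Start a few workers for the hashing step, feed them, then signal shutdown.
threads = [threading.Thread(target=worker, args=("sha256",)) for _ in range(4)]
for t in threads:
    t.start()
for name in ["a.bin", "b.bin", "c.bin"]:
    queues["sha256"].put(name)
for _ in threads:
    queues["sha256"].put(None)  # one sentinel per worker
for t in threads:
    t.join()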

A .huggingface/ folder will be created at the root of your folder to keep track of the progress. Please do not modify these files manually. If you think this folder got corrupted, please report it here, delete the .huggingface/ folder entirely, and then restart your command. Some intermediate steps will be lost, but the upload process should be able to resume correctly.

Known limitations

  • cannot set a path_in_repo => files are always uploaded to the root of the repo. If you want to upload to a subfolder, you need to set up the proper structure locally (see the sketch after this list).
  • not optimized for hf_transfer (though it works) => better to set --num-workers to 2, otherwise the CPU will be saturated
  • cannot delete files on the repo while uploading the folder
  • cannot set a commit message/description
  • cannot create a PR by itself => you must first create a PR manually, then provide its revision
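
For the first limitation, a possible workaround is to mirror the target repo layout in a local staging folder before running the upload, as in this sketch (all paths and the parquet glob are placeholders):

import shutil
from pathlib import Path

src = Path("data")             # files you want under data/ in the repo
staging = Path("upload_root")  # folder you will pass to large-upload
(staging / "data").mkdir(parents=True, exist_ok=True)

for f in src.glob("*.parquet"):
    shutil.copy2(f, staging / "data" / f.name)

# then: huggingface-cli large-upload <repo-id> upload_root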

What to review?

Nothing yet.

For now the goal is to gather as much feedback as possible. If it proves successful, I will clean up the implementation and make it more production-ready. Also, this PR is built on top of https://github.com/huggingface/huggingface_hub/pull/2223, which is not merged yet and makes the diff very long.

For the curious, here is the logic that decides what the next task to perform should be.
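
Purely as an illustration of what such a policy could look like (the real decision logic lives in the PR and is more involved, e.g. handling rate limits and commit batching):

from queue import Queue
from typing import Dict, Optional

# The priority order is an assumption: finish commits before starting new hashes.
PRIORITY = ["commit", "preupload_lfs", "get_upload_mode", "sha256"]

def next_task(queues: Dict[str, Queue]) -> Optional[str]:
    """Return the step a free worker should pick up next, if any."""
    for name in PRIORITY:
        if not queues[name].empty():
            return name
    return None  # nothing to do right now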

Wauplin avatar Apr 26 '24 14:04 Wauplin

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Feedback so far:

  • [x] when the connection is slow, it's better to reduce the number of workers. Should we do that automatically, or just print a message? Reducing the number of workers might not speed up the upload, but at least fewer files are uploaded in parallel => fewer chances to lose progress in case of a failed upload. EDIT: added to docstring
  • [x] terminal output is too verbose. Might be good to disable individual progress bars?
  • [x] terminal output is awful in a Jupyter notebook => how can we make it friendlier? (printing a report every minute ends up with very long logs)
  • [x] a CTRL+C (or at most two CTRL+C) must stop the process. It's not the case at the moment due to all the try/except blocks (see the sketch below).
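
For context on the CTRL+C item, the usual fix is to catch Exception rather than using a bare except, so that KeyboardInterrupt (a BaseException) propagates and stops the process. A minimal sketch, with do_work as a hypothetical stand-in for one upload step:

import time

def do_work() -> None:
    time.sleep(1)  # stand-in for one upload step

while True:
    try:
        do_work()
    except Exception as exc:  # excludes KeyboardInterrupt, so CTRL+C still exits
        print(f"retrying after error: {exc}")
    else:
        break  # step succeeded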

EDIT:

  • [x] should print a warning when uploading parquet/arrow files to a model repository. It is not possible to convert a model repo to a dataset repo afterwards, so better to be sure.
  • [x] require an explicit --repo-type instead of defaulting to model => adds little friction and prevents a potential re-upload (addressed in https://github.com/huggingface/huggingface_hub/pull/2254/commits/ba7f248ff2ca328f7e86ad84d060b07941e14e75)

EDIT:

  • [x] might create some empty commits in some cases (if files were already committed). Bad UX when resuming.

Wauplin avatar May 03 '24 07:05 Wauplin

EDIT (from @Wauplin): this comment has been addressed in https://github.com/huggingface/huggingface_hub/pull/2254/commits/ba7f248ff2ca328f7e86ad84d060b07941e14e75


IMO, it would make sense for this not to default to uploading to a model repo, i.e. to require this:

huggingface-cli large-upload <repo-id> <local-path> --repo-type dataset

If a user runs:

huggingface-cli large-upload <repo-id> <local-path>

they should get an error along the lines of "Please specify the repo type you want to use"

Quite a few people using this tool have accidentally uploaded a dataset to a model repo, and currently, it's not easy to move this to a dataset repo.

I know that many of the huggingface_hub methods/functions default to model repos, but I think that doesn't make sense in this case since:

  • it's at least as likely to be used for uploading datasets as for model weights
  • since the goal is to support large uploads, the cost of getting it wrong is quite high for the user

davanstrien avatar May 15 '24 10:05 davanstrien

ah, I rather agree with @davanstrien here

julien-c avatar May 15 '24 10:05 julien-c

Can the parameters of "large-upload" be aligned with those of "upload"? huggingface-cli large-upload [repo_id] [local_path]

wanng-ide avatar May 16 '24 13:05 wanng-ide

@wanng-ide Agree we should aim for consistency, yes. Which parameters/options would you specifically change?

So far we have:

$ huggingface-cli large-upload --help
usage: huggingface-cli <command> [<args>] large-upload [-h] [--repo-type {model,dataset,space}]
                                                       [--revision REVISION] [--private]
                                                       [--include [INCLUDE ...]] [--exclude [EXCLUDE ...]]
                                                       [--token TOKEN] [--num-workers NUM_WORKERS]
                                                       repo_id local_path
$ huggingface-cli upload --help 
usage: huggingface-cli <command> [<args>] upload [-h] [--repo-type {model,dataset,space}]
                                                 [--revision REVISION] [--private] [--include [INCLUDE ...]]
                                                 [--exclude [EXCLUDE ...]] [--delete [DELETE ...]]
                                                 [--commit-message COMMIT_MESSAGE]
                                                 [--commit-description COMMIT_DESCRIPTION] [--create-pr]
                                                 [--every EVERY] [--token TOKEN] [--quiet]
                                                 repo_id [local_path] [path_in_repo]

Wauplin avatar May 22 '24 12:05 Wauplin

> @wanng-ide Agree we should aim for consistency, yes. Which parameters/options would you specifically change?

what about: huggingface-cli large-upload [local_path] [path_in_repo] ADD [path_in_repo]

wanng-ide avatar May 22 '24 13:05 wanng-ide

I'm not sure I understand the purpose of the ADD keyword

Wauplin avatar May 22 '24 14:05 Wauplin

Will this only be a CLI, or also a Python function? I liked the Python API for upload_folder. It's convenient for automating the upload of many datasets in Python rather than bash.

rom1504 avatar May 29 '24 18:05 rom1504

> Will this only be a CLI, or also a Python function?

Yes, that's the goal. At the moment, it is defined as a standalone method large_upload() (see here). In the final version, we will probably add it to the HfApi client.
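
As a rough usage sketch (the import path matches the traceback later in this thread; the keyword arguments are assumed from the CLI options above and may differ in the PR):

from huggingface_hub.large_upload import large_upload

large_upload(
    repo_id="username/my-large-dataset",  # placeholder repo
    local_path="path/to/folder",
    repo_type="dataset",
    num_workers=4,
)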

Wauplin avatar May 30 '24 11:05 Wauplin

I'm using it to upload a few 300GB datasets. The standard upload function was taking more than 30 minutes just to hash the files and then crashed halfway through the upload. This seems to be working much better.

rom1504 avatar Jun 01 '24 08:06 rom1504

Ok I got one more piece of feedback actually... Looks like this tool is too fast :)

It seems to be killing my box for a few hours (after uploading at 80MB/s for a few hours). I don't really get how that's possible yet.

What would you advise to reduce the speed a bit / reduce the number of simultaneous connections?

rom1504 avatar Jun 08 '24 21:06 rom1504

Wow, this is an unexpected problem :smile: I can think of two ways of reducing the upload speed:

  1. don't use hf_transfer if you were previously using it
  2. set --num-workers=1 (or 2/3) to reduce the number of workers uploading files in parallel. However, there is currently no way to throttle the connection from huggingface_hub directly. There is a separate issue for that (see https://github.com/huggingface/huggingface_hub/issues/2118#issuecomment-2157587132) but I don't think we'll ever work on such a feature. You could set this up with a proxy, I believe, though it's quite hacky.

Wauplin avatar Jun 10 '24 09:06 Wauplin

@Wauplin Just downloaded and ran it; got the following error:

> huggingface-cli large-upload  hoverinc/mydataset_test data

  File "/home/dmytromishkin/miniconda3/envs/pytorch/bin/huggingface-cli", line 5, in <module>
    from huggingface_hub.commands.huggingface_cli import main
  File "/home/dmytromishkin/big_storage/huggingface_hub/src/huggingface_hub/commands/huggingface_cli.py", line 21, in <module>
    from huggingface_hub.commands.large_upload import LargeUploadCommand
  File "/home/dmytromishkin/big_storage/huggingface_hub/src/huggingface_hub/commands/large_upload.py", line 29, in <module>
    from huggingface_hub.large_upload import large_upload
  File "/home/dmytromishkin/big_storage/huggingface_hub/src/huggingface_hub/large_upload.py", line 513, in <module>
    def _get_one(queue: queue.Queue[JOB_ITEM_T]) -> List[JOB_ITEM_T]:
TypeError: 'type' object is not subscriptable

ducha-aiki avatar Jun 10 '24 12:06 ducha-aiki

@ducha-aiki Which Python version are you using? Could you try upgrading to 3.10 and let me know if it still happens? I suspect queue.Queue[JOB_ITEM_T] is not allowed in Python 3.8.
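
For context: subscripting queue.Queue at runtime only works on Python 3.9+ (PEP 585). A minimal illustration of the failure and the two standard fixes:

import queue

# Fails at import time on Python 3.8: 'type' object is not subscriptable
# def _get_one(q: queue.Queue[str]) -> str: ...

# Fix 1: quote the annotation so it is never evaluated at runtime.
def _get_one(q: "queue.Queue[str]") -> str:
    return q.get()

# Fix 2: put `from __future__ import annotations` at the top of the module,
# which makes all annotations lazily evaluated (available since Python 3.7).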

Wauplin avatar Jun 10 '24 12:06 Wauplin

Oh no, my favorite 2-year-old environment... you got me, that was Python 3.8.12. Trying on 3.10 now: no error, but also nothing got uploaded... the repo is either not created (when tried without creating the repo) or empty (when I tried creating the repo first).

INFO:huggingface_hub.large_upload:

##########
Large upload status:
  Progress:
    57/57 hashed files (15.5M/15.5M)
    57/57 preuploaded LFS files (15.5M/15.5M)
    57/57 committed files (15.5M/15.5M)
    (0 gitignored files)
  Jobs:
    sha256: 0 workers (0 items in queue)
    get_upload_mode: 0 workers (0 items in queue)
    preupload_lfs: 0 workers (0 items in queue)
    commit: 0 workers (0 items in queue)
  Elapsed time: 0:01:00
  Current time: 2024-06-10 12:44:23
##########

The folder structure I am trying to upload:

data/*.parquet

ducha-aiki avatar Jun 10 '24 12:06 ducha-aiki

@ducha-aiki Are you sure nothing has been created on the Hub? Can you delete the .huggingface/ cache folder that should have been created in the local folder and retry?

Wauplin avatar Jun 10 '24 12:06 Wauplin

@Wauplin update: the files finally appeared now, although they claim to have been added 6 min ago. Anyway, everything works now, thank you :)


ducha-aiki avatar Jun 10 '24 12:06 ducha-aiki

Thanks to everyone who has tested this feature! :heart: The feedback has been very valuable for improving the user experience, and I think I've addressed most of it. I have added some documentation and cleaned up the implementation a bit. The PR is ready to be reviewed cc @LysandreJik @osanseviero

Test-wise I did not do much. What I can do is run a large upload and check that all files have been uploaded, but I'm reluctant to unit-test each part one by one. Let me know what you think :see_no_evil: (edit: added a basic test in https://github.com/huggingface/huggingface_hub/pull/2254/commits/84c65a58f8a9d5989ffb67a5f4c0c7d66bc5dc7e)

Let's finally ship this! :rocket:

Note: oh, and one last thing: I don't like the name large_upload (nor robust_upload). Any suggestions?

Wauplin avatar Jul 22 '24 15:07 Wauplin

I wonder if we should rename to upload_large_folder

osanseviero avatar Jul 25 '24 11:07 osanseviero

> I wonder if we should rename to upload_large_folder

Good idea! Renamed in https://github.com/huggingface/huggingface_hub/pull/2254/commits/0dc048ac85cc60b8b4082851b9d1bf010b3e768e

Wauplin avatar Jul 29 '24 15:07 Wauplin

@LysandreJik @osanseviero Thanks for your reviews! I've addressed all of your comments. I think we are in good shape to merge it now. Can I get a last approval? :pray: :hugs:

I renamed the command to huggingface-cli upload-large-folder and the method to HfApi.upload_large_folder, as suggested by @osanseviero. The failing tests are unrelated.

Wauplin avatar Jul 29 '24 16:07 Wauplin


The code snippet in the upload guide has a typo: it should say api.upload_large_folder, not api.upload_folder.

ArthurConmy avatar Jul 31 '24 09:07 ArthurConmy

Thanks for noticing and reporting @ArthurConmy! I just fixed it in https://github.com/huggingface/huggingface_hub/pull/2254/commits/ea765be18a1089d57a474e5eab59b4e7789e6a7c

Wauplin avatar Jul 31 '24 09:07 Wauplin

How do I specify the repo type? This command fails, and you didn't mention it in the first post :)

!huggingface-cli upload-large-folder "MonsterMMORPG/FLUX_Kohya_SS_Massive_Research_Part5" "/home/Ubuntu/apps/StableSwarmUI/Models/Lora"

print(".\n.\nUPLOAD COMPLETED")

FurkanGozukara avatar Aug 28 '24 11:08 FurkanGozukara

@FurkanGozukara you can use huggingface-cli upload-large-folder --help to learn how to use the CLI. To pass a repo type, add --repo-type=dataset, for instance.

Wauplin avatar Aug 28 '24 12:08 Wauplin

It prints too many messages too frequently while uploading; it even crashed my notebook.

Can we limit it to display the status on the same line, or make it less frequent and more compact?

It prints something like 100 messages every second.


FurkanGozukara avatar Aug 28 '24 13:08 FurkanGozukara

It says the upload completed (I ran it three times), but there are no files in the repo.

You are about to upload a large folder to the Hub using `huggingface-cli upload-large-folder`. This is a new feature so feedback is very welcome!

A few things to keep in mind:
  - Repository limits still apply: https://huggingface.co/docs/hub/repositories-recommendations
  - Do not start several processes in parallel.
  - You can interrupt and resume the process at any time. The script will pick up where it left off except for partially uploaded files that would have to be entirely reuploaded.
  - Do not upload the same folder to several repositories. If you need to do so, you must delete the `./.cache/huggingface/` folder first.

Some temporary metadata will be stored under `/home/Ubuntu/apps/StableSwarmUI/Models/Lora/.cache/huggingface`.
  - You must not modify those files manually.
  - You must not delete the `./.cache/huggingface/` folder while a process is running.
  - You can delete the `./.cache/huggingface/` folder to reinitialize the upload state when process is not running. Files will have to be hashed and preuploaded again, except for already committed files.

For more details, run `huggingface-cli upload-large-folder --help` or check the documentation at https://huggingface.co/docs/huggingface_hub/guides/upload#upload-a-large-folder.
Repo created: https://huggingface.co/datasets/MonsterMMORPG/FLUX_Kohya_SS_Massive_Research_Part5
Found 181 candidate files to upload
Recovering from metadata files: 100%|████████| 181/181 [00:00<00:00, 564.82it/s]
All files have been processed! Exiting worker.
All files have been processed! Exiting worker.
[...the same line repeated 60 times in total, once per exiting worker...]

---------- 2024-08-28 17:53:59 (0:00:00) ----------
Files:   hashed 181/181 (371.5G/371.5G) | pre-uploaded: 161/161 (371.5G/371.5G) | committed: 181/181 (371.5G/371.5G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------

---------- 2024-08-28 17:54:00 (0:00:01) ----------
Files:   hashed 181/181 (371.5G/371.5G) | pre-uploaded: 161/161 (371.5G/371.5G) | committed: 181/181 (371.5G/371.5G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
INFO:huggingface_hub._upload_large_folder:
---------- 2024-08-28 17:54:00 (0:00:01) ----------
Files:   hashed 181/181 (371.5G/371.5G) | pre-uploaded: 161/161 (371.5G/371.5G) | committed: 181/181 (371.5G/371.5G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
.
.
UPLOAD COMPLETED

FurkanGozukara avatar Aug 28 '24 17:08 FurkanGozukara

> Can we limit it to display the status on the same line, or make it less frequent and more compact?

@FurkanGozukara yes, you can do that by passing --no-bars and --no-reports on the command line. I have updated the command to show this more prominently.

Wauplin avatar Aug 29 '24 13:08 Wauplin

> It says the upload completed (I ran it three times), but there are no files in the repo.

Have you tried to upload the same folder to several repositories? If so, only the first upload will be correct. As mentioned in the little help section:

  • You can delete the ./.cache/huggingface/ folder to reinitialize the upload state when process is not running. Files will have to be hashed and preuploaded again, except for already committed files.

I suspect that your local metadata says the files are already uploaded. You can delete it and rerun the command.
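
A sketch of that reset (the path comes from the log above; make sure no upload process is running first):

import shutil
from pathlib import Path

cache = Path("/home/Ubuntu/apps/StableSwarmUI/Models/Lora/.cache/huggingface")
if cache.exists():
    shutil.rmtree(cache)  # reinitializes upload state; already committed files stay on the Hub

# then re-run: huggingface-cli upload-large-folder <repo-id> <local-path> --repo-type=dataset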

Wauplin avatar Aug 29 '24 13:08 Wauplin