huggingface_hub icon indicating copy to clipboard operation
huggingface_hub copied to clipboard

Concurrent downloads when using `snapshot_download`

Open Wauplin opened this issue 2 years ago • 2 comments

At the moment, snapshot_download is downloading every file 1 by 1 which is sub-optimal. Would be good to have concurrent downloads in order to make the most out of a good bandwidth. This can be done either with asyncio or multi-threading.

I made a very first experiment that I would not even call a draft that is pushed on the 1042-concurrent-download-in-snapshot-tests branch. See the diff comparison for an idea of it. I made it run on large datasets that have a lot of small files and it significantly increased the download speed.

For examples, see z-uo/male-LJSpeech-italian, oscar, chrisjay/crowd-speech-africa or from this list (internal link).

Wauplin avatar Sep 08 '22 14:09 Wauplin

That's a great idea!

A couple of notes:

  1. We should keep sync versions for compatibility reasons and maybe mark them as deprecated
  2. To prevent maintaining multiple versions of the same code, we could wrap and call asynchronous functions from sync versions(something like async_to_sync from django/asgiref)

If nobody is working on this, I will try to create a draft PR this week.

qweryty avatar Sep 19 '22 07:09 qweryty

Hi @qweryty, thanks for proposing you on this issue !

We should keep sync versions for compatibility reasons and maybe mark them as deprecated

I definitely agree with you on that ! I think we should even keep both sync and async methods everywhere without deprecation as a lot of users are only used to sync and it's really fine in a lot of cases. We could have a naming such as snapshot_download => async_snapshot_download

To prevent maintaining multiple versions of the same code, we could wrap and call asynchronous functions from sync versions

Yes, good idea as it would be a nightmare to test and maintain otherwise :D Let's keep in mind that we try as much as possible to avoid additional third-party dependency in huggingface_hub if a library can be replaced in a few lines of code -not sure here what's best, I haven't investigated asgiref-.

If nobody is working on this, I will try to create a draft PR this week.

Nobody's working on it :) In the diff comparison that I suggested in the description of this message I really went for the most straightforward approach just to see if it even made sense to invest time on making concurrent calls. I don't think it's worth it to start your work from this branch. A big change that asyncio brings is the transition from requests to httpx (see compatibility guide). Please let me know if you need any help or if you have any questions.

Wauplin avatar Sep 19 '22 08:09 Wauplin