huggingface_hub
huggingface_hub copied to clipboard
Concurrent downloads when using `snapshot_download`
At the moment, snapshot_download
is downloading every file 1 by 1 which is sub-optimal. Would be good to have concurrent downloads in order to make the most out of a good bandwidth. This can be done either with asyncio
or multi-threading.
I made a very first experiment that I would not even call a draft that is pushed on the 1042-concurrent-download-in-snapshot-tests
branch. See the diff comparison for an idea of it. I made it run on large datasets that have a lot of small files and it significantly increased the download speed.
For examples, see z-uo/male-LJSpeech-italian, oscar, chrisjay/crowd-speech-africa or from this list (internal link).
That's a great idea!
A couple of notes:
- We should keep sync versions for compatibility reasons and maybe mark them as deprecated
- To prevent maintaining multiple versions of the same code, we could wrap and call asynchronous functions from sync versions(something like async_to_sync from django/asgiref)
If nobody is working on this, I will try to create a draft PR this week.
Hi @qweryty, thanks for proposing you on this issue !
We should keep sync versions for compatibility reasons and maybe mark them as deprecated
I definitely agree with you on that ! I think we should even keep both sync
and async
methods everywhere without deprecation as a lot of users are only used to sync
and it's really fine in a lot of cases. We could have a naming such as snapshot_download
=> async_snapshot_download
To prevent maintaining multiple versions of the same code, we could wrap and call asynchronous functions from sync versions
Yes, good idea as it would be a nightmare to test and maintain otherwise :D Let's keep in mind that we try as much as possible to avoid additional third-party dependency in huggingface_hub
if a library can be replaced in a few lines of code -not sure here what's best, I haven't investigated asgiref
-.
If nobody is working on this, I will try to create a draft PR this week.
Nobody's working on it :) In the diff comparison that I suggested in the description of this message I really went for the most straightforward approach just to see if it even made sense to invest time on making concurrent calls. I don't think it's worth it to start your work from this branch.
A big change that asyncio
brings is the transition from requests
to httpx
(see compatibility guide). Please let me know if you need any help or if you have any questions.