hf_transfer download sometimes hangs without failing or succeeding
System Info
hf_transfer = 0.1.6
huggingface_hub = 0.21.4
Reproduction
We have not been able to reproduce it, and it is hard to give details: it happens very rarely, in a system that deletes the downloaded file after using it, so there is no real way to know what state it leaves the machine in. The behaviour is that we start a download and it simply never returns; the callback is never called either.
Eventually our system kills the process, and we have no visibility into it other than the logs it produces. The library produces no logs, only some from our code:
Downloading https://xxx.r2.cloudflarestorage.com/yyy to /tmp/file
and there it hangs
Expected behavior
It should either fail or succeed, but not hang
Thanks for this amazing project, by the way. The speed is just amazing! I wish I was able to give more hints, but it really shows us nothing.
This sometimes happens for me as well, on maybe 1% of downloads. It makes huggingface-cli freeze up, and I have to kill the process to retry.
I moved away from using this tool, and I sometimes still get slow downloads from Hugging Face. So maybe it is HF and not the tool.
Happened again. It froze with a file at 99% and locked up the whole session. I had to switch to another session and kill the process to retry, which then immediately worked.
Could be a lot of different things, will need more information to identify what the issue is. Is your network connectivity stable? Having a reproducible example would help a lot.
As far as I can tell, there is no way to reproduce it on purpose. Even trying the same file again right after it happens usually works totally fine.
I actually think it may be huggingface rate limiting, since I saw it happen for other tools too.
I just set a timeout for the download and retry
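That timeout-and-retry workaround can be sketched roughly like this (a minimal, illustrative version; `download_file` is a placeholder for whatever call actually performs the download, e.g. `hf_hub_download`):

```python
import concurrent.futures
import time


def download_with_timeout(download_file, timeout=300, max_attempts=3, url=None, dest=None):
    """Run download_file(url, dest) with a hard timeout, retrying on hang or error.

    Caveat: a timed-out attempt keeps running in its background thread, so for
    a truly stuck native call you would kill the whole process instead.
    """
    for attempt in range(1, max_attempts + 1):
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            future = executor.submit(download_file, url, dest)
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            print(f"attempt {attempt} timed out after {timeout}s")
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc!r}")
        finally:
            executor.shutdown(wait=False)
        if attempt < max_attempts:
            time.sleep(2 ** attempt)  # brief pause before the next attempt
    raise RuntimeError(f"download did not complete after {max_attempts} attempts")
```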
@chamini2 that would definitely explain the issue.
Would it? That seems like a bug. If HF has a rate limit, then an official tool from HF should either be aware of the rate limit and automatically stay under it, or detect the rate limit response and automatically retry.
Also, it wouldn't explain why the download stops at exactly 99%.
@aikitoria cf. the Disclaimer. hf_transfer was meant to be used internally at first, where rate limiting is not an issue.
If you want to reduce the number of requests, I would play with the following two settings:
```python
chunk_size=CHUNK_SIZE,
max_files=64,
```
max_files is the number of chunks downloaded concurrently, and chunk_size is the size of a chunk in bytes. Of course, 10 MiB seems like a sweet spot, but increasing that number will reduce the total number of calls required to download the file and thus reduce the chances of getting rate limited. Similarly, reducing the number of concurrent chunk downloads will delay hitting the rate limit.
Download speed will probably suffer a bit, but I don't think it'll be too bad.
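As a back-of-the-envelope illustration of why a larger chunk_size reduces rate-limit pressure (my arithmetic, assuming each chunk maps to one ranged request):

```python
import math


def num_chunk_requests(file_size_bytes, chunk_size_bytes):
    """Each chunk is fetched with one ranged request, so the request
    count is the ceiling of size / chunk_size."""
    return math.ceil(file_size_bytes / chunk_size_bytes)


GiB = 1024 ** 3
MiB = 1024 ** 2

# A 10 GiB file: 10 MiB chunks vs. 100 MiB chunks.
print(num_chunk_requests(10 * GiB, 10 * MiB))   # 1024 requests
print(num_chunk_requests(10 * GiB, 100 * MiB))  # 103 requests
```

Ten times fewer requests for the same file, at the cost of coarser progress reporting and more data to re-fetch if a single chunk fails.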
We could detect that we are getting a 429 response code and throttle ourselves, but I'm not sure this is really in scope for this project. cc @Narsil wdyt?
Also, it wouldn't explain why the download stops at exactly 99%.
It would make sense for the download to stall at higher percentages, given that a lot of requests are emitted at once (depending on the file size, though). As previously mentioned, we would need more information to debug this issue properly.
How can we collect this information, given that the issue seems to happen randomly and rarely? I still haven't found a trick to trigger it; it actually hasn't happened in a while.
Are there some verbose debug logs we can enable all the time - and then grab if it happens?
Sadly we don't have any tracing related to hf_transfer, you will have to add it yourself. Are you using the python bindings or rust functions directly?
If you add tracing to hf_transfer, you will also need to create a Subscriber using tracing-subscriber.
This way, you can set RUST_LOG=trace <command> <to> <run> <my> <download> for instance and you will see a lot of info about what is going on.
We could detect that we are getting a 429 response code and throttle ourselves, but I'm not sure this is really in scope for this project. cc @Narsil wdyt?
Definitely feels out of scope to me. (Also, what do you do, and how much do you throttle yourself? I'm not sure there are good answers here.)
I don't see how it could be out of scope. hf_transfer is the code that receives the rate limit response for a chunk, so the only reasonable option is to implement it there (you don't want to re-download the entire file!).
Usually a 429 includes metadata on when to retry. If not, infinite retries (until the user cancels) with exponential backoff always work.
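A sketch of that policy (the helper and its defaults are illustrative, not part of hf_transfer): honour Retry-After when the server sends it, otherwise fall back to capped exponential backoff with jitter.

```python
import random


def retry_delay(attempt, retry_after_header=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying a rate-limited (429) request.

    Prefer the server's Retry-After header (the delay-seconds form); otherwise
    use exponential backoff capped at `cap`, jittered to avoid thundering herds.
    """
    if retry_after_header is not None:
        try:
            return float(retry_after_header)  # e.g. "Retry-After: 7"
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```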
I don't see how it could be out of scope. hf_transfer is the code that receives the rate limit response for a chunk, so the only reasonable option is to implement it there (you don't want to re-download the entire file!).
Let's agree to disagree. Being rate limited means you've been abusing. Ceasing to hammer upstream is also a perfectly valid strategy (by crashing, or any other way). Sure, it feels bad when you're only missing a few chunks, but maybe you're missing the entire file and you should fix the code calling the download in the first place. Like maybe adding some kind of authentication to boost your rate limits, or using a different point of origin.
429 shouldn't be part of "normal" operations, and it should be up to clients to choose their strategy on how to fix, not this lib.
Being rate limited means you've been abusing
I don't understand how you came to this conclusion. All I'm doing is using huggingface-cli with hf_transfer installed to download a single model from Hugging Face, using the command shown in the documentation.
fix the code calling the download in the first place
But... you guys made this code! For your own site!
adding some kind of authentication to boost your rate limits
Is this a thing? It wasn't mentioned in the documentation. Does logging in using your hf token fix it? I'll have to try that.
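For what it's worth, Hub authentication on the wire is just a Bearer token header; a minimal sketch (the helper name is mine, not an hf_transfer API):

```python
def hf_auth_headers(token):
    """Build the authorization header the Hugging Face Hub expects."""
    return {"authorization": f"Bearer {token}"}
```

In practice you would not build this yourself: running `huggingface-cli login` (or setting the HF_TOKEN / HUGGING_FACE_HUB_TOKEN environment variable) makes huggingface_hub attach the token for you; whether that actually raises the rate limits here is the open question above.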
@aikitoria cf reqwest's (an http lib for rust) standpoint on handling 429s: https://github.com/seanmonstar/reqwest/issues/169
Whilst the scopes of hf_transfer and reqwest are different, I can understand @Narsil's opinion that it is not in scope for the lib and should be handled on the user side, since it is a 4xx status code (a client error code). EDIT: well, I guess my argument is moot; were it a 400 Bad Request I wouldn't be saying the same 😅 It's tricky to find where the responsibility stops for the lib.
I personally think that the ergonomics of hf_transfer could be better when handling failure, transient errors do happen, rate-limiting being an example of one of these cases.
Perhaps we should explore resumability of download for hf_transfer.
I think it would make much more sense to handle this internally here (than in reqwest), because hf_transfer splits up the requests, so it's possible for a single chunk to fail without invalidating the entire download. That can't be handled sensibly from outside. Neither can dynamically adjusting how many chunks are downloaded in parallel to avoid hitting rate limits (such as by applying exponential backoff to each chunk request).
Though we don't actually have any proof yet that the problem was really caused by rate limits in the first place, so maybe this entire discussion is moot. Will need to see if I can get that custom build with trace enabled going in case it ever happens again.
I'm not really sure how I'd actually get huggingface-cli to use a custom build of hf_transfer. Currently I just install both from pip.
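The "dynamically adjusting how many chunks are downloaded in parallel" idea above is essentially AIMD, the same scheme TCP uses for congestion control. A toy sketch (names and policy are mine, not anything hf_transfer implements):

```python
class AdaptiveConcurrency:
    """Halve the parallel-chunk budget on a 429, grow it back one at a time."""

    def __init__(self, limit=64, floor=1, ceiling=64):
        self.limit = limit      # current number of chunks allowed in flight
        self.floor = floor      # never drop below this
        self.ceiling = ceiling  # never grow past this

    def on_rate_limited(self):
        self.limit = max(self.floor, self.limit // 2)  # multiplicative decrease

    def on_success(self):
        self.limit = min(self.ceiling, self.limit + 1)  # additive increase
```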
@aikitoria @chamini2
Hi, after running some tests, I believe this issue is not related to rate limiting but rather to the codebase itself. I have been experiencing frequent hangs lately and discovered that they are related to this library's error handling. It always creates a Python Exception instance even when the caller is a pure Rust function, and then tries to convert it to a string when handling it. Displaying a Python object is quite resource-intensive because it requires the Global Interpreter Lock (GIL), which can lead to a deadlock.
I have posted pull request #49, but it has not been reviewed yet. In the meantime, you can download the wheel from the release at https://github.com/lx200916/hf_transfer/tree/NoPyException and test it out!
Nice find! For some reason, I haven't had the issue in a couple months on the servers I use, so I'm unable to test whether this fixes it. No idea what changed.
@aikitoria @chamini2 Hi, after conducting some tests, I believe this issue is not related to rate limiting but rather to the codebase itself. [...]
It didn't work for me; it still gets stuck at 100%.
https://github.com/PyO3/pyo3/discussions/3732
https://github.com/PyO3/pyo3/issues/1305
TL;DR: tokio multithreading can cause a GIL deadlock.
The GIL is acquired when accessing the callback parameter.
(you don't want to re-download the entire file!).
I have downloaded the file 100 times already and it has never reached 100%. At the same time, the file downloads through the browser without problems. How can I disable this non-working code? Or where should I put the already-downloaded file so that the program does not try to download it again? How can I disable this download? This is not written anywhere.
How can I disable this download? This is not written anywhere.
This library should never run without you taking specific steps to activate it. Just remove whatever is opting into this library and everything should work fine.
There is a big disclaimer in this library about this: https://github.com/huggingface/hf_transfer/?tab=readme-ov-file#disclaimer
Again, you did not write which command to enter and how to disable it. So far, this "transfer" feels like it's been sewn into the code and the only way to fix it is to download the files in the browser.
Again, you did not write which command to enter and how to disable it. So far, this "transfer" feels like it's been sewn into the code and the only way to fix it is to download the files in the browser.
You didn't say which command/code you're using to invoke this library either; how can we help you remove it? The only thing I can say is that if something is enabling this library by default, it is doing it wrong.
As for the hanging, I've found and fixed one location where the GIL was deadlocking itself (although it's hard to reproduce, unfortunately, and it was linked to Python errors taking the locks).
The problem is in the program fluxgym, and it's not only me: https://github.com/cocktailpeanut/fluxgym/issues/115 https://github.com/cocktailpeanut/fluxgym/issues/264 https://github.com/pytorch/torchtune/pull/2046 And there were many more similar posts.
I tried set HF_HUB_ENABLE_HF_TRANSFER=0, but it didn't help.
I tried pip uninstall hf-transfer, but after that the program doesn't work.
hf-transfer is sewn in tightly and I couldn't turn it off. The problem is that at a certain percentage the counter simply stops forever. Resuming the download does not work: having downloaded the file to 99%, on restart it starts downloading again from 0, and so on every time. At the same time, everything downloads through the browser without problems.
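For anyone else stuck here: as far as I can tell, huggingface_hub reads HF_HUB_ENABLE_HF_TRANSFER once, at import time, and only explicit truthy strings enable the feature. A sketch of that parsing (my approximation, not huggingface_hub's actual code):

```python
import os


def env_flag_enabled(name, environ=os.environ):
    """Approximate how boolean env flags are parsed: only explicit truthy
    strings count as enabled; anything else (including unset) is disabled."""
    return str(environ.get(name, "")).upper() in {"1", "ON", "YES", "TRUE"}


# HF_HUB_ENABLE_HF_TRANSFER=0, unset, or any other non-truthy value disables it.
```

The catch is that the variable has to be in the environment of the Python process before huggingface_hub is imported, and a launcher like fluxgym may set it explicitly for itself or its subprocesses, overriding whatever you exported in the shell.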
@KLL535 that means you need to open an issue in https://github.com/cocktailpeanut/fluxgym to ask them to remove this library or to offer a way to not use it.
This repo cannot do anything about that.