No error is thrown when model download fails due to insufficient space
Describe the bug
When downloading a model, file_download.py only emits a warning and does not raise an error when there is not enough free disk space.
https://github.com/huggingface/huggingface_hub/blob/2702ec2a2bd0124cc1fddfd72ccb1297b2478148/src/huggingface_hub/file_download.py#L651
This is problematic in environments like sglang, where the server does not exit even though the model weights never finish downloading: https://github.com/sgl-project/sglang/issues/2801
Reproduction
Download weights on a device without enough free space and observe that there is no indication of the model download failure.
Logs
No response
System info
- any version of `huggingface_hub`
At the very least, I think there should be a flag of some sort to raise an exception here, or the check should raise an exception unconditionally.
How this impacts sglang: https://github.com/sgl-project/sglang/issues/2801
Hi @atbe, sorry you encountered this issue!
A bit of context: we chose not to raise an exception when the user does not have enough disk space, to avoid unconditionally blocking downloads in valid setups. In some environments, the value returned by shutil.disk_usage(path).free may not accurately reflect the space actually available. By only warning, we allow for more flexibility in these edge cases.
Another alternative is to manually check for sufficient disk space on your side before calling snapshot_download() or any other downloading function. That way you have explicit control over how to handle an insufficient space error.
Let me know what you think.
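A minimal sketch of such a manual check, using only the standard library. The helper name `ensure_free_space` and the `required_bytes` argument are hypothetical; how the required size is obtained is left to the caller, and the same caveat applies that shutil.disk_usage(path).free may be inaccurate in some environments.

```python
import errno
import shutil
from pathlib import Path


def ensure_free_space(target_dir: str, required_bytes: int) -> None:
    """Raise OSError(ENOSPC) if target_dir's filesystem has too little free space.

    Hypothetical helper, not part of huggingface_hub.
    """
    path = Path(target_dir)
    # The cache directory may not exist yet, so walk up to the nearest
    # existing ancestor before asking the OS for disk usage.
    while not path.exists():
        path = path.parent
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise OSError(
            errno.ENOSPC,
            f"Need {required_bytes / 1e6:.2f} MB but only {free / 1e6:.2f} MB free",
            str(path),
        )
```

Calling `ensure_free_space(cache_dir, required_bytes)` right before snapshot_download() lets the caller decide whether a shortfall is fatal.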
thanks for the reply @hanouticelina !
> Another alternative is to manually check for sufficient disk space on your side before calling snapshot_download() or any other downloading function. That way you have explicit control over how to handle an insufficient space error.
I do think it's a bit odd that the server doesn't exit when serving legitimately fails (in this case due to insufficient space), don't you feel the same? You could check for space manually, but that just feels like a hack compared to having the server properly detect that it failed to start and exit.
Hi @atbe,
I reproduced the scenario using a container where we call snapshot_download without enough disk space for the files we're downloading.
Here is the traceback I get:
Fetching 14 files: 0%| | 0/14 [00:00<?, ?it/s]/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3945.44 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3421.60 MB free disk space.
warnings.warn(
/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3864.73 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3421.60 MB free disk space.
warnings.warn(
LICENSE: 100%|████████████████████████████████████| 11.3k/11.3k [00:00<00:00, 28.6MB/s]
generation_config.json: 100%|█████████████████████████| 243/243 [00:00<00:00, 5.31MB/s]
.gitattributes: 100%|█████████████████████████████| 1.52k/1.52k [00:00<00:00, 18.4MB/s]
README.md: 100%|██████████████████████████████████| 6.00k/6.00k [00:00<00:00, 80.2MB/s]
config.json: 100%|████████████████████████████████████| 663/663 [00:00<00:00, 6.11MB/s]
model.safetensors.index.json: 100%|████████████████| 27.8k/27.8k [00:00<00:00, 140MB/s]
/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3864.73 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3404.50 MB free disk space. | 0.00/27.8k [00:00<?, ?B/s]
warnings.warn(
/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:651: UserWarning: Not enough free disk space to download the file. The expected file size is: 3556.38 MB. The target location /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B-Instruct/blobs only has 3404.50 MB free disk space.
warnings.warn(
tokenizer_config.json: 100%|██████████████████████| 7.30k/7.30k [00:00<00:00, 49.9MB/s]
merges.txt: 100%|█████████████████████████████████| 1.67M/1.67M [00:00<00:00, 4.44MB/s]
vocab.json: 100%|█████████████████████████████████| 2.78M/2.78M [00:00<00:00, 3.05MB/s]
tokenizer.json: 100%|█████████████████████████████| 7.03M/7.03M [00:01<00:00, 5.13MB/s]
model-00002-of-00004.safetensors: 23%|██▋ | 881M/3.86G [00:26<01:30, 33.1MB/s]
model-00004-of-00004.safetensors: 23%|██▊ | 818M/3.56G [00:26<01:27, 31.2MB/s]
model-00001-of-00004.safetensors: 23%|██▋ | 902M/3.95G [00:26<01:29, 33.9MB/s]
Fetching 14 files: 43%|█████████████▋ | 6/14 [00:26<00:35, 4.48s/it]
model-00003-of-00004.safetensors: 22%|██▋ | 849M/3.86G [00:26<01:34, 32.0MB/s]
Traceback (most recent call last):
File "<string>", line 7, in <module>
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py", line 296, in snapshot_download
thread_map(
File "/opt/venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/_snapshot_download.py", line 270, in _inner_hf_hub_download
return hf_hub_download(
^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 860, in hf_hub_download
return _hf_hub_download_to_cache_dir(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1009, in _hf_hub_download_to_cache_dir
_download_to_tmp_and_move(
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1543, in _download_to_tmp_and_move
http_get(
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 455, in http_get
temp_file.write(chunk)
OSError: [Errno 28] No space left on device
As you can see, the script does properly signal the failure by raising an OSError: [Errno 28] No space left on device. I may be missing something, but it's the responsibility of the server application to catch this error, handle the failure (with custom logging, for example) and shut down the server. We don't throw the error at the beginning for the reasons mentioned in this comment.
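As a rough illustration of handling this on the application side, the sketch below wraps any download callable and exits when the disk fills up. The function name `download_or_exit` and the logger name are hypothetical, and this is not how sglang or any particular server actually does it.

```python
import errno
import logging
import sys
from typing import Callable

logger = logging.getLogger("model_server")  # hypothetical logger name


def download_or_exit(download: Callable[[], str]) -> str:
    """Run `download`; exit the process with status 1 if the disk is full."""
    try:
        return download()
    except OSError as exc:
        # Errno 28 (ENOSPC) is the error shown in the traceback above.
        if exc.errno == errno.ENOSPC:
            logger.error("Model download failed: %s", exc)
            sys.exit(1)
        # Any other OS error is not ours to handle here.
        raise
```

A server could call it as `download_or_exit(lambda: snapshot_download("Qwen/Qwen2.5-7B-Instruct"))` during startup, so that a failed download stops the process instead of leaving it running without weights.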
Agree with @hanouticelina here. The check you are referring to is made before actually downloading the file, to warn the user early. We don't want to raise an exception at this stage for the reason explained above. But in any case, an exception will be raised once the disk space actually runs out.
FWIW, I agree with @atbe. This is not my application code. I shouldn't have to guess whether the error is a false positive (or even care if it is). There should be an option to force an exit on a disk space error, even if the default remains the same.
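Until such an option exists, one possible user-side workaround is Python's standard warnings filter, which can turn a specific warning into an exception. This is only a sketch: the message prefix is copied from the UserWarning in the log above and is assumed to stay stable across huggingface_hub versions, and the function name is hypothetical.

```python
import warnings

# Assumption: prefix taken from the UserWarning text in the log above;
# it may differ in other huggingface_hub versions.
DISK_SPACE_WARNING = "Not enough free disk space"


def fail_fast_on_low_disk_space() -> None:
    """Make the early disk-space warning raise instead of only printing."""
    # "error" converts matching warnings into exceptions; `message` is a
    # regex matched against the start of the warning text.
    warnings.filterwarnings(
        "error", message=DISK_SPACE_WARNING, category=UserWarning
    )
```

Calling `fail_fast_on_low_disk_space()` once at startup means the pre-download check raises a UserWarning exception instead of printing. The trade-off is the one described earlier in the thread: in environments where the reported free space is inaccurate, this blocks downloads that would have succeeded.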