
huggingface-cli downloads leading to excessive filesystem fragmentation on Windows

drhead opened this issue 8 months ago · 2 comments

Describe the bug

For some reason I appear to be getting excessive fragmentation from freshly downloaded files:

[Image: screenshot of a disk fragmentation report]

This is right after a defrag and consolidation. I know for a fact that there were enough contiguous free regions on disk for every file in this dataset to be stored contiguously, but that doesn't seem to be happening.

If the downloader isn't pre-allocating space for the whole file before writing it out, that is probably the cause. Watching disk activity during file writes, I also observe disk queue lengths longer than I would expect for a single large sequential write (they stay above 1 while files are being written).

Reproduction

Easiest way to measure the issue: first, run `defrag <drive letter>: /A /V` and note the fragmented space percentage and number of fragmented files as a baseline.

Then download any dataset containing moderately large files on Windows, ideally to a drive that has been used and has had files deleted from it, so that its free space consists of several non-contiguous large chunks. In my case, I downloaded https://huggingface.co/datasets/dalle-mini/open-images, which ended up with files fragmented into dozens or hundreds of pieces.

Afterwards, run `defrag <drive letter>: /A /V` again and note that the number of fragmented files has probably increased.
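
For reference, a minimal way to trigger the same download path from Python (a sketch; `snapshot_download` is the programmatic equivalent of `huggingface-cli download` and writes into the default `HF_HUB_CACHE` location):

```python
# Minimal reproduction sketch: pulls the dataset through huggingface_hub's
# standard chunked-download path into the default cache directory.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="dalle-mini/open-images", repo_type="dataset")
```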

Logs


System info

- huggingface_hub version: 0.29.1
- Platform: Windows-10-10.0.22631-SP0
- Python version: 3.10.8
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: C:\Users\<snip>\.cache\huggingface\token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: 2.12.0
- Torch: 2.6.0+cu126
- Jinja2: 3.1.2
- Graphviz: N/A
- keras: 2.12.0
- Pydot: N/A
- Pillow: 9.0.0
- hf_transfer: 0.1.9
- gradio: 3.31.0
- tensorboard: N/A
- numpy: 1.23.5
- pydantic: 1.10.2
- aiohttp: 3.8.1
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: C:\Users\<snip>\.cache\huggingface\hub
- HF_ASSETS_CACHE: C:\Users\<snip>\.cache\huggingface\assets
- HF_TOKEN_PATH: C:\Users\<snip>\.cache\huggingface\token
- HF_STORED_TOKENS_PATH: C:\Users\<snip>\.cache\huggingface\stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

drhead avatar Apr 28 '25 00:04 drhead

Hi @drhead 👋 Thanks for raising this.

huggingface_hub streams each download in small chunks and appends them to a temp file. Because the file keeps growing incrementally, Windows keeps placing each new bit of data in the first free spot it finds, which is why you're seeing many fragments. Could you tell us what concrete side effects you're seeing from the fragmentation, i.e. do you notice a read speed drop? Also, I'm curious to know whether this is an SSD. If I'm not mistaken, fragmentation is supposed to be essentially free on an SSD.
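
To illustrate, the download loop is roughly equivalent to the following sketch (a simplification, not the library's actual code): the file grows one chunk at a time, so the filesystem allocates clusters piecemeal rather than reserving a contiguous extent up front.

```python
import requests  # stand-in for the library's HTTP layer


def naive_streaming_download(url: str, dest: str) -> None:
    # Each chunk is appended as it arrives; the file's length grows
    # incrementally, so NTFS allocates clusters a few at a time from
    # whatever free regions happen to be nearest.
    with requests.get(url, stream=True) as resp, open(dest, "wb") as f:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
```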

hanouticelina avatar Apr 28 '25 14:04 hanouticelina

This is a magnetic HDD, not an SSD. The severely fragmented files are indeed much slower to read. Writing files out in small chunks also breaks up the remaining contiguous free space on the drive, which on a nearly full drive means that 1) fragmentation gets progressively worse for newly downloaded files as the drive fills, and 2) any files I download or create later will also be fragmented, regardless of whether their space is preallocated, because there is simply no contiguous space left.

I managed to work around some of this in the meantime with a downloader script that preallocates by seeking to what should be the end of the file (according to Content-Length) and writing a single byte there. Sure enough, every file downloaded with that script came out completely contiguous. There are probably better, platform-specific ways to preallocate, but this method works as a platform-independent example; see the sketch below.
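
A minimal sketch of that workaround (assuming a plain `requests`-based downloader and that the server returns a Content-Length header; the helper name is mine, not huggingface_hub's API):

```python
import requests  # assumption: plain HTTP download, not huggingface_hub internals


def download_preallocated(url: str, dest: str, chunk_size: int = 1024 * 1024) -> None:
    """Download `url` to `dest`, preallocating the full file size up front
    so the filesystem can pick a single contiguous run of clusters."""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        size = int(resp.headers["Content-Length"])  # assumes the header is present
        with open(dest, "wb") as f:
            if size > 0:
                # Seek to the last byte and write it once; this forces the
                # filesystem to allocate the whole extent before streaming starts.
                f.seek(size - 1)
                f.write(b"\0")
                f.seek(0)
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```

On Linux, `os.posix_fallocate()` would be the more direct primitive for this; the seek-and-write trick is just the portable version that also forces allocation on NTFS.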

drhead avatar Apr 28 '25 15:04 drhead