Checksum validation with hf_hub_download on model files.
Is your feature request related to a problem? Please describe. After reviewing #1738 and #2223, it looks like file checksums are only computed on the cache dir under specific conditions. Ideally, a user could knowingly force a checksum check after download, as well as on retrieval from the cache, to ensure the integrity of the files in any usage.
It's possible I misunderstood the code or discussion though.
Describe the solution you'd like
Add an input arg and an environment variable to enforce checksum validation of the retrieved files on each hf_hub_download call.
Describe alternatives you've considered Pre-downloading files and manually checking their integrity before using the cached files.
Hi @JGSweets, thanks for opening the issue. The two PRs you've linked are only related to "downloading to a local directory", not the generic "downloading into the HF cache directory" workflow. If we add such a validation, we would do it for both. The main problem with checking file integrity after a download is the time it takes:
- I don't want to do it by default, because of the extra overhead it would add for users (computing a sha256 over GBs of data can take minutes)
- I'm reluctant to add a new parameter, as it would most likely not be consistently propagated by all libraries using huggingface_hub. What we could do instead is provide a separate utility that checks the sha256 of files. Users wanting extra security would have to execute it before loading files. Would something like this work for you?
- an environment variable could work, but I'm worried it would result in more client-side logic for a feature that will most likely stay hidden/unused by end users. To be honest, I'm still unsure we really need such a feature. We already check the downloaded file size to ensure all bytes have been retrieved from the server (in case of network issues). Checking the checksum is an extra security step to ensure the Hub is not corrupted, but that could be checked on our side as well.
cc @Pierrci @julien-c in case you have other opinions
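For illustration, the separate checksum utility floated above could look roughly like this. It is a sketch only: `HfApi.get_paths_info` and the `lfs.sha256` attribute exist in recent huggingface_hub releases, but the exact attribute layout should be checked against your installed version, and `verify_against_hub` is a hypothetical name, not an existing API.

```python
import hashlib


def local_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB weights don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_against_hub(repo_id: str, filename: str) -> bool:
    """Compare a downloaded file against the sha256 stored on the Hub.

    Imports are deferred so the hashing helper above stays stdlib-only.
    """
    from huggingface_hub import HfApi, hf_hub_download

    info = HfApi().get_paths_info(repo_id, [filename])[0]
    if info.lfs is None:
        # Non-LFS files only expose a git blob sha1, not a sha256.
        return True
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return local_sha256(path) == info.lfs.sha256
```

A caller wanting extra security would run this once per file before loading the weights.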
Bit late to this, but is there any specific option that can help me do this when using hf_hub_download or snapshot_download?
The problem I am having is that the dataset is too large and sometimes the metered connection cuts off. I want to skip the files that are already downloaded, but I also have to check for corrupted files. Is it possible for hf_hub_download to check if a file is in the cache, also check if the shasum matches, and re-download the file if it doesn't?
> Is it possible for hf_hub_download to check if a file is in the cache, also check if the shasum matches, and re-download the file if it doesn't?
No, we currently don't have such a tool. But if a file is in the cache, it should not be corrupted. Have you ever encountered such an issue, or is it theoretical for now? If yes, I'd be curious to have more details about it. In general, files are always downloaded to temporary files and moved to the cache only when fully downloaded. This ensures that connection issues do not corrupt the cache itself.
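In the meantime, a local-only check is possible thanks to the cache layout: snapshot entries are symlinks into a `blobs/` directory where LFS blobs are named after their sha256 hex digest (small non-LFS blobs use a 40-char git sha1 instead). Recomputing the hash and comparing it to the blob filename detects on-disk corruption without any network call. A minimal sketch, assuming that layout (the function name `verify_cached_file` is hypothetical):

```python
import hashlib
from pathlib import Path


def verify_cached_file(resolved_path: str, chunk_size: int = 1 << 20) -> bool:
    """Check a file returned by hf_hub_download against its blob name.

    In the HF cache, LFS blobs are stored under their sha256 hex digest,
    so the expected checksum is encoded in the filename itself.
    """
    blob = Path(resolved_path).resolve()  # follow the snapshot symlink
    name = blob.name
    if len(name) != 64 or any(c not in "0123456789abcdef" for c in name):
        # Not a sha256-named LFS blob (e.g. a git-sha1-named small file);
        # nothing to verify with this trick.
        return True
    h = hashlib.sha256()
    with open(blob, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == name
```

A caller could then re-fetch any file that fails this check by passing `force_download=True` to `hf_hub_download`.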
@Wauplin We consistently encounter corrupted weights on our closed corporate network, which routes traffic through proxies. As a result, we have to manually validate the sha256sum each time. It would be nice if this could be done automatically.
See also the more recent https://github.com/huggingface/huggingface_hub/issues/3298