
Checksum validation with hf_hub_download on model files.

Open JGSweets opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe. After reviewing #1738 and #2223, it looks like file checksums are only computed on the cache dir under specific conditions. Ideally, a user could deliberately force a checksum after download, as well as on retrieval from cache, to ensure the integrity of the files on every use.

It's possible I misunderstood the code or discussion though.

Describe the solution you'd like Add an input argument and an environment variable to enforce checksum validation of the retrieved files on each hf_hub_download call.

Describe alternatives you've considered Pre-downloading files manually and manually checking file integrity before using the cached files.

JGSweets avatar Jul 01 '24 15:07 JGSweets

Hi @JGSweets, thanks for opening the issue. The 2 PRs you've linked are only related to "downloading to a local directory", not the generic "downloading into the HF cache directory" workflow. If we add such a validation, we would do it for both. The main problem with checking the file integrity after a download is the time it takes to do it:

  • I don't want to do it by default, because of the extra overhead it would have for users (computing a sha256 on GBs of data can take minutes)
  • I'm a bit reluctant to add a new parameter, as it would most likely not be consistently propagated to all libraries using huggingface_hub
  • what we could do is provide an extra utility that checks the sha256 of files separately. Users wanting extra security would have to execute it before loading files. Would something like this work for you?
  • why not an environment variable, but I'm worried it would result in more client-side logic for a feature that will most likely be hidden/not used by end users.

To be honest, I'm still unsure we really need such a feature. We are already checking the downloaded file size to ensure all bytes have been retrieved from the server (in case of network issues). Checking the checksum is an extra security step to ensure the Hub is not corrupted, but that could be checked on our side as well.
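For illustration, a separate checking utility of the kind mentioned above could look roughly like the sketch below. It is not part of huggingface_hub: the names sha256_of_file and verify_file are hypothetical, and the expected hash is assumed to come from a trusted source (for example, the SHA256 displayed on the file's page on the Hub).

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_sha256):
    """Return True when the file's digest matches the expected hex string.

    `expected_sha256` is assumed to be supplied by the caller from a trusted source.
    """
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Reading in fixed-size chunks keeps memory flat even for multi-GB weights, but the whole file still has to be read once, which is where the overhead discussed above comes from.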

cc @Pierrci @julien-c in case you have another opinion

Wauplin avatar Jul 02 '24 10:07 Wauplin

Bit late to this, but is there any specific option which can help me do this when using hf_hub_download or snapshot_download?

The problem I am having is that the dataset is too large and sometimes the metered connection cuts things off. I want to skip the files which are already downloaded, but I also have to check for corrupted files. Is it possible that hf_hub_download can check if a file is in the cache and also check if the shasum matches, and if not, re-download the file?

berserker1 avatar Apr 30 '25 01:04 berserker1

is it possible that hf_hub_download can check if a file is in the cache and also check if the shasum matches, and if not, re-download the file?

No, we currently don't have such a tool. But if a file is in the cache, it should not be corrupted. Have you ever encountered such an issue, or is it theoretical for now? If yes, I'd be curious to have more details about it. In general, files are always downloaded to temporary files and moved to the cache only when fully downloaded. This ensures that connection issues do not corrupt the cache itself.
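The "download to a temporary file, then move" pattern described above can be sketched generically as follows. This is only an illustration of the idea, not huggingface_hub's actual code; atomic_write and the ".incomplete" suffix used here are hypothetical choices.

```python
import os
import tempfile


def atomic_write(chunks, dest_path):
    """Write an iterable of byte chunks to a temp file, then move it into place.

    If the chunk source fails midway (e.g. a dropped connection), dest_path
    is never created and the partial temp file is removed.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Create the temp file in the destination directory so the final
    # os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".incomplete")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the partial file before re-raising the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

This guards against truncated files landing at the final path, but it does not detect bytes altered in transit; that case would still need a checksum comparison.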

Wauplin avatar Apr 30 '25 08:04 Wauplin

@Wauplin We consistently encounter corrupted weights in our closed corporate network, which routes traffic through proxies. As a result, we have to manually validate the sha256sum each time. It would be nice if this could be done automatically.
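Until something like this exists in the library, the manual validation could be automated with a small wrapper along these lines. This is a sketch under assumptions: fetch_verified and its download callable are hypothetical names, and the expected hash must be obtained separately from a trusted source.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(download, expected_sha256, max_attempts=3):
    """Call download(force) until the returned file matches expected_sha256.

    `download` is a hypothetical callable supplied by the caller: it takes a
    bool (True means "force a fresh download") and returns a local file path.
    """
    expected = expected_sha256.strip().lower()
    for attempt in range(max_attempts):
        # The first attempt may reuse a cached copy; later attempts force a re-download.
        path = download(attempt > 0)
        if file_sha256(path) == expected:
            return path
    raise ValueError(f"checksum mismatch after {max_attempts} attempts")
```

With huggingface_hub, the callable could be something like `lambda force: hf_hub_download(repo_id, filename, force_download=force)`, which relies on the existing force_download parameter to bypass the cached copy.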

hadipash avatar Aug 11 '25 10:08 hadipash

See also the more recent https://github.com/huggingface/huggingface_hub/issues/3298

julien-c avatar Aug 14 '25 06:08 julien-c