
RuntimeError: unable to open file when calling from_pretrained on multiple processes after upgrading huggingface_hub to 0.23.1

Open eliphatfs opened this issue 1 year ago • 11 comments

System Info

- `transformers` version: 4.40.0
- Platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.23.1
- Safetensors version: 0.4.3
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes, 8xH100
- Using distributed or parallel set-up in script?: Yes, PyTorch DDP

Who can help?

No response

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

def get_vit(self):
    from transformers import Dinov2Model
    return Dinov2Model.from_pretrained("facebook/dinov2-large")
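
For reference, a minimal standalone version of this repro might look like the sketch below (the script scaffolding and launch command are illustrative; only get_vit comes from my code):

# repro.py -- launched with e.g.: torchrun --nproc_per_node=8 repro.py
import torch.distributed as dist
from transformers import Dinov2Model


def get_vit():
    # Every rank calls from_pretrained concurrently against the same cache.
    return Dinov2Model.from_pretrained("facebook/dinov2-large")


if __name__ == "__main__":
    dist.init_process_group("nccl")
    get_vit()
    dist.destroy_process_group()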

Run get_vit() with torchrun across multiple nodes and multiple GPUs per node; the code fails after the download with:

/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
(...)  # the same FutureWarning is printed by each of the remaining 7 processes
Traceback (most recent call last):
  File "/root/.../train.py", line 143, in main
    model = Wrapper().cuda(local_rank)
  File "/root/.../train.py", line 134, in __init__
    self.module = DVIM()
  File "/root/.../model/net.py", line 235, in __init__
    dino = self.get_vit()
  File "/root/.../model/net.py", line 311, in get_vit
    return Dinov2Model.from_pretrained("facebook/dinov2-large")
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3456, in from_pretrained
    with safe_open(resolved_archive_file, framework="pt") as f:
RuntimeError: unable to open file </root/.cache/huggingface/hub/models--facebook--dinov2-large/snapshots/47b73eefe95e8d44ec3623f8890bd894b6ea2d6c/model.safetensors> in read-only mode: No such file or directory (2)

The same code was working with the following environment:

- `transformers` version: 4.40.0
- Platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.3
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, PyTorch DDP

Expected behavior

from_pretrained should load the model successfully on every process, as it did with huggingface_hub 0.22.2.

eliphatfs avatar May 24 '24 20:05 eliphatfs

I'm also facing the same issue with the same version of huggingface_hub.

mkucha avatar May 27 '24 09:05 mkucha

cc @Wauplin

LysandreJik avatar May 27 '24 12:05 LysandreJik

Have you tried upgrading huggingface_hub? That fixed the bug for me.

Davido111200 avatar May 29 '24 15:05 Davido111200

Hi @eliphatfs and @mkucha, sorry you're facing this issue. Is it still happening? If so, could you copy-paste the output of huggingface-cli env? It would help us investigate your issue.

Also

Wauplin avatar May 30 '24 11:05 Wauplin

Hi @Wauplin, I suspect the cause of the problem is as follows:

  1. The model file is not found in the cache_dir.
  2. The fastest process downloads the file from the Hub.
  3. The other processes attempt to load the file before the download has completed.

I can avoid this problem as follows:

import os

from torch import distributed as dist
from transformers import AutoModel


def get_vit():
    local_rank = int(os.environ["LOCAL_RANK"])
    if local_rank == 0:
        # Rank 0 downloads the weights and populates the cache first.
        model = AutoModel.from_pretrained("facebook/dinov2-large")
    dist.barrier()  # all ranks wait here until the download has finished
    if local_rank != 0:
        # The remaining ranks now load from the already-populated cache.
        model = AutoModel.from_pretrained("facebook/dinov2-large")
    return model


dist.init_process_group("nccl")
get_vit()

My huggingface-cli env output:
- huggingface_hub version: 0.23.3
- Platform: Linux-4.19.93-1.nbp.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.8
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 1.13.1
- Jinja2: 2.11.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 9.3.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.24.3
- pydantic: N/A
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /root/.cache/huggingface/hub
- HF_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

leeesangwon avatar Jun 12 '24 09:06 leeesangwon

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 02 '24 08:08 github-actions[bot]

Hi @eliphatfs and @mkucha, sorry you're facing this issue. Is it still happening? If so, could you copy-paste the output of huggingface-cli env? It would help us investigate your issue.

Also

I think I already included it in the original comment.

eliphatfs avatar Aug 03 '24 02:08 eliphatfs

@Wauplin just chiming in to say that this error hasn't gone away even with the latest huggingface_hub patch release:

System Info

Here's my huggingface-cli env:

- huggingface_hub version: 0.24.5
- Platform: Linux-6.5.0-1023-aws-x86_64-with-glibc2.35
- Python version: 3.11.9
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/ray/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.4.0
- Jinja2: 3.1.2
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: 2.6.2.2
- numpy: 1.26.4
- pydantic: 2.5.0
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/ray/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/ray/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/ray/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Steps to reproduce

You can simply run an accelerate example with num_processes > 1: https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py

The issue is most certainly due to a race condition as it only happens with multi-process code.

Short term fix

Also, for anyone else facing the issue, a simple solution is to download the model snapshot beforehand in a thread-safe manner, and then perform your model instantiation. This works for all distributed training settings, including DeepSpeed ZeRO-3. Note that with ZeRO-3 there's an implicit barrier somewhere inside HF code during .from_pretrained, so model instantiation for all processes should happen together (i.e. you can't use accelerate.main_process_first() or even protect it with a file lock).

So you would do:

+ from pathlib import Path
+ from filelock import FileLock
+ from huggingface_hub import snapshot_download
+
+ lock_path = Path(f"~/.cache/huggingface/hub/{model_id.replace('/', '--')}.lock").expanduser()
+ lock_path.parent.mkdir(parents=True, exist_ok=True)
+ with FileLock(lock_path):
+     model_path = snapshot_download(repo_id=model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_path)
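
Put together, a self-contained version of this workaround might look like the sketch below (the load_model_safely helper is mine, not an official API; it assumes the filelock package is installed and the default HF cache location):

# Hypothetical helper wrapping the diff above.
from pathlib import Path

from filelock import FileLock
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM


def load_model_safely(model_id: str):
    # One lock file per repo, so local processes serialize the download.
    lock_path = Path(f"~/.cache/huggingface/hub/{model_id.replace('/', '--')}.lock").expanduser()
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with FileLock(lock_path):
        # The first process to take the lock downloads; the rest hit the cache.
        model_path = snapshot_download(repo_id=model_id)
    # Load from the fully materialized local snapshot rather than the repo id.
    return AutoModelForCausalLM.from_pretrained(model_path)


model = load_model_safely("gpt2")  # any causal-LM repo id

For plain DDP without ZeRO-3, wrapping the download in accelerate's main_process_first() context manager is another option, as noted above.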

SumanthRH avatar Aug 04 '24 21:08 SumanthRH

Do we know what the root cause might be? The barrier should have existed before; it's not clear which package update introduced this race condition.

(...)  # quoting @leeesangwon's workaround and huggingface-cli env from above

OK, after more tests with the suggestions above, especially from @leeesangwon: it seems the race condition only occurs with a larger number of processes, for example 8. @Wauplin, any follow-up on the root cause? Would it be reasonable to add tests for this?
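
A sketch of such a test might look like this (hypothetical, not from the huggingface_hub test suite; it simply spawns several processes that race to populate a fresh cache):

# Hypothetical stress test: 8 processes race to download into an empty cache.
import multiprocessing as mp
import shutil
import tempfile


def load(cache_dir):
    from transformers import AutoModel
    AutoModel.from_pretrained("facebook/dinov2-large", cache_dir=cache_dir)


if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()  # fresh cache so every process must download
    procs = [mp.Process(target=load, args=(cache_dir,)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # With the race condition, at least one process dies with "unable to open file".
    assert all(p.exitcode == 0 for p in procs)
    shutil.rmtree(cache_dir)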

Victordongy avatar Aug 14 '24 18:08 Victordongy

Pretty sure this is linked/related to #27421 and it seems like it would be nice to have a fix for that!

ArthurZucker avatar Aug 26 '24 13:08 ArthurZucker

Still impacted by this issue.

jiagaoxiang avatar Aug 27 '24 21:08 jiagaoxiang

(I pinged @Wauplin I think he is having a look!)

ArthurZucker avatar Aug 28 '24 09:08 ArthurZucker

The error seems to be caused by the pipeline trying to access the model without waiting for the download to finish.

jiagaoxiang avatar Aug 28 '24 16:08 jiagaoxiang

Hey folks, sorry for the inconvenience. For some reason the lockfile is not working as expected. Investigating it is not easy, but I'll make sure we find a way to definitively fix this :crossed_fingers:

Wauplin avatar Aug 30 '24 14:08 Wauplin

~Hey everyone, I have not been able to reproduce the error but can someone try to run this script (or something similar) on multiple processes:~

(...)  # non-relevant

~It mimics what is supposed to happen when downloading a file on several processes. If this script fails the same way, that would be a win to help investigate things. In theory the filelock prevents any concurrent access but on multi-gpu multi-processes use cases, that might be causing problems.~

~Thanks in advance!~

Wauplin avatar Sep 04 '24 16:09 Wauplin

Hi there, sorry for the long delay. We've finally been able to track down and fix the race-condition issue! You can find the fix in the latest patch release, huggingface_hub==0.24.7. More details in https://github.com/huggingface/huggingface_hub/releases/tag/v0.24.7 :hugs:

Wauplin avatar Sep 12 '24 09:09 Wauplin