
Sudden random bug

Open surya-narayanan opened this issue 2 years ago • 6 comments

System Info

Here is the traceback:


  File "/home/suryahari/Vornoi/QA.py", line 5, in <module>
    model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/suryahari/miniconda3/envs/diffusers/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/suryahari/miniconda3/envs/diffusers/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2629, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/suryahari/miniconda3/envs/diffusers/lib/python3.11/site-packages/transformers/modeling_utils.py", line 447, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: No such device (os error 19)
  • transformers version: 4.31.0
  • Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.35
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes but can avoid
  • Using distributed or parallel set-up in script?: not really

Who can help?

@Narsil ? @younesbelkada @ArthurZucker @amyeroberts

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Create a new env and run the following code


# Load model directly
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

This also happened to me while running diffusers code; I'm just posting the QA code for now.

Expected behavior

The model should load without error.

surya-narayanan avatar Jul 28 '23 23:07 surya-narayanan

I can't really reproduce this and have not seen it anywhere else. The OSError suggests the device is not available, which most likely means the path to your Hugging Face cache cannot be reached (not mounted, wrong path, etc.). A simple reproducer is available here.
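If the cache path is the suspect, a quick sanity check is to probe it directly before involving transformers. This is only a diagnostic sketch; it assumes the default cache location (`~/.cache/huggingface`, overridable via `HF_HOME`), and the helper name `check_cache` is made up for illustration:

```python
import os
from pathlib import Path

# Default Hugging Face cache location; the HF_HOME env var overrides it.
cache_dir = Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))

def check_cache(path: Path) -> list:
    """Return a list of problems with the cache path (empty list = looks OK)."""
    problems = []
    if not path.exists():
        problems.append(f"{path} does not exist (unmounted or wrong path?)")
    elif not os.access(path, os.R_OK):
        problems.append(f"{path} is not readable")
    return problems

print(check_cache(cache_dir) or "cache path looks reachable")
```

If this reports a problem, the `from_pretrained` failure is about the filesystem, not the model.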

ArthurZucker avatar Jul 31 '23 08:07 ArthurZucker

I've seen that happen when using network mounted disks.

If the network is flaky then the read might fail even though the rest went fine. Error should be transient though. Could that be it ?
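If the error really is a transient network-mount hiccup, wrapping the load in a retry is one way to confirm it. A minimal sketch (the `load_with_retry` helper is hypothetical, not part of transformers):

```python
import time

def load_with_retry(load_fn, attempts=3, base_delay=1.0):
    """Retry a loader on OSError: on a flaky network mount a single read can
    fail transiently even though the rest of the download/cache is fine."""
    for i in range(attempts):
        try:
            return load_fn()
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))  # exponential backoff

# Hypothetical usage:
# model = load_with_retry(
#     lambda: AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
# )
```

If the load still fails after several attempts, the cause is probably not transient.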

Narsil avatar Jul 31 '23 16:07 Narsil

Not sure - the program fails even in a fresh env on my computer but works in Google Colab. @ArthurZucker, the link you sent has permission issues.

surya-narayanan avatar Jul 31 '23 20:07 surya-narayanan

> I've seen that happen when using network mounted disks.
>
> If the network is flaky then the read might fail even though the rest went fine. Error should be transient though. Could that be it ?

We hit the same issue. Are there any other reasons that could cause it besides network fluctuation? Thanks! @Narsil

YichuanSun avatar Jan 18 '24 09:01 YichuanSun

Have you solved this problem? Why was this issue closed? Thanks! @surya-narayanan

YichuanSun avatar Jan 18 '24 09:01 YichuanSun

I didn't solve it, but I hit this bug again today, only to discover it was one I had raised way back, lol.

surya-narayanan avatar Feb 15 '24 21:02 surya-narayanan

I am having the same issue when trying to load a local checkpoint:

model = AutoModelForCausalLM.from_pretrained(
    "./training_checkpoints/new_model",
    quantization_config=quant_config,
    device_map="cuda:0",
)

(Screenshot attached showing the contents of new_model: Screenshot 2024-03-06 at 8 59 59 AM)

c3-moutasem avatar Mar 06 '24 17:03 c3-moutasem

I think it comes up because of permission issues - i.e. if you aren't sudo on your machine and the program has to write to /tmp/ or similar.
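If restricted write access is the cause, one workaround is to point the Hugging Face cache at a directory you definitely own. The `HF_HOME` environment variable is the documented override; the specific path below is only an example, and it must be set before transformers/huggingface_hub are imported:

```python
import os

# Redirect the Hugging Face cache to a user-writable directory.
# "~/hf-cache" is an arbitrary example; any writable path works.
writable_cache = os.path.expanduser("~/hf-cache")
os.makedirs(writable_cache, exist_ok=True)
os.environ["HF_HOME"] = writable_cache  # set BEFORE importing transformers
```

Equivalently, `export HF_HOME=~/hf-cache` in the shell before running the script.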

surya-narayanan avatar Mar 13 '24 23:03 surya-narayanan

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 07 '24 08:04 github-actions[bot]

We hit this issue with a FUSE-mounted client. The problem was that the FUSE mount didn't support mmap, which this library uses to read the .safetensors file.
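Since safetensors memory-maps the checkpoint, you can probe a suspect filesystem for mmap support with the standard library alone. A small sketch (the `supports_mmap` helper is illustrative, not an official API):

```python
import mmap

def supports_mmap(path: str) -> bool:
    """Probe whether a file can be memory-mapped. safetensors' safe_open
    mmaps the checkpoint, so a filesystem that rejects mmap (e.g. some
    FUSE mounts) surfaces errors like 'OSError: No such device'."""
    with open(path, "rb") as f:
        try:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            mm.close()
            return True
        except OSError:
            return False
```

Running this against a (non-empty) file on the mount in question should reproduce the failure independently of transformers.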

jimdowling avatar Apr 08 '24 07:04 jimdowling

I had a similar problem on Lambda Labs mounted storage when loading an LLM adapter. I moved the adapter out of the mounted storage (into ~), and then everything worked fine!

Mustapha-AJEGHRIR avatar Aug 04 '24 23:08 Mustapha-AJEGHRIR

> I had a similar problem on a lambdalabs mounted storage when loading an LLM adapter. I moved the adapter out of the mounted storage (towards ~). Then all worked fine !

Yes, safetensors model files stored in certain mounted directories, such as Ceph, can hit this issue. Copying the model to the local hard drive allows it to load correctly. Additionally, saving it as a .bin model file does not encounter the issue.

jiahuanluo avatar Aug 05 '24 02:08 jiahuanluo