transformers
Error while moving model to GPU `NotImplementedError: Cannot copy out of meta tensor; no data!`
System Info
- transformers version: 4.40.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- Tensorflow version (GPU?): 2.16.1 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.2 (cpu)
- Jax version: 0.4.26
- JaxLib version: 0.4.21
Who can help?
@ArthurZucker
@sgugger, since I see some implementations of this inside accelerate to skip initialization.
Reproduction
import torch
from transformers import LlamaConfig, LlamaForCausalLM

c = LlamaConfig(<path to config.json>)
with torch.device('meta'):
    m = LlamaForCausalLM(c)
w = torch.load(<path to weights.bin file>)
m.load_state_dict(w, assign=True)
m.to("cuda:0")  # throws error
The last line throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/goelayus/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2692, in to
return super().to(*args, **kwargs)
File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 849, in _apply
self._buffers[key] = fn(buf)
File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
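For context, this error is not specific to transformers: moving any tensor created on the meta device raises it, since meta tensors carry shape and dtype but no underlying storage to copy. A minimal reproduction in plain PyTorch:

```python
import torch

# Tensors created under the meta device have shapes and dtypes but no data.
with torch.device("meta"):
    t = torch.ones(2)

try:
    t.to("cpu")
except NotImplementedError as e:
    # Message begins with: Cannot copy out of meta tensor; no data!
    print(e)
```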
Expected behavior
The model should be copied to the GPU device.
To add to the above, if I use init_empty_weights from accelerate, I can skip the initialization without any errors. What is the difference between the two? And is it possible to achieve the same thing using the torch.device('meta') context manager?
Mmmm could you make sure that the map_location is correct?
This might be expected, cc @SunMarc WDYT?
So this issue seems to be documented in the code itself (big_modeling.py); it turns out you can't run model.to when using the meta device. I was hoping for some explanation of why that is the case. (Hence I tagged @sgugger, since the big_modeling.py file seems to be modified often by them.)
Also, as I noted in my comment above, replacing torch.device('meta') with init_empty_weights from the accelerate package seems to resolve the issue.
cc @muellerzr for the accelerate related stuff rather than Sylvain!
Hi @goelayu, this is expected, since torch.device('meta') also puts the buffers on the meta device. However, non-persistent buffers are not saved in the state_dict. So, in the case of a Llama model, where we do have non-persistent buffers, you get an error after loading the weights. With init_empty_weights, by default, we don't put the buffers on the meta device. This is why it works. Hope that is clearer!
@SunMarc thanks for the response, that answers my question.