
Error while moving model to GPU `NotImplementedError: Cannot copy out of meta tensor; no data!`

Open goelayu opened this issue 9 months ago • 4 comments

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): 2.16.1 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.2 (cpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.21

Who can help?

@ArthurZucker @sgugger since I see some implementations of this inside accelerate to skip initialization.

Reproduction

import torch
from transformers import LlamaConfig, LlamaForCausalLM

c = LlamaConfig.from_json_file(<path to config.json>)
with torch.device("meta"):
    m = LlamaForCausalLM(c)

w = torch.load(<path to weights.bin file>)
m.load_state_dict(w, assign=True)
m.to("cuda:0")  # throws error

The last line throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/goelayus/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2692, in to
    return super().to(*args, **kwargs)
  File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 849, in _apply
    self._buffers[key] = fn(buf)
  File "/home/goelayus/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

Expected behavior

The model should be copied to the GPU device.

goelayu avatar May 07 '24 22:05 goelayu

To add to the above: if I use `init_empty_weights` from accelerate, I can skip the initialization without any errors.

Wondering what the difference between the two is? Also, is it possible to achieve the same thing using the `torch.device('meta')` context manager?

goelayu avatar May 07 '24 23:05 goelayu

Mmmm could you make sure that the `map_location` is correct? This might be expected, cc @SunMarc WDYT?

ArthurZucker avatar May 09 '24 14:05 ArthurZucker

So this issue seems to be documented in the code itself, in `big_modeling.py`: it turns out you can't run `model.to` when using the meta device. I was hoping for some explanation of why that is the case (hence I tagged @sgugger, since the `big_modeling.py` file seems to be modified often by them).

Also, as I noted in my comment above, replacing `torch.device('meta')` with `init_empty_weights` from the accelerate package seems to resolve the issue.

goelayu avatar May 09 '24 17:05 goelayu

cc @muellerzr for the accelerate related stuff rather than Sylvain!

ArthurZucker avatar May 10 '24 06:05 ArthurZucker

Hi @goelayu, this is expected: `with torch.device('meta')` also puts the buffers on the meta device, and non-persistent buffers are not saved in the state_dict. So in the case of a Llama model, which does have non-persistent buffers, you get an error after loading the weights. With `init_empty_weights`, by default, we don't put the buffers on the meta device, which is why it works. Hope that is clearer!

SunMarc avatar May 13 '24 12:05 SunMarc

@SunMarc thanks for the response, that answers my question.

goelayu avatar May 17 '24 18:05 goelayu