Cannot load Llama 3 with transformers==4.57.1 on non-CUDA accelerators using tensor parallelism (tp_plan="auto")
System Info
- Transformers: 4.57.1
- PyTorch: 2.9
- Hardware: non-CUDA accelerator
- OS: Linux
- Python: 3.11
Who can help?
Observed Error:
RuntimeError: Attempted to call variable.set_data(tensor), but variable and tensor have incompatible tensor type.
Root Cause:
During model loading, transformers attempts: buffer.data = buffer.to(tp_device)
Here, buffer is a CPU tensor and tp_device is a non-CUDA device. This shallow copy fails because the non-CUDA device is not treated as compatible with CPU in PyTorch's TensorImpl logic: CUDA has legacy compatibility, but other accelerators do not.
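A minimal sketch of the failing pattern outside transformers (the "xpu" device string is just a placeholder for whichever non-CUDA accelerator is in use):

```python
import torch

buf = torch.zeros(4)     # buffer starts on CPU, as during from_pretrained
moved = buf.to("xpu")    # .to() itself succeeds
buf.data = moved         # goes through Tensor.set_data; per the behavior described above,
                         # this is where the "incompatible tensor type" RuntimeError is raised
```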
Related Hugging Face discussion: #5915.
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Load meta-llama/Meta-Llama-3-8B-Instruct on a non-CUDA accelerator with tp_plan="auto":
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    tp_plan="auto",
    torch_dtype=torch.bfloat16,
)
Expected behavior
Buffers should move to tp_device without triggering incompatible tensor type errors for non-CUDA accelerators.
Possible fix:
# Post-processing for tensor parallelism
if device_mesh is not None:
    # When using TP, the device map is a single device for all parameters
    tp_device = list(device_map.values())[0]
    # This is needed for the RotaryEmbedding, which was not initialized on the correct device as it is
    # not part of the state_dict (persistent=False)
    for buffer in model.buffers():
        if buffer.device != tp_device:
            if tp_device.type == "cuda":
                buffer.data = buffer.to(tp_device)
            else:
                buffer_tmp = torch.empty_like(buffer, device=tp_device)
                buffer_tmp.copy_(buffer.to(tp_device))
                buffer = buffer_tmp
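As a variant of the same idea, the non-CUDA branch could also re-register the copied buffer on its owning module instead of rebinding the loop variable, so the model itself ends up holding the device-side tensor. A rough sketch, illustrative only and not the code transformers ships, relying on the standard nn.Module buffer registry:

```python
import torch

def move_buffers(model, tp_device):
    # Copy every buffer to tp_device and re-register it on its module,
    # avoiding Tensor.set_data (buffer.data = ...) entirely.
    for name, buf in model.named_buffers():
        if buf.device == tp_device:
            continue
        module_path, _, attr = name.rpartition(".")
        module = model.get_submodule(module_path) if module_path else model
        new_buf = torch.empty_like(buf, device=tp_device)
        new_buf.copy_(buf)
        # Preserve the persistence flag (RoPE buffers are registered with persistent=False)
        persistent = attr not in getattr(module, "_non_persistent_buffers_set", set())
        module.register_buffer(attr, new_buf, persistent=persistent)
```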
I am interested in working on this issue.
Cyril wrote that block, but he's out right now - cc @arthurzucker maybe? If you're overloaded, let me know and I can try to handle TP issues like this for a while.
I reviewed the suggested fix and can confirm that the root cause lies in the buffer.data = buffer.to(tp_device) call.
I tested a similar patch on my local setup using a non-CUDA accelerator (MPS on macOS), and the modified logic with torch.empty_like(..., device=tp_device) works as expected — no tensor type conflict occurs.
The fix seems safe and backward-compatible with CUDA setups too.
If needed, I can raise a PR to add this device-agnostic handling into the tensor parallel post-processing section.
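For reference, a quick post-load check that should pass once the buffers are handled correctly (assuming model is the object returned by from_pretrained in the reproduction above):

```python
# All buffers (e.g. the RoPE inv_freq cache) should now sit on the TP device.
buffer_devices = {b.device for b in model.buffers()}
print(buffer_devices)  # expect a single non-CPU device
```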
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Maybe cc @cyrilvallez now that he's back 😅
Hey! The issue is not about tp_device and CUDA vs. other accelerators, it's the fact that we are setting the data on the buffer itself! So the best would be to do something like:
buffer.data = buffer.data.to(tp_device)
Could you check it works on your end? 🤗
Hmm, it does not work. @Cyrilvallez
Check this:
This shallow copy (a.data = a.data.to(xxx)) fails because a non-CUDA device is not treated as compatible with CPU in PyTorch's TensorImpl logic. CUDA has legacy compatibility, but other accelerators do not.
Hmm, I tested with both MPS (Mac) and AMD GPU hardware, and it works in both cases... both with a = a.data.to() and a.data = a.data.to(). Do you have more details about your hardware and issue?