Cannot load Llama 3 with transformers==4.57.1 on non-CUDA accelerators using tensor parallelism (tp_plan="auto")
System Info
- Transformers: 4.57.1
- PyTorch: 2.9
- Hardware: non-CUDA accelerator
- OS: Linux
- Python: 3.11
Who can help?
Observed Error:
RuntimeError: Attempted to call variable.set_data(tensor), but variable and tensor have incompatible tensor type.
Root Cause:
During model loading, transformers attempts: buffer.data = buffer.to(tp_device)
Here, buffer is a CPU tensor and tp_device is a non-CUDA device. This shallow copy fails because the non-CUDA device is not treated as compatible with CPU in PyTorch's TensorImpl logic: CUDA has legacy compatibility, but other accelerators do not.
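A minimal sketch of the failing pattern outside transformers (the "xpu" device string is just a placeholder for whichever non-CUDA accelerator is in use):

```python
import torch

buf = torch.zeros(4)     # buffer starts on CPU, as during from_pretrained
moved = buf.to("xpu")    # .to() itself succeeds
buf.data = moved         # goes through Tensor.set_data; per the behavior described above,
                         # this is where the "incompatible tensor type" RuntimeError is raised
```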
Related Hugging Face discussion: #5915.
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Load meta-llama/Meta-Llama-3-8B-Instruct on a non-CUDA accelerator with tp_plan="auto":
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    tp_plan="auto",
    torch_dtype=torch.bfloat16,
)
Expected behavior
Buffers should move to tp_device without triggering incompatible tensor type errors for non-CUDA accelerators.
Possible fix:
# Post-processing for tensor parallelism
if device_mesh is not None:
    # When using TP, the device map is a single device for all parameters
    tp_device = list(device_map.values())[0]
    # This is needed for the RotaryEmbedding, which was not initialized on the correct device as it is
    # not part of the state_dict (persistent=False)
    for buffer in model.buffers():
        if buffer.device != tp_device:
            if tp_device.type == "cuda":
                buffer.data = buffer.to(tp_device)
            else:
                buffer_tmp = torch.empty_like(buffer, device=tp_device)
                buffer_tmp.copy_(buffer.to(tp_device))
                buffer = buffer_tmp
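As a variant of the same idea, the non-CUDA branch could also re-register the copied buffer on its owning module instead of rebinding the loop variable, so the model itself ends up holding the device-side tensor. A rough sketch, illustrative only and not the code transformers ships, relying on the standard nn.Module buffer registry:

```python
import torch

def move_buffers(model, tp_device):
    # Copy every buffer to tp_device and re-register it on its module,
    # avoiding Tensor.set_data (buffer.data = ...) entirely.
    for name, buf in model.named_buffers():
        if buf.device == tp_device:
            continue
        module_path, _, attr = name.rpartition(".")
        module = model.get_submodule(module_path) if module_path else model
        new_buf = torch.empty_like(buf, device=tp_device)
        new_buf.copy_(buf)
        # Preserve the persistence flag (RoPE buffers are registered with persistent=False)
        persistent = attr not in getattr(module, "_non_persistent_buffers_set", set())
        module.register_buffer(attr, new_buf, persistent=persistent)
```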
I am interested in working on this issue.
Cyril wrote that block, but he's out right now - cc @arthurzucker maybe? If you're overloaded, let me know and I can try to handle TP issues like this for a while.
I reviewed the suggested fix and can confirm that the root cause lies in the buffer.data = buffer.to(tp_device) call.
I tested a similar patch on my local setup using a non-CUDA accelerator (MPS on macOS), and the modified logic with torch.empty_like(..., device=tp_device) works as expected — no tensor type conflict occurs.
The fix seems safe and backward-compatible with CUDA setups too.
If needed, I can raise a PR to add this device-agnostic handling into the tensor parallel post-processing section.
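For reference, a quick post-load check that should pass once the buffers are handled correctly (assuming model is the object returned by from_pretrained in the reproduction above):

```python
# All buffers (e.g. the RoPE inv_freq cache) should now sit on the TP device.
buffer_devices = {b.device for b in model.buffers()}
print(buffer_devices)  # expect a single non-CPU device
```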
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Maybe cc @cyrilvallez now that he's back 😅
Hey! The issue is not about tp_device and CUDA vs. other accelerators, it's the fact that we are setting the data on the buffer itself! So the best would be to do something like:
buffer.data = buffer.data.to(tp_device)
Could you check it works on your end? 🤗
Hmm, it does not work. @Cyrilvallez
Check this:
This shallow copy (a.data = a.data.to(xxx)) fails because a non-CUDA device is not treated as compatible with CPU in PyTorch's TensorImpl logic. CUDA has legacy compatibility, but other accelerators do not.
Hmm, I tested with both MPS (Mac) and AMD GPU hardware, and it works in both cases... both with a = a.data.to() and a.data = a.data.to(). Do you have more details about your hardware and issue?