[BUG] Qwen3: model loading failed when using meta device
Describe the bug
I am running on a single node with 4 GPUs; each GPU has 24 GB of memory.
With DeepSpeed-Inference, I was trying to load Qwen/Qwen3-4B using the meta device. However, loading failed with the following error:
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
Although this small model doesn't need the meta device, my ultimate goal is to use the bigger Qwen3 models.
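(For context, a minimal standalone sketch of what this error message refers to; the toy torch.nn.Linear below is illustrative only and has nothing to do with Qwen3 or DeepSpeed.)
import torch

# Parameters created on the meta device have no storage, so .to() cannot copy them.
layer = torch.nn.Linear(8, 8, device="meta")
try:
    layer.to("cpu")  # raises: Cannot copy out of meta tensor; no data!
except NotImplementedError as e:
    print(e)

# to_empty() instead allocates uninitialized storage on the target device;
# the real weights still need to be loaded from a checkpoint afterwards.
layer = layer.to_empty(device="cpu")
print(layer.weight.shape)  # torch.Size([8, 8]), values are uninitialized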
To Reproduce
Steps to reproduce the behavior:
- Simple inference script to reproduce:
First, download Qwen/Qwen3-4B to the local directory "Qwen3-4B".
Then, put the following code snippet into "qwen3_meta_device.py":
import os
import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM
kwargs = {"torch_dtype": torch.float16}
model_config = AutoConfig.from_pretrained("./Qwen3-4B", **kwargs)
with deepspeed.OnDevice(dtype=kwargs["torch_dtype"], device="meta", enabled=True):
    model = AutoModelForCausalLM.from_config(model_config, **kwargs)
ds_inference_config = {
    "dtype": kwargs["torch_dtype"],
    "replace_with_kernel_inject": False,
    "tensor_parallel": {
        "tp_size": int(os.getenv("WORLD_SIZE", "1"))
    },
    "checkpoint": {
        "checkpoints": [
            "./Qwen3-4B/model-00001-of-00003.safetensors",
            "./Qwen3-4B/model-00002-of-00003.safetensors",
            "./Qwen3-4B/model-00003-of-00003.safetensors"
        ],
        "type": "DS_MODEL",
        "version": 1.0
    }
}
ds_engine = deepspeed.init_inference(model, config=ds_inference_config)
model = ds_engine.module
model.eval()
Finally, run "accelerate launch qwen3_meta_device.py"
-
What packages are required and their versions torch==2.5.1 transformers==4.51.3 deepspeed==0.16.7 accelerate==1.6.0
-
How to run the script Put the above code snippet into this file: qwen3_meta_device.py Then, run the following: accelerate launch qwen3_meta_device.py
-
... Expected behavior The model is expected to load successfully.
The code works fine for qwen2.5-7b-instruct (after replacing the checkpoint files in the config).
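(Side note, not from the original report: instead of hard-coding the shard names, the "checkpoints" list can be built with a glob, the same approach used in the fuller script later in this thread. The local path is hypothetical.)
import glob
import os

model_path = "./Qwen2.5-7B-Instruct"  # hypothetical local path
# Collect all safetensors shards for the "checkpoints" entry of the DeepSpeed config.
checkpoints = sorted(glob.glob(os.path.join(model_path, "*.safetensors")))
print(checkpoints)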
Hello @delock, wondering whether you could also take a look at this one? Thanks.
I hit the same error on a CPU device as well. Qwen3 meta tensor loading may not be supported yet.
[rank5]: raise NotImplementedError(
[rank5]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
Hi @songdezhao, can you check whether this branch fixes your issue? https://github.com/deepspeedai/DeepSpeed/tree/gma/enable_qwen3_meta
Hello @delock, thanks for the pull request.
I tested this and yes, it worked for the dense models (e.g., Qwen3-8B and Qwen3-32B).
However, it still failed on the MoE models with the same error, e.g., Qwen3-30B-A3B.
Hello @loadams, could we re-open this?
When I was testing the Qwen3-MoE models, I still got the same error. The dense models work fine.
@songdezhao - sure, this was closed because the PR I merged was linked. I'll re-open and we can tag @delock for additional fixes.
@songdezhao It seems that qwen3_moe uses a different norm function than qwen3. Can you try this commit to see whether it fixes the qwen3_moe error: https://github.com/deepspeedai/DeepSpeed/pull/7297/commits/12558b92f540aec04aa7835ce7df522fa68e9f25
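(A small, hedged illustration of that difference: the dense and MoE models use differently named RMSNorm classes in transformers, which matters for DeepSpeed's module matching. The local path below is hypothetical.)
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model on the meta device just to inspect which norm class it uses.
config = AutoConfig.from_pretrained("./Qwen3-30B-A3B")  # hypothetical local path
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)
print({type(m).__name__ for m in model.modules() if "Norm" in type(m).__name__})
# Expected: {'Qwen3MoeRMSNorm'} here, vs {'Qwen3RMSNorm'} for the dense Qwen3 models.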
@songdezhao - please let me know if this resolves the issue.
@ranzhejiang, thanks for the fix. I tested this commit and it worked when I used "sdpa" attention. However, when I change the attention to "flash_attention_2", I still get an error.
In my original code, I changed this line to load the model with FA2:
kwargs = {"torch_dtype": torch.float16, "attn_implementation": "flash_attention_2"}
I also updated some of the library versions:
torch==2.6.0+cu126
transformers==4.51.3
accelerate==1.6.0
deepspeed==0.16.9+12558b92
flash-attn==2.7.3
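(As a quick sanity check, not from the original comment: transformers exposes a helper to confirm it can see the flash-attn install.)
from transformers.utils import is_flash_attn_2_available
print(is_flash_attn_2_available())  # should print True with flash-attn==2.7.3 installed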
Here is the error I got. The model was indeed loaded; the error happened when I call "generate" to start generating the output:
[rank7]: Traceback (most recent call last):
[rank7]: generated_ids = model.generate(**model_inputs,
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/generation/utils.py", line 2465, in generate
[rank7]: result = self._sample(
[rank7]: ^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/generation/utils.py", line 3431, in _sample
[rank7]: outputs = self(**model_inputs, return_dict=True)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank7]: return forward_call(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/utils/generic.py", line 965, in wrapper
[rank7]: output = func(self, *args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank7]: return func(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/models/qwen3_moe/modeling_qwen3_moe.py", line 1043, in forward
[rank7]: outputs: MoeModelOutputWithPast = self.model(
[rank7]: ^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank7]: return forward_call(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/utils/generic.py", line 965, in wrapper
[rank7]: output = func(self, *args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/models/qwen3_moe/modeling_qwen3_moe.py", line 673, in forward
[rank7]: layer_outputs = decoder_layer(
[rank7]: ^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank7]: return forward_call(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/models/qwen3_moe/modeling_qwen3_moe.py", line 375, in forward
[rank7]: hidden_states, self_attn_weights = self.self_attn(
[rank7]: ^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank7]: return forward_call(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/models/qwen3_moe/modeling_qwen3_moe.py", line 206, in forward
[rank7]: attn_output, attn_weights = attention_interface(
[rank7]: ^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/integrations/flash_attention.py", line 49, in flash_attention_forward
[rank7]: attn_output = _flash_attention_forward(
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 353, in _flash_attention_forward
[rank7]: query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = _upad_input(
[rank7]: ^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/transformers/modeling_flash_attention_utils.py", line 153, in _upad_input
[rank7]: key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
[rank7]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/site-packages/flash_attn/bert_padding.py", line 17, in forward
[rank7]: return torch.gather(
[rank7]: ^^^^^^^^^^^^^
[rank7]: RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0, 128] because the unspecified dimension size -1 can be any value and is ambiguous
@loadams : please see my additional test above.
@songdezhao Can you share the result for qwen3 (dense) when you change the attention to "flash_attention_2"? From the logs, it seems that both qwen3 and qwen3_moe would hit the same error.
@songdezhao For this issue, both the qwen3 and qwen3_moe models are indeed loaded after our new commits. Your new error seems to come from additional code in your script, which needs to be debugged.
@ranzhejiang: Yes, I added the code to do the actual generation. Here it is, and here is how I ran it:
Step 1: Download Qwen3-30B-A3B to a local directory "model_to_evaluate".
Step 2: Put the following code into "test_qwen3.py".
Step 3: Run: accelerate launch test_qwen3.py
I tested it with both Qwen3-32B and Qwen3-30B-A3B. For the dense model, I got reasonable outputs but the error in my above message occurred for the MoE model.
import glob
import os
import deepspeed
import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
state = PartialState()
model_path = "./model_to_evaluate"
kwargs = {"torch_dtype": torch.float16, "attn_implementation": "flash_attention_2"}
print(f"kwargs: {kwargs}")
# load model using meta device
model_config = AutoConfig.from_pretrained(model_path, **kwargs)
with deepspeed.OnDevice(dtype=kwargs["torch_dtype"], device="meta", enabled=True):
    model = AutoModelForCausalLM.from_config(model_config, **kwargs)
print(f"model device: {next(model.parameters()).device}")
# set up deepspeed inference config
ds_inference_config = {
    "dtype": kwargs["torch_dtype"],
    # meta device is not compatible with kernel injection
    "replace_with_kernel_inject": False,
    # tp equals the global number of gpus
    "tensor_parallel": {
        "tp_size": int(os.getenv("WORLD_SIZE", "1"))
    },
    # specify where the model files are
    "checkpoint": {
        "checkpoints": glob.glob(os.path.join(model_path, "**", "*" + ".safetensors"), recursive=True),
        "type": "DS_MODEL",
        "version": 1.0
    }
}
print(f"deepspeed inference config: {ds_inference_config}")
ds_engine = deepspeed.init_inference(model, config=ds_inference_config)
model = ds_engine.module
model.eval()
print(f"model loaded on to device: {next(model.parameters()).device}")
tokenizer = AutoTokenizer.from_pretrained(model_path, add_bos_token=True, add_eos_token=False)
tokenizer.padding_side = "left"
inputs = [
"What is DeepSpeed",
"What is huggingface"
]
inputs_tokenized = [
    tokenizer.apply_chat_template([{"role": "user", "content": text}],
                                  add_generation_prompt=True,
                                  tokenize=False,
                                  enable_thinking=False)
    for text in inputs
]
model_inputs = tokenizer(inputs_tokenized, return_tensors="pt", add_special_tokens=False, padding="longest")
model_inputs = model_inputs.to(state.device)
generated_ids = model.generate(**model_inputs,
                               pad_token_id=tokenizer.pad_token_id,
                               eos_token_id=tokenizer.eos_token_id,
                               max_new_tokens=10,
                               synced_gpus=True,
                               use_cache=True,
                               do_sample=False)
prompt_length = model_inputs['input_ids'].shape[1]
batch_responses = tokenizer.batch_decode(generated_ids[:, prompt_length:], skip_special_tokens=True)
print(f"process: {state.process_index}/{state.device}, batch responses: {batch_responses}")
@songdezhao Thanks for your scripts. This error seems hard for me to debug; do you have any idea what the problem might be? I will try to find a GPU machine to work on it.
@ranzhejiang, I also tried these:
- I tried both eager and sdpa, and they both worked. So it seems there may be some incompatibility between FA2 and DeepSpeed on the Qwen3-MoE models.
- I then upgraded FA2 to the latest commit but still got the same error.
I am not sure what exactly the problem is now. It would be great if you could take a further look at this issue.
@songdezhao After debugging: your code triggered a boundary condition for qwen3-moe.
query_states.shape is torch.Size([2, 0, 17, 128]).
The root cause is that the CUDA/Triton flash-attention kernels cannot handle dimensions of size 0, whereas traditional SDPA is implemented with PyTorch's tensor operations, which handle zero-size dimensions robustly.
I will open a PR to solve it (it may need your suggestions and review). The root cause of the zero-size dimension is an auto_tp problem, which also needs a PR to fix.
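(A minimal reproduction of the two behaviors described above, using only plain PyTorch; the shapes are taken from the report, everything else is illustrative.)
import torch

# On the failing rank the tensor-parallel split left 0 attention heads.
q = torch.randn(2, 0, 17, 128)  # (batch, num_heads=0, seq_len, head_dim)
k = torch.randn(2, 0, 17, 128)
v = torch.randn(2, 0, 17, 128)

# Plain tensor-op attention tolerates zero-size dims and just returns an empty result.
scores = (q @ k.transpose(-2, -1)) / (128 ** 0.5)   # shape (2, 0, 17, 17)
out = torch.softmax(scores, dim=-1) @ v             # shape (2, 0, 17, 128)
print(out.shape)

# The flash-attention unpadding path reshapes with an inferred dimension, which is
# ambiguous for a tensor with 0 elements and raises the RuntimeError reported above.
try:
    k.reshape(-1, 0, 128)
except RuntimeError as e:
    print(e)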
Thanks a lot for looking into this. Simply curious: Is this Qwen3-MoE specific or would we expect same issue to happen for other MoE models as well?
I am not sure, but I think we need to add more checks on shape, dim, or size when using specific kernels. In any case, I need time to add these checks and debug.
@songdezhao @loadams
- For the zero-dim error in flash attention, I have opened a PR https://github.com/huggingface/transformers/pull/38280 that raises a more useful error to catch this case.
- For the auto_tp problem that causes the zero-dim error, I will debug it later.
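(For illustration only: a simplified, hypothetical version of the kind of guard such a check can add; this is not the actual code of the linked PR.)
def check_nonzero_heads(query_states, key_states):
    # Fail early with an actionable message instead of an ambiguous reshape error.
    if query_states.shape[1] == 0 or key_states.shape[1] == 0:
        raise ValueError(
            "flash_attention_2 received 0 attention heads on this rank; "
            "make sure tp_size is no larger than num_key_value_heads"
        )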
Hello, I am loading Qwen2.5-72B, and I hit the same error. Any help?
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
datasets==3.0.0
huggingface-hub==0.30.0
transformers==4.52.4
accelerate==1.7.0
trl==0.17.0
deepspeed==0.17.1
I load the model like this:
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        config=model_config,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map=None  # Don't use device_map with DeepSpeed
    )
Then I use the standard SFT tooling from transformers, and when the code hits DeepSpeed's prepare step I get the same error as above.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/project/6102313/mheuill/accelerated_ReFT/./job_main.py", line 198, in <module>
[rank2]:     main(accelerator, config)
[rank2]:   File "/project/6102313/mheuill/accelerated_ReFT/./job_main.py", line 130, in main
[rank2]:     trainer.sft_main(config, accelerator)
[rank2]:   File "/project/6102313/mheuill/accelerated_ReFT/trainers/sft_2.py", line 105, in sft_main
[rank2]:     trainer.train()
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank2]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/accelerate/accelerator.py", line 1432, in prepare
[rank2]:     result = self._prepare_deepspeed(*args)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/accelerate/accelerator.py", line 2028, in _prepare_deepspeed
[rank2]:     engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank2]:     engine = DeepSpeedEngine(args=args,
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 283, in __init__
[rank2]:     self._configure_distributed_model(model)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1272, in _configure_distributed_model
[rank2]:     self.module.to(self.device)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3851, in to
[rank2]:     return super().to(*args, **kwargs)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1343, in to
[rank2]:     return self._apply(convert)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 930, in _apply
[rank2]:     param_applied = fn(param)
[rank2]:   File "/tmp/myenv_reprod31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1336, in convert
[rank2]:     raise NotImplementedError(
[rank2]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x7a525a813c70>
Hi @MaxHeuillet, 0.17.1 already has two Qwen3 fixes, so this is probably due to a different cause. Do you have the full model script code that can reproduce the error?
@songdezhao @loadams I have found the root cause: the qwen3_moe model Qwen3-30B-A3B uses GQA with num_kv_heads = 4, which cannot be split evenly across the 8 GPUs, so some ranks get 0 KV heads while others get 1. You can only set the tensor-parallel size to a value no larger than num_kv_heads.
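(A small sketch of the uneven split described above; the exact partitioning logic inside DeepSpeed's auto_tp may differ, this only shows the arithmetic.)
# Qwen3-30B-A3B uses GQA with num_key_value_heads = 4.
num_kv_heads = 4

for tp_size in (2, 4, 8):
    heads_per_rank = [num_kv_heads // tp_size + (1 if r < num_kv_heads % tp_size else 0)
                      for r in range(tp_size)]
    print(f"tp_size={tp_size}: kv heads per rank = {heads_per_rank}")

# tp_size=2 -> [2, 2]                       OK
# tp_size=4 -> [1, 1, 1, 1]                 OK
# tp_size=8 -> [1, 1, 1, 1, 0, 0, 0, 0]     some ranks get 0 KV heads -> empty tensors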