
Qwen3-30B issue: AttributeError: 'MoELayer' object has no attribute 'linear_fc1'

Open cxxuser opened this issue 6 months ago • 11 comments

When I run the qwen3 script, the following error message appears. My megatron version is v0.11, vllm is v0.8.2. Should I upgrade them? Or do I need to make other changes?

File "/share/cxx/verl/verl/single_controller/ray/base.py", line 663, in func
  return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/share/cxx/verl/verl/single_controller/base/decorator.py", line 540, in inner
  return func(*args, **kwargs)
File "/share/cxx/verl/verl/workers/megatron_workers.py", line 386, in init_model
  self.ref_module, self.ref_model_config = self._build_model_optimizer(
File "/share/cxx/verl/verl/workers/megatron_workers.py", line 206, in _build_model_optimizer
  load_megatron_gptmodel_weights(self.config, self.hf_config, ref_module, params_dtype=self.dtype, is_value_model=False)
File "/share/cxx/verl/verl/utils/model.py", line 378, in load_megatron_gptmodel_weights
  load_state_dict_to_megatron_gptmodel(
File "/share/cxx/verl/verl/models/mcore/loader.py", line 420, in load_state_dict_to_megatron_gptmodel
  sync_layer.mlp.linear_fc1.layer_norm_weight if dst_pp_rank == pp_rank else None,
File "/opt/conda/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
  raise AttributeError(
AttributeError: 'MoELayer' object has no attribute 'linear_fc1'

cxxuser avatar Jun 18 '25 08:06 cxxuser

@cxxuser This seems to be a bug related to mcore v0.11 (MoE not fully supported there). Could you try the latest mcore (v0.12.1)?

ETOgaosion avatar Jun 23 '25 03:06 ETOgaosion

Regarding sync_layer.mlp.linear_fc1.layer_norm_weight if dst_pp_rank == pp_rank else None: Qwen3-MoE does not have linear_fc1, and this part still has problems even if I don't use the dist checkpoint to load.
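
For context, the failing line in loader.py assumes a dense MLP: on dense layers the pre-MLP norm weight is fused into linear_fc1, while on MoE layers mlp is a Megatron-Core MoELayer (router + experts) with no linear_fc1 attribute at all. A minimal sketch of that distinction, assuming only the Megatron-Core MoELayer class; the guard itself is illustrative, not the actual verl fix:

    # Minimal sketch: why loader.py's attribute access fails for Qwen3-MoE.
    # MoELayer is a real Megatron-Core class; the guard below is illustrative only.
    from megatron.core.transformer.moe.moe_layer import MoELayer

    def input_layernorm_weight(sync_layer):
        mlp = sync_layer.mlp
        if isinstance(mlp, MoELayer):
            # MoE layers keep their weights under mlp.router / mlp.experts,
            # so there is no fused linear_fc1.layer_norm_weight to read here.
            return None
        # Dense layers fuse the pre-MLP norm into linear_fc1 (layer_norm_weight).
        return mlp.linear_fc1.layer_norm_weight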

cxxuser avatar Jun 23 '25 08:06 cxxuser

@cxxuser Currently MoE models do not support HuggingFace loading, so you can either:

  1. convert the HuggingFace model to a dist_checkpoint with our provided script converter_hf_to_mcore.py, or
  2. test the mbridge integration #2064

ETOgaosion avatar Jun 23 '25 11:06 ETOgaosion

> @cxxuser Currently MoE models do not support HuggingFace loading, so you can either:
>
>   1. convert the HuggingFace model to a dist_checkpoint with our provided script converter_hf_to_mcore.py, or
>   2. test the mbridge integration [megatron] feat: use mbridge as megatron adaptor #2064

When I convert the HF model to an mcore model, how should I set it up? When I set actor_rollout_ref.model.path=mcore_core_path, it reports errors.

lilei199908 avatar Jun 30 '25 04:06 lilei199908

> @cxxuser Currently MoE models do not support HuggingFace loading, so you can either:
>
>   1. convert the HuggingFace model to a dist_checkpoint with our provided script converter_hf_to_mcore.py, or
>   2. test the mbridge integration [megatron] feat: use mbridge as megatron adaptor #2064
>
> When I convert the HF model to an mcore model, how should I set it up? When I set actor_rollout_ref.model.path=mcore_core_path, it reports errors.

    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \

Yangruipis avatar Jun 30 '25 05:06 Yangruipis

> @cxxuser Currently MoE models do not support HuggingFace loading, so you can either:
>
>   1. convert the HuggingFace model to a dist_checkpoint with our provided script converter_hf_to_mcore.py, or
>   2. test the mbridge integration [megatron] feat: use mbridge as megatron adaptor #2064
>
> When I convert the HF model to an mcore model, how should I set it up? When I set actor_rollout_ref.model.path=mcore_core_path, it reports errors.
>
>     actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
>     actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
>     actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
>     actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \

Thanks. When I use TP, do I need to convert the model to an mcore model with TP? Also, I find that the verl script converter_hf_to_mcore.py doesn't support any parallel strategy.

lilei199908 avatar Jun 30 '25 05:06 lilei199908

> @cxxuser Currently MoE models do not support HuggingFace loading, so you can either:
>
>   1. convert the HuggingFace model to a dist_checkpoint with our provided script converter_hf_to_mcore.py, or
>   2. test the mbridge integration [megatron] feat: use mbridge as megatron adaptor #2064
>
> When I convert the HF model to an mcore model, how should I set it up? When I set actor_rollout_ref.model.path=mcore_core_path, it reports errors.
>
>     actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
>     actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
>     actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
>     actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
>
> Thanks. When I use TP, do I need to convert the model to an mcore model with TP? Also, I find that the verl script converter_hf_to_mcore.py doesn't support any parallel strategy.

No need, we will reshard automatically.
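
For reference, the reason no TP-specific conversion is needed: Megatron-Core dist checkpoints store globally-shaped sharded tensors, and each rank re-slices them at load time for whatever TP/PP/EP layout the job was launched with. A rough sketch of that load path, assuming a Megatron-Core module already built under the new parallel config (this is not verl's exact loading code):

    # Rough sketch of Megatron-Core dist-checkpoint loading; model construction
    # is omitted, and this is only the underlying idea, not verl's code path.
    from megatron.core import dist_checkpointing

    def load_dist_ckpt(model, ckpt_dir: str):
        # sharded_state_dict() describes each tensor's global shape plus the
        # slice this rank owns under the current TP/PP/EP configuration.
        sharded_sd = model.sharded_state_dict()
        # load() reads the saved global tensors and returns only this rank's
        # slice, so a checkpoint written with TP=1 can be loaded into TP=4.
        state_dict = dist_checkpointing.load(sharded_sd, ckpt_dir)
        model.load_state_dict(state_dict)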

Yangruipis avatar Jun 30 '25 05:06 Yangruipis

I encountered the same problem when changing the weights. Even when I didn't change the weights, I still couldn't load the model. Case 1:

File "/verl/verl/single_controller/ray/base.py", line 704, in func
  return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/verl/verl/single_controller/base/decorator.py", line 548, in inner
  return func(*args, **kwargs)
File "/verl/verl/workers/megatron_workers.py", line 455, in init_model
  self.ref_module, self.ref_model_config = self._build_model_optimizer(
File "/verl/verl/workers/megatron_workers.py", line 251, in _build_model_optimizer
  load_megatron_gptmodel_weights(
File "/verl/verl/utils/model.py", line 494, in load_megatron_gptmodel_weights
  load_state_dict_to_megatron_gptmodel(
File "/verl/verl/models/mcore/loader.py", line 441, in load_state_dict_to_megatron_gptmodel
  sync_layer.mlp.linear_fc1.layer_norm_weight if dst_pp_rank == pp_rank else None,
File "miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
  raise AttributeError(
AttributeError: 'MoELayer' object has no attribute 'linear_fc1'

Case 2:

bash convert_mscore.sh
Qwen3MoeConfig {
  "architectures": [
    "Qwen3MoeForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "decoder_sparse_step": 1,
  "eos_token_id": 151643,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 6144,
  "max_position_embeddings": 32768,
  "max_window_layers": 48,
  "mlp_only_layers": [],
  "model_type": "qwen3_moe",
  "moe_intermediate_size": 768,
  "norm_topk_prob": true,
  "num_attention_heads": 32,
  "num_experts": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 48,
  "num_key_value_heads": 4,
  "output_router_logits": false,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "router_aux_loss_coef": 0.001,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

Pipeline shards: [48]
Overridden TF init config: {'num_layers': 48, 'hidden_size': 2048, 'num_attention_heads': 32, 'num_query_groups': 4, 'ffn_hidden_size': 6144, 'attention_dropout': 0.0, 'hidden_dropout': 0.0, 'kv_channels': 128, 'layernorm_epsilon': 1e-06, 'add_bias_linear': False, 'activation_func': <function silu at 0x7f730c1977f0>, 'normalization': 'RMSNorm', 'gated_linear_unit': True, 'pipeline_dtype': torch.bfloat16, 'params_dtype': torch.bfloat16, 'bf16': True, 'tensor_model_parallel_size': 1, 'pipeline_model_parallel_size': 1, 'expert_model_parallel_size': 1, 'expert_tensor_parallel_size': 1, 'virtual_pipeline_model_parallel_size': None, 'context_parallel_size': 1, 'overlap_p2p_comm': False, 'batch_p2p_comm': False, 'sequence_parallel': False, 'variable_seq_lengths': True, 'masked_softmax_fusion': True, 'moe_token_dispatcher_type': 'alltoall', 'use_cpu_initialization': False, 'moe_ffn_hidden_size': 768, 'moe_router_bias_update_rate': 0.001, 'moe_router_topk': 8, 'num_moe_experts': 128, 'moe_aux_loss_coeff': 0.001, 'moe_router_load_balancing_type': 'none', 'moe_grouped_gemm': True, 'moe_router_score_function': 'softmax', 'persist_layer_norm': True, 'bias_activation_fusion': True, 'bias_dropout_fusion': True, 'moe_router_pre_softmax': False, 'qk_layernorm': True, 'num_layers_in_first_pipeline_stage': None, 'num_layers_in_last_pipeline_stage': None}
/Megatron-LM/megatron/core/transformer/transformer_config.py:1251: UserWarning: Using a large number of experts (e.g. >=32) without fp32 routing. Consider enabling moe_router_dtype for better numerical stability.
  warnings.warn(
/Megatron-LM/megatron/core/extensions/transformer_engine_spec_provider.py:69: UserWarning: The legacy GroupedMLP will be deprecated in Megatron-Core v0.12.0. Please update the TransformerEngine to version>=1.7.0 and use TEGroupedMLP.
  warnings.warn(
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 30532122624
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:09<00:00,  1.63it/s]
[WARNING] Converting GQA model
[rank0]: Traceback (most recent call last):
[rank0]:   File "/overthink/verl/scripts/converter_hf_to_mcore.py", line 544, in <module>
[rank0]:     convert_hf_to_mcore(
[rank0]:   File "/overthink/verl/scripts/converter_hf_to_mcore.py", line 519, in convert_hf_to_mcore
[rank0]:     convert_checkpoint_from_transformers_to_megatron(hf_model, model[0].module, hf_config)
[rank0]:   File "/cpfs01/shared/llm_ddd/zhangjin/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/overthink/verl/scripts/converter_hf_to_mcore.py", line 168, in convert_checkpoint_from_transformers_to_megatron
[rank0]:     numel += safe_copy(fc1_weight, layer.mlp.experts.linear_fc1._parameters[f"weight{idx}"])
[rank0]:   File "miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: 'GroupedMLP' object has no attribute 'linear_fc1'
[rank0]:[W717 16:24:07.987314227 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
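
Case 2 fails for a related reason: with the legacy (pre-TE-1.7) GroupedMLP that the deprecation warning above mentions, all experts' fc1 weights are fused into one parameter, so there is no per-expert linear_fc1 for the converter to index into. A minimal sketch of the layout difference, assuming Megatron-Core's TEGroupedMLP/GroupedMLP naming (illustrative only, not the converter's actual fix):

    # Illustrative only: why converter_hf_to_mcore.py trips over GroupedMLP.
    def copy_expert_fc1(layer, idx, fc1_weight, safe_copy):
        experts = layer.mlp.experts
        if hasattr(experts, "linear_fc1"):
            # TEGroupedMLP (TE >= 1.7): per-expert weights are registered on
            # linear_fc1 as weight0, weight1, ... which the converter expects.
            return safe_copy(fc1_weight, experts.linear_fc1._parameters[f"weight{idx}"])
        # Legacy GroupedMLP keeps every expert fused in a single weight1 tensor,
        # so there is no linear_fc1 attribute at all; per the warning above,
        # updating TransformerEngine so TEGroupedMLP is used sidesteps this.
        raise AttributeError("legacy GroupedMLP layout is not handled by this path")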

With megatron-core 0.14.0rc2, I have additionally discovered that if transformer-engine is not updated past version 1.9.0, the following branch in mcore is triggered:

    def layer_norm(self, rms_norm: bool = False, for_qk: bool = False) -> type:
        """Which module to use for layer norm"""
        if for_qk and not is_te_min_version("1.9.0"):
            # TENorm significantly harms convergence when used
            # for QKLayerNorm if TE Version < 1.9;
            # we instead use the Apex implementation.
            return FusedLayerNorm
        return TENorm
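
A quick way to check locally which branch you will hit; a small sketch that only assumes transformer_engine exposes __version__ and that the packaging library is installed, with the 1.9.0 threshold taken from the mcore snippet above:

    # Check the installed TransformerEngine version against the 1.9.0 threshold
    # used by mcore's layer_norm() selection shown above.
    import transformer_engine as te
    from packaging.version import Version

    if Version(te.__version__) < Version("1.9.0"):
        print(f"TE {te.__version__}: mcore falls back to Apex FusedLayerNorm for qk-norm")
    else:
        print(f"TE {te.__version__}: mcore selects TENorm for qk-norm")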

I didn't find any explanation of this in the documentation. Qwen3-MoE uses RMSNorm for its qk-norm, but mcore tries to use FusedLayerNorm instead.

Anyway, I would like to know if there is any way to solve the linear_fc1 problem.

JinZhang-21 avatar Jul 17 '25 08:07 JinZhang-21

same issue!

luzengxiangcn avatar Nov 12 '25 06:11 luzengxiangcn

same issue

zhenghaoxu-gatech avatar Nov 22 '25 13:11 zhenghaoxu-gatech

same

DaizeDong avatar Dec 02 '25 07:12 DaizeDong