Qwen3-30B issue: AttributeError: 'MoELayer' object has no attribute 'linear_fc1'
When I run the Qwen3 script, the following error message appears. My Megatron version is v0.11 and vLLM is v0.8.2. Should I upgrade them, or do I need to make other changes?
File "/share/cxx/verl/verl/single_controller/ray/base.py", line 663, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/share/cxx/verl/verl/single_controller/base/decorator.py", line 540, in inner
return func(*args, **kwargs)
File "/share/cxx/verl/verl/workers/megatron_workers.py", line 386, in init_model
self.ref_module, self.ref_model_config = self._build_model_optimizer(
File "/share/cxx/verl/verl/workers/megatron_workers.py", line 206, in _build_model_optimizer
load_megatron_gptmodel_weights(self.config, self.hf_config, ref_module, params_dtype=self.dtype, is_value_model=False)
File "/share/cxx/verl/verl/utils/model.py", line 378, in load_megatron_gptmodel_weights
load_state_dict_to_megatron_gptmodel(
File "/share/cxx/verl/verl/models/mcore/loader.py", line 420, in load_state_dict_to_megatron_gptmodel
sync_layer.mlp.linear_fc1.layer_norm_weight if dst_pp_rank == pp_rank else None,
File "/opt/conda/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
raise AttributeError(
AttributeError: 'MoELayer' object has no attribute 'linear_fc1'
@cxxuser This seems to be a bug related to mcore v0.11 (MoE is not fully supported there). Could you try the latest mcore (v0.12.1)?
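One possible upgrade path, assuming megatron-core was installed from PyPI (checking out the matching tag of the Megatron-LM repo also works if you install it from source):

# Hypothetical upgrade command; adjust to however Megatron-Core is installed in your image.
pip install -U "megatron-core==0.12.1"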
sync_layer.mlp.linear_fc1.layer_norm_weight if dst_pp_rank == pp_rank else None,
Qwen3-MoE does not have linear_fc1, so this part still has problems when I don't use dist checkpointing to load.
@cxxuser Currently MoE models do not support HuggingFace loading:
- you can convert the HuggingFace model to a dist_checkpoint with our provided script convertor_hf_to_mcore_models.py (see the example invocation after this list)
- you can test the mbridge integration: [megatron] feat: use mbridge as megatron adaptor #2064
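A minimal conversion sketch, assuming the converter lives at scripts/converter_hf_to_mcore.py (the path that appears in the traceback later in this thread) and takes --hf_model_path and --output_path arguments; the flag names are assumptions, so check the script's argparse in your checkout:

# Hypothetical invocation with placeholder paths; verify the flag names against the script.
python scripts/converter_hf_to_mcore.py \
    --hf_model_path /path/to/hf_model \
    --output_path /path/to/qwen3_moe_dist_ckpt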
When I convert the HF model to an mcore model, how should I set it? When I set actor_rollout_ref.model.path=mcore_core_path, it reports errors.
actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
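For the errors when pointing actor_rollout_ref.model.path at the converted checkpoint: model.path should keep pointing at the original HuggingFace directory (it is still read for the tokenizer and hf_config), and only dist_checkpointing_path should point at the converted output. A sketch of the combined overrides with placeholder paths:

# Sketch only; keep model.path on the HF checkpoint, not on the converted one.
HF_MODEL_PATH=/path/to/hf_model     # original HuggingFace checkpoint
DIST_CKPT_PATH=/path/to/dist_ckpt   # output of the converter script

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=$HF_MODEL_PATH \
    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH
# ...plus your usual data/trainer/rollout overrides.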
Thanks. When I use TP, do I need to convert it to the mcore model with TP? I also find that the verl script convertor_hf_to_mcore_models.py doesn't support any parallel strategy.
No need, we will reshard automatically.
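In other words, the checkpoint only needs to be converted once with no parallelism settings; the training job reshards it to whatever parallelism you configure. A sketch of the relevant overrides (the exact key names may differ slightly across verl versions, so check the megatron section of your config; the sizes below are placeholders):

# Parallelism is chosen at training time, not at conversion time.
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \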
I encountered the same problem when converting the weights, and even if I don't convert the weights I still can't load the model. Case 1:
File "/verl/verl/single_controller/ray/base.py", line 704, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
File "/verl/verl/single_controller/base/decorator.py", line 548, in inner
return func(*args, **kwargs)
File "/verl/verl/workers/megatron_workers.py", line 455, in init_model
self.ref_module, self.ref_model_config = self._build_model_optimizer(
File "/verl/verl/workers/megatron_workers.py", line 251, in _build_model_optimizer
load_megatron_gptmodel_weights(
File "/verl/verl/utils/model.py", line 494, in load_megatron_gptmodel_weights
load_state_dict_to_megatron_gptmodel(
File "/verl/verl/models/mcore/loader.py", line 441, in load_state_dict_to_megatron_gptmodel
sync_layer.mlp.linear_fc1.layer_norm_weight if dst_pp_rank == pp_rank else None,
File "miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
raise AttributeError(
AttributeError: 'MoELayer' object has no attribute 'linear_fc1'
Case 2:
bash convert_mscore.sh
Qwen3MoeConfig {
"architectures": [
"Qwen3MoeForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"decoder_sparse_step": 1,
"eos_token_id": 151643,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 6144,
"max_position_embeddings": 32768,
"max_window_layers": 48,
"mlp_only_layers": [],
"model_type": "qwen3_moe",
"moe_intermediate_size": 768,
"norm_topk_prob": true,
"num_attention_heads": 32,
"num_experts": 128,
"num_experts_per_tok": 8,
"num_hidden_layers": 48,
"num_key_value_heads": 4,
"output_router_logits": false,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"router_aux_loss_coef": 0.001,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.53.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
Pipeline shards: [48]
Overridden TF init config: {'num_layers': 48, 'hidden_size': 2048, 'num_attention_heads': 32, 'num_query_groups': 4, 'ffn_hidden_size': 6144, 'attention_dropout': 0.0, 'hidden_dropout': 0.0, 'kv_channels': 128, 'layernorm_epsilon': 1e-06, 'add_bias_linear': False, 'activation_func': <function silu at 0x7f730c1977f0>, 'normalization': 'RMSNorm', 'gated_linear_unit': True, 'pipeline_dtype': torch.bfloat16, 'params_dtype': torch.bfloat16, 'bf16': True, 'tensor_model_parallel_size': 1, 'pipeline_model_parallel_size': 1, 'expert_model_parallel_size': 1, 'expert_tensor_parallel_size': 1, 'virtual_pipeline_model_parallel_size': None, 'context_parallel_size': 1, 'overlap_p2p_comm': False, 'batch_p2p_comm': False, 'sequence_parallel': False, 'variable_seq_lengths': True, 'masked_softmax_fusion': True, 'moe_token_dispatcher_type': 'alltoall', 'use_cpu_initialization': False, 'moe_ffn_hidden_size': 768, 'moe_router_bias_update_rate': 0.001, 'moe_router_topk': 8, 'num_moe_experts': 128, 'moe_aux_loss_coeff': 0.001, 'moe_router_load_balancing_type': 'none', 'moe_grouped_gemm': True, 'moe_router_score_function': 'softmax', 'persist_layer_norm': True, 'bias_activation_fusion': True, 'bias_dropout_fusion': True, 'moe_router_pre_softmax': False, 'qk_layernorm': True, 'num_layers_in_first_pipeline_stage': None, 'num_layers_in_last_pipeline_stage': None}
/Megatron-LM/megatron/core/transformer/transformer_config.py:1251: UserWarning: Using a large number of experts (e.g. >=32) without fp32 routing. Consider enabling moe_router_dtype for better numerical stability.
warnings.warn(
/Megatron-LM/megatron/core/extensions/transformer_engine_spec_provider.py:69: UserWarning: The legacy GroupedMLP will be deprecated in Megatron-Core v0.12.0. Please update the TransformerEngine to version>=1.7.0 and use TEGroupedMLP.
warnings.warn(
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 30532122624
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:09<00:00, 1.63it/s]
[WARNING] Converting GQA model
[rank0]: Traceback (most recent call last):
[rank0]: File "/overthink/verl/scripts/converter_hf_to_mcore.py", line 544, in <module>
[rank0]: convert_hf_to_mcore(
[rank0]: File "/overthink/verl/scripts/converter_hf_to_mcore.py", line 519, in convert_hf_to_mcore
[rank0]: convert_checkpoint_from_transformers_to_megatron(hf_model, model[0].module, hf_config)
[rank0]: File "/cpfs01/shared/llm_ddd/zhangjin/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/overthink/verl/scripts/converter_hf_to_mcore.py", line 168, in convert_checkpoint_from_transformers_to_megatron
[rank0]: numel += safe_copy(fc1_weight, layer.mlp.experts.linear_fc1._parameters[f"weight{idx}"])
[rank0]: File "miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank0]: raise AttributeError(
[rank0]: AttributeError: 'GroupedMLP' object has no attribute 'linear_fc1'
[rank0]:[W717 16:24:07.987314227 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
megatron-core 0.14.0rc2. Additionally, I have discovered that if transformer-engine is not updated to at least version 1.9.0, the following branch in mcore is triggered:
def layer_norm(self, rms_norm: bool = False, for_qk: bool = False) -> type:
    """Which module to use for layer norm"""
    if for_qk and not is_te_min_version("1.9.0"):
        # TENorm significantly harms convergence when used
        # for QKLayerNorm if TE Version < 1.9;
        # we instead use the Apex implementation.
        return FusedLayerNorm
    return TENorm
I didn't find any explanation of this when reviewing the docs. Qwen3-MoE uses RMSNorm for qk-norm, yet it tries to use FusedLayerNorm instead.
Anyway, I would like to know whether there is any way to solve the linear_fc1 problem.
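As a quick sanity check for both symptoms above (the legacy GroupedMLP warning during conversion and the QK-norm FusedLayerNorm fallback), you can print the installed Transformer Engine version; this is only a sketch, and the thresholds come from the warning and the mcore snippet quoted above:

# TE >= 1.7.0 enables TEGroupedMLP (which does have linear_fc1, unlike the legacy GroupedMLP);
# TE >= 1.9.0 lets mcore use TENorm for the QK layernorm instead of falling back to FusedLayerNorm.
python -c "import transformer_engine; print(transformer_engine.__version__)"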
same issue!
same issue
same