
MegatronModelMerger fails for VL models (Qwen3VL) with AttributeError: 'Qwen3VLConfig' object has no attribute 'num_hidden_layers'

Open Eisenhower opened this issue 3 weeks ago • 2 comments

System Info

  • verl version: 0.6.1
  • Platform: Linux
  • Python version: 3.12
  • PyTorch version: 2.5+
  • Transformers version: 4.57.1
  • CUDA version: 12.x

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Train a Qwen3-VL model (e.g. Qwen3-VL-8B) with the Megatron backend.
  2. After training finishes, call the model merger to convert the checkpoint to HuggingFace format:

     ```shell
     python3 -m verl.model_merger merge \
         --backend megatron \
         --local_dir /path/to/checkpoints/global_step_8/actor \
         --target_dir /path/to/hf_model_final
     ```

  3. The following error is raised:

     ```
     [rank0]: Traceback (most recent call last):
     [rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
     [rank0]:   File "<frozen runpy>", line 88, in _run_code
     [rank0]:   File "/path/to/verl/model_merger/main.py", line 73, in <module>
     [rank0]:     main()
     [rank0]:   File "/path/to/verl/model_merger/main.py", line 68, in main
     [rank0]:     merger.merge_and_save()
     [rank0]:   File "/path/to/verl/model_merger/megatron_model_merger.py", line 496, in merge_and_save
     [rank0]:     model_state_dict = self._load_state_dicts(model_ckpt_path)
     [rank0]:   File "/path/to/verl/model_merger/megatron_model_merger.py", line 232, in _load_state_dicts
     [rank0]:     self.pipeline_shards = get_dynamic_pipeline_shards(self.hf_config.num_hidden_layers, self.world_size)
     [rank0]: AttributeError: 'Qwen3VLConfig' object has no attribute 'num_hidden_layers'
     ```

Root Cause: In VL model configs (e.g. Qwen3VLConfig), attributes such as num_hidden_layers live in the nested text_config sub-config:

```json
{
  "model_type": "qwen3_vl",
  "text_config": {
    "num_hidden_layers": 36,
    "num_attention_heads": 32,
    "num_key_value_heads": 8
  },
  "vision_config": { ... }
}
```

The current code accesses self.hf_config.num_hidden_layers directly, which fails for VL models.

Expected behavior

The Megatron checkpoint should convert to HuggingFace format successfully for both standard LLMs and vision-language models (VL models). For VL models, MegatronModelMerger should handle the nested config structure correctly and read attributes such as num_hidden_layers from text_config.
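One way the merger could handle both config shapes is a small fallback helper; this is a sketch, not verl's actual code, and `resolve_text_config` is an illustrative name (recent transformers versions also expose a similar `PretrainedConfig.get_text_config()` method):

```python
from types import SimpleNamespace


def resolve_text_config(hf_config):
    """Return the sub-config holding LM attributes like num_hidden_layers.

    VL configs (e.g. Qwen3VLConfig) nest these under `text_config`;
    plain LLM configs carry them at the top level.
    """
    text_config = getattr(hf_config, "text_config", None)
    return text_config if text_config is not None else hf_config


# Minimal demo with stand-in configs (not real transformers objects):
vl_cfg = SimpleNamespace(text_config=SimpleNamespace(num_hidden_layers=36))
llm_cfg = SimpleNamespace(num_hidden_layers=24)
```

The merger would then call `resolve_text_config(self.hf_config).num_hidden_layers` instead of `self.hf_config.num_hidden_layers`, leaving behavior unchanged for standard LLMs.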

Eisenhower avatar Dec 01 '25 02:12 Eisenhower

+1

mangozz2019 avatar Dec 01 '25 12:12 mangozz2019

+1

yushaohan avatar Dec 04 '25 08:12 yushaohan

+1. Is there a temporary workaround for this?

FangXinyu-0913 avatar Dec 04 '25 17:12 FangXinyu-0913

Just noticed that the checkpoints saved when training Qwen3-VL with Megatron don't need merging.

(image attachment)

yushaohan avatar Dec 09 '25 08:12 yushaohan

Has this been resolved?

whu125 avatar Dec 20 '25 03:12 whu125

I got it working by simply adding a key-value pair at the top level. The original config, where VL attributes such as num_hidden_layers sit inside text_config, was:

```json
{
  "model_type": "qwen3_vl",
  "text_config": {
    "num_hidden_layers": 36,
    "num_attention_heads": 32,
    "num_key_value_heads": 8
  },
  "vision_config": { ... }
}
```

I changed it to:

```json
{
  "model_type": "qwen3_vl",
  "num_hidden_layers": 36,
  "text_config": {
    "num_hidden_layers": 36,
    "num_attention_heads": 32,
    "num_key_value_heads": 8
  },
  "vision_config": { ... }
}
```
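That manual config.json edit can be scripted. A sketch, assuming the keys to mirror are the three shown above (`promote_text_config_keys` is a hypothetical helper name; point `json.load`/`json.dump` at your checkpoint's config.json):

```python
import json


def promote_text_config_keys(
    cfg,
    keys=("num_hidden_layers", "num_attention_heads", "num_key_value_heads"),
):
    """Copy nested text_config values to the top level of a VL config dict,
    so code that expects a flat LLM config (like the merger's
    num_hidden_layers lookup) can find them. Existing top-level keys win.
    """
    text_config = cfg.get("text_config", {})
    for key in keys:
        if key in text_config:
            cfg.setdefault(key, text_config[key])
    return cfg


# Usage against a real file (path is an example):
#   with open("config.json") as f:
#       cfg = json.load(f)
#   with open("config.json", "w") as f:
#       json.dump(promote_text_config_keys(cfg), f, indent=2)
```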

WjzZwd avatar Dec 22 '25 06:12 WjzZwd