Qwen2.5-vl crash when using mcore backend

Open • yfw opened this issue 1 month ago • 1 comment

Describe the bug

Qwen2.5-VL crashes on the Megatron path after the Megatron-Bridge rebase to the main branch.

  File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/qwen_vl/modeling_qwen25_vl.py", line 193, in forward
    position_ids, rope_deltas = self.get_rope_index(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1057, in get_rope_index
    input_ids = input_ids[attention_mask[i] == 1]
                ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: too many indices for tensor of dimension 1

The cause is that Megatron-Bridge main is missing this change: https://github.com/NVIDIA-NeMo/Megatron-Bridge/commit/480bdc06091c457481920c7aba6fad195615910f#diff-31acdfd70c4d5550889bba3d1ada09e730c444e10dd1feccdde97abfd8b466eaL178-R180
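
For anyone reading the traceback: the failing line indexes a 1D row of `input_ids` with `attention_mask[i] == 1`, so the error means `attention_mask[i]` has more than one dimension at that point. The snippet below is a minimal sketch of that pattern only. The 4D mask shape `[batch, 1, seq, seq]` is an assumption for illustration (not confirmed by the linked commits), and `select_valid_tokens` is a hypothetical helper that mirrors the failing line, not a function from transformers or Megatron-Bridge.

```python
import torch


def select_valid_tokens(input_ids_row, mask_row):
    """Hypothetical helper mirroring the failing line:
    keep the token ids whose mask entry equals 1."""
    return input_ids_row[mask_row == 1]


batch, seq = 2, 4
total_input_ids = torch.arange(batch * seq).reshape(batch, seq)

# Case 1: 2D padding mask [batch, seq]. Each mask row is 1D, so boolean
# indexing into the 1D input_ids row works.
mask_2d = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
kept = select_valid_tokens(total_input_ids[0], mask_2d[0])

# Case 2 (assumed shape): 4D mask [batch, 1, seq, seq]. mask_4d[0] is 3D,
# so it cannot index a 1D tensor and PyTorch raises IndexError.
mask_4d = torch.ones(batch, 1, seq, seq, dtype=torch.long)
try:
    select_valid_tokens(total_input_ids[0], mask_4d[0])
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(kept.tolist())
print(error_message)
```

If this is the situation in the Megatron path, a quick check is to log `attention_mask.shape` just before the `get_rope_index` call in `modeling_qwen25_vl.py`.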

Steps/Code to reproduce bug

Run the examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-megatrontp2.v1.yaml recipe

Nov 26 '25 22:11 yfw

It seems this is already fixed in recent Megatron-Bridge (https://github.com/NVIDIA-NeMo/Megatron-Bridge/commit/d47152e8da4f043f367cf24698373159c017efb0), so we only need to update the Megatron-Bridge commit we're using.

Nov 26 '25 22:11 yfw