Question about Multimodal Rotary Position Embedding (M-ROPE) when exporting MNN models

In llmexport.py I could not find any handling for Multimodal Rotary Position Embedding (M-ROPE). All position ids are generated like this:

    def get_position_ids(self) -> torch.Tensor:
        if self.model_type == 'chatglm':
            return self.chatglm_position_ids()
        if self.token_len:
            return torch.tensor([[self.seq_len - 1]], dtype=torch.int)
        return torch.arange(self.seq_len, dtype=torch.int).unsqueeze(0)

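However, a standard implementation should produce 3D position ids, as shown in the figure below.

Does the current handling affect the results for qwen2vl / qwen2.5vl? Or is M-ROPE handled somewhere else?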
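For readers following along, here is a minimal sketch of what the two non-chatglm branches of `get_position_ids` above return. It uses plain Python lists instead of torch tensors so that it runs without dependencies. The helper name `text_position_ids` is hypothetical, and reading a nonzero `token_len` as the decode step is an assumption based on the quoted code.

```python
def text_position_ids(seq_len: int, token_len: int) -> list[list[int]]:
    """Mirror of the two non-chatglm branches quoted above (hypothetical helper)."""
    if token_len:
        # Assumed decode step: only the newest token gets an id.
        return [[seq_len - 1]]
    # Assumed prefill step: one row of ids 0 .. seq_len - 1.
    return [list(range(seq_len))]


prefill_ids = text_position_ids(seq_len=10, token_len=0)
decode_ids = text_position_ids(seq_len=11, token_len=1)
print(prefill_ids)  # a single row of 10 ids
print(decode_ids)   # a single row holding one id
```

In both cases there is one row per sequence, whereas the 3D layout asked about here needs three rows (temporal, height, width).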

Mar 25 '25 07:03 cloudyuyuyu

This is already handled. You can look for it in llmexport.py and vision.py.

Apr 08 '25 11:04 jxt1234

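We checked again, and M-ROPE is indeed not implemented. M-ROPE belongs to module D: when the text and image embeddings are merged, the position ids are rebuilt as three-dimensional vectors [t, h, w].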

For example, with prompt_head_len = 5 and an image encoded into 2 * 2 = 4 tokens:

    import torch

    # Start from plain 1D ids copied onto the three axes [t, h, w]; shape (3, 1, 10).
    position_ids = torch.arange(0, 10, dtype=torch.float16).repeat(3, 1, 1)
    # tensor([[[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]]], dtype=torch.float16)

    # Temporal axis: all four image tokens share id 5.
    position_ids[0, :, 5:9] = 5
    # tensor([[[0., 1., 2., 3., 4., 5., 5., 5., 5., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]]], dtype=torch.float16)

    # Height axis: first image row gets 5, second row gets 6.
    position_ids[1, :, 5:7] = 5
    position_ids[1, :, 7:9] = 6
    # tensor([[[0., 1., 2., 3., 4., 5., 5., 5., 5., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 5., 6., 6., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]]], dtype=torch.float16)

The final position ids are as follows:

    # Width axis: columns 5 and 6 repeat for each image row.
    position_ids[2, :, 5:7] = torch.arange(5, 7, dtype=torch.float16)
    position_ids[2, :, 7:9] = torch.arange(5, 7, dtype=torch.float16)
    # tensor([[[0., 1., 2., 3., 4., 5., 5., 5., 5., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 5., 6., 6., 9.]],
    #         [[0., 1., 2., 3., 4., 5., 6., 5., 6., 9.]]], dtype=torch.float16)
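The manual steps above can be generalized into a small helper. This is only a sketch for a single image, written with plain Python lists so that it runs without dependencies; `mrope_position_ids` and its parameters are hypothetical names, not MNN or Qwen APIs. It also assumes, following the Hugging Face Qwen2-VL reference implementation as far as I understand it, that text after the image resumes at the largest previous id plus one, so the trailing token gets 7 here rather than the 9 left in place in the example above.

```python
def mrope_position_ids(head_len, grid_h, grid_w, tail_len):
    """Build [temporal, height, width] position ids for text + one image + text."""
    # Leading text: all three axes share the ordinary 1D index.
    t = list(range(head_len))
    h = list(range(head_len))
    w = list(range(head_len))

    # Image tokens in row-major order: temporal stays fixed,
    # height follows the row and width follows the column.
    for row in range(grid_h):
        for col in range(grid_w):
            t.append(head_len)
            h.append(head_len + row)
            w.append(head_len + col)

    # Assumption: trailing text resumes at (largest id so far) + 1.
    start = max(t + h + w) + 1 if t else 0
    tail = list(range(start, start + tail_len))
    return [t + tail, h + tail, w + tail]


ids = mrope_position_ids(head_len=5, grid_h=2, grid_w=2, tail_len=1)
for name, row in zip(("t", "h", "w"), ids):
    print(name, row)
# t [0, 1, 2, 3, 4, 5, 5, 5, 5, 7]
# h [0, 1, 2, 3, 4, 5, 5, 6, 6, 7]
# w [0, 1, 2, 3, 4, 5, 6, 5, 6, 7]
```

For text-only input (grid_h = grid_w = 0) the three rows are identical, which is why 1D ids are sufficient as long as no image tokens are present.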

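In the current implementation, the text part ends up using one-dimensional position ids throughout.

The image side uses the two-dimensional VisionRotary position encoding, but neither of these matches what module D ultimately requires, namely the M-ROPE applied when merging the text and image embeddings.

Please take another look at this issue. Thank you.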

Apr 10 '25 09:04 cloudyuyuyu

Thanks for the question. We will check again.

Apr 18 '25 03:04 wangzhaode

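We checked the code. Our earlier support for Qwen2-VL and Qwen2.5-VL did not cover video input, and a single inference pass accepted only one image as input, so the position encoding did not use m_rope. Support for this is now being added.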

Apr 23 '25 09:04 wangzhaode

Thanks for the reply. Looking forward to the update in a later release.

Apr 24 '25 01:04 cloudyuyuyu

Updated. Support was added in https://github.com/alibaba/MNN/pull/3505

May 08 '25 05:05 wangzhaode

Implementation: https://github.com/alibaba/MNN/blob/ebb8c8ff86b9bd15d6f3ca47a552e9ee11dbbefa/transformers/llm/engine/src/omni.cpp#L519

May 08 '25 05:05 wangzhaode

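Implementation:

MNN/transformers/llm/engine/src/omni.cpp, line 519 in ebb8c8f:

    VARP Omni::gen_position_ids(int seq_len) {

By the way, the APK project under apps\Android\MnnLlmChat has a lot of bugs and does not compile at all. Were a lot of files left out of the upload?

For example, this directory was never uploaded: com.alibaba.mnnllm.android.chat.model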

May 08 '25 08:05 cloudyuyuyu

One moment, we will check.

May 08 '25 08:05 wangzhaode

@Juude

May 08 '25 08:05 wangzhaode

https://github.com/alibaba/MNN/pull/3506

Uploaded.

May 08 '25 08:05 wangzhaode

It works now, thanks.

May 08 '25 09:05 cloudyuyuyu