[BUG] Incorrect patch_attn_mask computation in get_vllm_embedding
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
- [X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
- [X] 我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
```python
def get_vllm_embedding(self, data):
    if 'vision_hidden_states' not in data:
        dtype = self.vpm.embeddings.position_embedding.weight.dtype
        device = self.vpm.embeddings.position_embedding.weight.device
        tgt_sizes = data['tgt_sizes']
        pixel_values_list = data['pixel_values']
        best_grid = data["best_grid"]
        vision_hidden_states = []
        all_pixel_values = []
        img_cnt = []
        for pixel_values in pixel_values_list:
            img_cnt.append(len(pixel_values))
            all_pixel_values.extend([i.flatten(end_dim=1).permute(1, 0) for i in pixel_values])

        # exist image
        if all_pixel_values:
            tgt_sizes = torch.vstack(tgt_sizes).type(torch.int32)

            if self.config.batch_vision_input:
                max_patches = torch.max(tgt_sizes[:, 0] * tgt_sizes[:, 1])

                all_pixel_values = torch.nn.utils.rnn.pad_sequence(all_pixel_values, batch_first=True,
                                                                   padding_value=0.0)
                B, L, _ = all_pixel_values.shape
                all_pixel_values = all_pixel_values.permute(0, 2, 1).reshape(B, 3, -1, L)

                patch_attn_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool, device=device)
                for i in range(B):
                    patch_attn_mask[i, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True
```
The patch_attn_mask computation is wrong: the indexing is off by one dimension, so patch_attn_mask ends up all True. For example, when i = 4 there are 17 padding patches, so the last 17 entries should be False, yet the resulting patch_attn_mask is entirely True.
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/main/modeling_minicpmv.py#L97
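The behavior is easy to reproduce in isolation. Here is a minimal, self-contained sketch; the sizes below are made up for illustration (5 images, the last with 47 valid patches and 17 padding patches) and are not taken from the model:

```python
import torch

# Hypothetical sizes: 47 + 17 = 64 = max_patches for the last image (i = 4).
B, max_patches = 5, 64
tgt_sizes = torch.tensor([[8, 8], [8, 8], [8, 8], [8, 8], [47, 1]], dtype=torch.int32)

patch_attn_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool)
for i in range(B):
    # `[i, :n]` applies the slice to dim 1, which has size 1. PyTorch clamps the
    # slice to that dim, so the assignment fills all max_patches positions
    # instead of only the first n.
    patch_attn_mask[i, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True

print(patch_attn_mask[4, 0].sum().item())  # prints 64, but should be 47
```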
期望行为 | Expected Behavior
```python
patch_attn_mask[i, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True
```

should be changed to:

```python
patch_attn_mask[i, 0, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True
```
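With the explicit `0` index, the slice applies to the patch dimension and the padding positions stay False. Reusing the same made-up sizes as the sketch above:

```python
import torch

B, max_patches = 5, 64
tgt_sizes = torch.tensor([[8, 8], [8, 8], [8, 8], [8, 8], [47, 1]], dtype=torch.int32)

patch_attn_mask = torch.zeros((B, 1, max_patches), dtype=torch.bool)
for i in range(B):
    # Index the singleton dim explicitly so the slice hits the patch dimension.
    patch_attn_mask[i, 0, :tgt_sizes[i][0] * tgt_sizes[i][1]] = True

print(patch_attn_mask[4, 0].sum().item())  # prints 47; the 17 padding slots stay False
```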
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS: Ubuntu 20.04
- Python: 3.10.14
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
备注 | Anything else?
No response
Thanks for the report; we are assessing the impact.
Hi, this is indeed a mistake, thanks for the report. To keep training and inference consistent, we will not modify the code on Hugging Face directly; we will fix this systematically in a subsequent model release.

> Hi, this is indeed a mistake, thanks for the report. To keep training and inference consistent, we will not modify the code on Hugging Face directly; we will fix this systematically in a subsequent model release.

@YuzaChongyi Can you fully assess the impact? We are already fine-tuning the model and applying it to production. Or, when will the next model be released?
There is no problem as long as the behavior of patch_attn_mask is consistent between training and inference. We also tried modifying it directly, and it barely changes the inference results. This version will not be updated, so that the evaluation results remain reproducible.
The release date of the next model is not yet certain; we are working on it.