[BUG] number of image start tokens and image end tokens mismatch
### Is there an existing issue / discussion for this?

- [X] I have searched the existing issues / discussions
### Is there an existing answer for this in the FAQ?

- [X] I have searched the FAQ
### Current Behavior

When the lengths of `image_start_tokens` and `image_end_tokens` differ, `valid_image_nums` is set to the larger of the two, so `torch.hstack` fails with a tensor size mismatch. Should the `max` be a `min`?

https://huggingface.co/openbmb/MiniCPM-V-2_6/blob/main/processing_minicpmv.py#L119
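To make the failure concrete, here is a minimal sketch (the tensor values are invented and the surrounding processor code is simplified; only the `max`/`min` and `torch.hstack` pattern follows the line linked above):

```python
import torch

# Simplified reproduction of the mismatch: three start markers were found,
# but only two end markers (e.g. the last pair was truncated away).
image_start_tokens = torch.tensor([3, 10, 17])
image_end_tokens = torch.tensor([9, 16])

valid_image_nums = max(len(image_start_tokens), len(image_end_tokens))  # 3

# Fails: hstack receives a (3, 1) tensor and a (2, 1) tensor.
# image_bounds = torch.hstack([
#     image_start_tokens[:valid_image_nums].unsqueeze(-1),
#     image_end_tokens[:valid_image_nums].unsqueeze(-1),
# ])

# With min, only fully paired markers are kept and the stack succeeds:
valid_image_nums = min(len(image_start_tokens), len(image_end_tokens))  # 2
image_bounds = torch.hstack([
    image_start_tokens[:valid_image_nums].unsqueeze(-1),
    image_end_tokens[:valid_image_nums].unsqueeze(-1),
])
print(image_bounds)  # tensor([[ 3,  9], [10, 16]])
```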
### Expected Behavior

No response
### Steps To Reproduce

Run the video example with `video_path="./assets/demo_video.mp4"`:

https://github.com/OpenBMB/MiniCPM-V?tab=readme-ov-file#chat-with-video
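For reference, a condensed sketch of that example (the frame-sampling helper and the exact `model.chat` arguments are simplified from the README, so treat the details as approximate; `decord` is required):

```python
import torch
from PIL import Image
from decord import VideoReader, cpu
from transformers import AutoModel, AutoTokenizer

MAX_NUM_FRAMES = 64  # README default; see the discussion of this value below

def encode_video(video_path):
    # Sample frames at roughly 1 fps, capped at MAX_NUM_FRAMES
    # (simplified from the README helper).
    vr = VideoReader(video_path, ctx=cpu(0))
    step = max(1, round(vr.get_avg_fps()))
    idx = list(range(0, len(vr), step))[:MAX_NUM_FRAMES]
    return [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa',
                                  torch_dtype=torch.bfloat16).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6',
                                          trust_remote_code=True)

frames = encode_video("./assets/demo_video.mp4")
msgs = [{'role': 'user', 'content': frames + ['Describe the video.']}]
# The "number of image start tokens and image end tokens mismatch" error
# is raised inside the processor during this call:
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer,
                    use_image_id=False, max_slice_nums=1)
print(answer)
```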
### Environment
- OS: Ubuntu 20.04
- Python: 3.10
- Transformers: 4.40.0
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8
### Anything else?

No response
Hello, maybe your model length setting is too small and the video is too long: when the input is truncated to the model's maximum length, a trailing image end token can be cut off while its matching start token survives, which produces the mismatch.
@LDLINGLINGLING Yes, downsampling works around this, but I still think L119 is incorrect, since it will stack two tensors of different lengths.
Where do I set the model length?
In the video example, set:

```python
MAX_NUM_FRAMES = 40  # default is 64; if CUDA OOM, set a smaller number when running inference on videos
```
40 may be the largest value that fits, since the maximum number of tokens is 8192.