Qwen2.5-VL
Question about vision attention masks
Previously, I raised an issue on HF about Qwen2-VL's PatchEmbed implementation and vision masks, which has received no response so far. After going through the details of the recent Qwen2.5-VL, I still have the same question: the vision masks only allow attention within a single frame (or a single window in the new Qwen2.5-VL), with no cross attention between frames (or between windows of an image). Let me restate the question with the same example. As I tested, the vision mask for a single image or a two-frame input looks like
| 0 | 0 | -inf | -inf |
| 0 | 0 | -inf | -inf |
| -inf | -inf | 0 | 0 |
| -inf | -inf | 0 | 0 |
https://github.com/pyenv/pyenv/blob/master/versions/3.10.16/envs/qwen/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L232 which masks out cross attention between the two frames and only applies attention within each frame. In Qwen2.5-VL, the same idea is applied even more aggressively at the window level, so there is no communication between windows of the same image. This holds for most blocks in the vision encoder, since only 4 of the 32 blocks use full attention while the rest use window attention. Although vision and text tokens are further processed by the LM part, this locally focused masking may not be sufficient for multi-frame video understanding or for global information interaction / aggregation in the vision tower. Did the team run ablation experiments on the tradeoff between training efficiency and attention granularity?
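For reference, here is a minimal sketch of what I mean, based on my reading of the code around the linked line (the helper name `build_blockwise_mask` is mine, for illustration only, not the actual HF implementation): the mask is block-diagonal over cumulative sequence lengths (`cu_seqlens`), where each segment is one frame, or one window in the window-attention blocks.

```python
import torch

def build_blockwise_mask(cu_seqlens: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Block-diagonal vision mask: tokens attend only within their own segment.

    Each range [cu_seqlens[i-1], cu_seqlens[i]) is one frame (or one window),
    so attention across segments stays at -inf.
    """
    mask = torch.full((1, seq_len, seq_len), float("-inf"))
    for i in range(1, len(cu_seqlens)):
        start, end = int(cu_seqlens[i - 1]), int(cu_seqlens[i])
        mask[..., start:end, start:end] = 0  # attention allowed only inside this segment
    return mask

# Two frames with 2 tokens each reproduce the 4x4 mask shown above.
print(build_blockwise_mask(torch.tensor([0, 2, 4]), seq_len=4))
```

With this construction, the per-window `cu_seqlens` used by the window-attention blocks produce even smaller diagonal blocks than the per-frame case, which is exactly why I am asking about cross-window / cross-frame information flow.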
Please correct me if I am wrong; looking forward to the Qwen team's response, thanks.