Conceptual question about cross-view attention
Hi, thanks for the amazing work! I have a small conceptual question about how the proposed cross-view attention mechanism works in the transformer.
If I understand correctly, during cross-view attention each image patch attends only to the patches at the same spatial location (in image space) in the other images, and not to any other patches. I'm curious why this makes sense: across different views, image patches at the same spatial location do not necessarily correspond to the same part of the scene, so the information exchanged might not be very useful for reconstruction. Is this a non-issue because of the shifted-window approach that Swin Transformers take, so that global information still gets propagated effectively across layers? Or am I misunderstanding how cross-view attention works?
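To make my reading concrete, here is a minimal NumPy sketch of what I think is happening. It is only my interpretation, not the actual implementation: the function name `cross_view_attention`, the `(V, N, C)` layout, and the use of identity projections in place of learned Q/K/V weights are all my own assumptions.

```python
import numpy as np

def cross_view_attention(x):
    """Hypothetical sketch: x has shape (V, N, C) = (views, patches, channels).

    Each patch attends only along the view axis, i.e. to the patches
    with the same patch index in the other views (and itself).
    """
    V, N, C = x.shape
    # Treat each patch location as a batch item: (N, V, C).
    xt = np.transpose(x, (1, 0, 2))
    # Identity Q/K/V projections for simplicity (assumption; a real model learns these).
    scores = xt @ np.transpose(xt, (0, 2, 1)) / np.sqrt(C)  # (N, V, V)
    # Numerically stable softmax over the view axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ xt  # (N, V, C)
    return np.transpose(out, (1, 0, 2))  # back to (V, N, C)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))  # 3 views, 4 patches, 8 channels
out = cross_view_attention(x)
```

In this sketch, perturbing patch 2 in one view changes the output only at patch index 2 (in every view) and leaves all other patch locations untouched, which is exactly the behavior I am asking about.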
Any guidance would be appreciated!