
Query-to-patch attention for linear projection


Very impressive work on MLLM interpretability.

I want to know how to compute the query-to-patch attention map (the top row of Fig. 3) for linear projection (e.g., LLaVA), since each query is obtained by a one-to-one mapping from a single visual patch.

darkpromise98 avatar Nov 08 '24 07:11 darkpromise98

Hi, since the query tokens and the patch tokens have a one-to-one mapping (the i-th of the 576 query tokens corresponds exactly to the i-th patch token), we directly visualize the 576 (24x24) patch grids on the raw image, which covers the whole image, without any additional computation. We acknowledge that the query-to-patch visualization for linear projection (first row) differs slightly from that for compressive projection (second and third rows); this was done to improve clarity.
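To illustrate the answer above: with a linear projector, the "query-to-patch attention" for query i degenerates to a one-hot map over the 24x24 patch grid, which can then be upsampled to image resolution for overlay. A minimal sketch (not the authors' code; the 576-token / 336px LLaVA-style shapes are assumptions):

```python
import numpy as np

NUM_PATCHES = 576   # assumed 24 x 24 ViT patch grid (LLaVA-style)
GRID = 24
IMAGE_SIZE = 336    # assumed 336 x 336 input, so each patch spans 14 px

def query_to_patch_map(query_idx: int) -> np.ndarray:
    """One-hot (GRID, GRID) map: with linear projection, query i
    corresponds exactly to patch i, so no attention needs computing."""
    attn = np.zeros(NUM_PATCHES)
    attn[query_idx] = 1.0
    return attn.reshape(GRID, GRID)

def upsample_to_image(patch_map: np.ndarray) -> np.ndarray:
    """Nearest-neighbor upsample of the patch grid to image resolution,
    suitable for overlaying on the raw image as a heatmap."""
    scale = IMAGE_SIZE // GRID
    return np.kron(patch_map, np.ones((scale, scale)))

# e.g., the map for query 25 highlights the patch at row 1, col 1
heat = upsample_to_image(query_to_patch_map(25))
```

For compressive projectors the one-hot map would instead be replaced by the learned cross-attention weights from each query to all patches.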

yaolinli avatar Nov 10 '24 09:11 yaolinli