Understanding Q, K, V in SegViT

Open lapiceroazul4 opened this issue 8 months ago • 0 comments

Hi! I'm new to working with Vision Transformers and currently exploring various approaches to image segmentation. While reading the paper, I found the approach taken here quite interesting.

However, I'm having trouble understanding how the Attention-to-Mask (ATM) mechanism is implemented, specifically how it changes or reinterprets the roles of queries, keys, and values compared to standard attention.
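To make the question concrete, the contrast can be sketched as follows. This is a minimal NumPy sketch, not the repository's implementation. It assumes, based on the paper's description, that ATM uses class tokens as queries and patch features as keys and values, and that the masks are read off the pre-softmax similarity map with a sigmoid. The function names, shapes, and the omission of learned projections, multiple heads, and normalization are all simplifications made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def standard_attention(q, k, v):
    """Scaled dot-product attention: only the attended output is returned."""
    sim = q @ k.T / np.sqrt(q.shape[-1])      # (num_queries, num_keys)
    return softmax(sim, axis=-1) @ v          # (num_queries, dim)

def atm_style_attention(class_tokens, patch_feats):
    """Hypothetical ATM-style step (assumed, not the repo's code).

    The same similarity map is used twice: softmax over patches to update
    the class tokens, and sigmoid to produce one soft mask per class.
    """
    # Learned Q/K/V projections are omitted here for brevity.
    q, k, v = class_tokens, patch_feats, patch_feats
    sim = q @ k.T / np.sqrt(q.shape[-1])      # (num_classes, num_patches)
    updated_tokens = softmax(sim, axis=-1) @ v
    masks = sigmoid(sim)                      # per-class mask over patches
    return updated_tokens, masks

rng = np.random.default_rng(0)
num_classes, num_patches, dim = 3, 16, 8      # illustrative sizes
class_tokens = rng.normal(size=(num_classes, dim))
patch_feats = rng.normal(size=(num_patches, dim))

tokens, masks = atm_style_attention(class_tokens, patch_feats)
print(tokens.shape, masks.shape)
```

Under these assumptions, the token update is identical to standard cross-attention; the only difference is that the similarity map is kept and turned into masks instead of being discarded. Whether this matches the actual code is exactly what I'd like to confirm.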

If anyone knows of any resources or explanations that could help clarify this, I’d really appreciate it. Thanks in advance!

May 04 '25 02:05 lapiceroazul4