SegVit
Understanding Q, K, V in SegViT
Hi! I'm new to working with Vision Transformers and currently exploring various approaches to image segmentation. While reading the paper, I found the approach taken here quite interesting.
However, I'm having trouble understanding how the Attention-to-Mask (ATM) mechanism is implemented, specifically how it modifies or reinterprets the queries, keys, and values compared to standard attention.
If anyone knows of any resources or explanations that could help clarify this, I’d really appreciate it. Thanks in advance!