RegionProxy
Region embeddings
Where exactly are the region embeddings passed as tokens to the transformer encoder?
As far as I can see, both the tokens and the affinity are defined in the decoder head.
It seems that the author uses in_index and out_index in the config files to select one of the intermediate output features of the ViT backbone as the affinity head's input. See lines 68-71 of proxy_head.py:
def forward(self, inputs):
    x_mid, x = self._transform_inputs(inputs)  # (B, C, H, W)
    B, _, H, W = x.shape
    affinity = self.forward_affinity(x_mid)
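To illustrate the index-based selection described in the reply above, here is a toy sketch. It is not the repository's code: the class `ToyHead`, the parameters `mid_index` and `last_index`, and the string placeholders standing in for feature maps are all assumptions made for illustration. It only shows the general pattern of a head picking one intermediate backbone output for the affinity branch and another for the token branch.

```python
from typing import Sequence, Tuple


class ToyHead:
    """Hypothetical stand-in for a decoder head; all names are illustrative."""

    def __init__(self, mid_index: int, last_index: int = -1) -> None:
        # Assumed config values: which backbone outputs feed each branch.
        self.mid_index = mid_index
        self.last_index = last_index

    def _transform_inputs(self, inputs: Sequence[str]) -> Tuple[str, str]:
        # Select an intermediate feature (for affinity) and a later one (for tokens).
        return inputs[self.mid_index], inputs[self.last_index]

    def forward(self, inputs: Sequence[str]) -> Tuple[str, str]:
        x_mid, x = self._transform_inputs(inputs)
        affinity = f"affinity({x_mid})"  # placeholder for the affinity branch
        tokens = f"tokens({x})"          # placeholder for the token branch
        return affinity, tokens


# Strings stand in for the feature maps a backbone might return.
backbone_outputs = ["feat_layer3", "feat_layer6", "feat_layer9", "feat_layer12"]
head = ToyHead(mid_index=1)
result = head.forward(backbone_outputs)
print(result)
```

Under these assumptions, the affinity branch sees the feature at position 1 of the backbone's output list, while the token branch sees the last one. Which layers those positions actually map to depends on the config file in use.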