GroupViT
Multi-Label Image-Text Contrastive Loss
Hi! Very good work. I have a question: why not align the 8 segment tokens with the generated text? Would that work better?
Hi @pzhren ,
Truly sorry for the late reply.
Since we don't use ground-truth masks, it's difficult to define the correct match. We did try some matching between the text and the segment tokens, but it didn't lead to any improvement.