tokenize-anything How are region-level descriptions obtained?

Open UcanSee opened this issue 1 year ago • 1 comments

Thanks for your great work! In your paper, the label for the semantic classification branch is the CLIP embedding of the mask-cropped region. How, then, is the ground truth (GT) for the caption branch generated from SA-1B?

Apr 22 '24 07:04 UcanSee

Hi, @UcanSee

The caption branch (i.e., the TextDecoder) is randomly initialized and is NOT trained during SA-1B pre-training, so no caption GT is generated from SA-1B.
The caption branch is then trained only on VG (Visual Genome) data, with the ImageEncoder and ImageDecoder frozen.
Further end-to-end (e2e) fine-tuning of the caption branch on a mixed SA/VG dataset, with the caption loss set to zero for SA data, can improve VG CIDEr from 154.7 to 164.7.
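
To illustrate the "set caption loss to zero for SA data" step, here is a minimal sketch in plain Python. It is not the repository's training code. The `Sample` class, the `other_loss` field (a stand-in for whatever non-caption losses apply to both sources), and averaging over the whole batch are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical per-sample record; field names are illustrative only."""
    source: str          # "SA" (SA-1B, no captions) or "VG" (Visual Genome, has captions)
    other_loss: float    # stand-in for the non-caption losses of this sample
    caption_loss: float  # caption loss; carries no meaning for SA samples


def mixed_batch_loss(batch: List[Sample]) -> float:
    """Average loss over a mixed SA/VG batch, zeroing the caption loss for SA samples."""
    total = 0.0
    for s in batch:
        # SA-1B has no caption annotations, so its caption term is masked out.
        caption_weight = 1.0 if s.source == "VG" else 0.0
        total += s.other_loss + caption_weight * s.caption_loss
    # Assumption: normalize by batch size (normalizing by the VG count is another option).
    return total / len(batch)


batch = [
    Sample("SA", other_loss=0.5, caption_loss=9.9),    # caption term ignored
    Sample("VG", other_loss=0.25, caption_loss=1.25),  # caption term counted
]
print(mixed_batch_loss(batch))  # 1.0
```

In a real framework the same idea is usually expressed as a per-sample mask multiplied into the caption loss before reduction, so SA samples still contribute gradients through the other branches.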

Apr 22 '24 08:04 PhyscalX
