UNINEXT

About the text encoder?

liuheng92 opened this issue 1 year ago · 1 comment

Hi, thanks for the great work, but I found a discrepancy with the paper. The paper says "We adopt BERT [26] as the text encoder and its parameters are trained in the first and second training stages while being frozen in the last training stage.", yet the repo says we should download pretrained ViT weights. So I am a little confused: should I use the original ViT model or the finetuned one? And where is the finetuned text encoder model?

liuheng92 commented Jan 16 '24

Hi, the ViT weights need to be downloaded only if you want to use ViT as the visual encoder (visual backbone). As mentioned in the paper, we always use BERT-base as the text encoder. The code of the text encoder is here.

Regarding finetuning: we always initialize the visual encoder with ImageNet-pretrained weights and the text encoder with HuggingFace-pretrained weights. During training, both encoders are then finetuned on instance perception data.
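As a rough sketch of the staged freezing described in the paper quote above (the module names and three-stage loop are illustrative, not the repo's actual training code; in the real setup the text encoder would be HuggingFace's BERT-base):

```python
import torch.nn as nn

# Stand-in modules: in UNINEXT the text encoder is BERT-base (HuggingFace
# pretrained) and the visual encoder is an ImageNet-pretrained backbone.
text_encoder = nn.Linear(768, 768)
visual_encoder = nn.Linear(256, 256)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Three training stages: the visual encoder is finetuned throughout,
# while the text encoder is trained in stages 1-2 and frozen in stage 3.
for stage in (1, 2, 3):
    set_trainable(visual_encoder, True)
    set_trainable(text_encoder, stage < 3)
```

After the stage-3 setup, only the visual encoder's parameters receive gradients, which matches the paper's statement that the text encoder is frozen in the last training stage.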

MasterBin-IIAU commented Jan 16 '24