
During the pre-training and fine-tuning stages, do the Swin Transformer, the CLIP text encoder, and the tag embeddings all backpropagate gradients, and do they all use the same learning rate?

Open tigerzjh opened this issue 2 years ago • 7 comments

tigerzjh avatar Nov 23 '23 12:11 tigerzjh

The CLIP text encoder and the tag embeddings are both frozen; please see the RAM++ paper for details.

xinyu1205 avatar Nov 24 '23 02:11 xinyu1205

@xinyu1205 In the paper I only found three scattered places that mention gradient backpropagation (i.e., which parameters are trained):

1. Figure 3 shows the text encoder as frozen.
2. Section "A. More Implementation Details": "we employ the CLIP image encoder paired with the frozen text encoder to distill image feature, making full use of its original image text alignment properties".
3. The "Implementation Details" paragraph: "We employ the SwinBase [32] pre-trained on ImageNet [10] as the image encoder, and select base-scale models across other comparative methods for fair comparison. We leverage the off-the-shelf text encoder from CLIP [43] to extract text and tag description embeddings. We adopt the robust alignment loss function of ASL [46] for both image-text alignment and image tagging."

So if the model roughly splits into three parts — the CLIP text encoder, the Swin Transformer, and the alignment decoder — then in both the pre-training and fine-tuning stages, the Swin Transformer and the alignment decoder update their parameters while the CLIP text encoder stays frozen. Is that the correct understanding?
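In PyTorch terms, the training setup described above could be sketched roughly as follows. The module names here are hypothetical placeholders, not the actual recognize-anything code; each `nn.Linear` merely stands in for a full sub-network:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three parts discussed above.
image_encoder = nn.Linear(8, 8)      # stands in for the Swin Transformer
alignment_decoder = nn.Linear(8, 8)  # stands in for the alignment decoder
text_encoder = nn.Linear(8, 8)       # stands in for the CLIP text encoder

# Freeze the CLIP text encoder: no gradients, no optimizer updates.
for p in text_encoder.parameters():
    p.requires_grad = False

# Only the trainable parameters are passed to the optimizer.
trainable = [p for m in (image_encoder, alignment_decoder)
             for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One schematic forward/backward step.
x = torch.randn(2, 8)
loss = (alignment_decoder(image_encoder(x)) + text_encoder(x)).sum()
loss.backward()
# After backward(), gradients exist only for the unfrozen modules.
```

Whether the unfrozen parts share one learning rate or use per-group rates is an optimizer-configuration detail; a single `lr` for all trainable parameters, as above, is the simplest assumption.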

tigerzjh avatar Nov 24 '23 02:11 tigerzjh

Yes. Besides RAM, I also recommend reading the new RAM++ paper.

xinyu1205 avatar Nov 24 '23 02:11 xinyu1205

Both are must-reads — this work is really excellent, haha.

tigerzjh avatar Nov 24 '23 02:11 tigerzjh

@xinyu1205 Compared with RAM, the main improvements of RAM++ are:

  • On the text side, learnable queries are no longer used; instead, GPT-written sentences are encoded with the CLIP text encoder (in both training and testing).
  • The sentence-level loss is no longer a generation loss but the ASL loss, which simplifies the overall model.

Also, are RAM's image-tag recognition decoder and RAM++'s alignment decoder essentially identical in parameter count and structure? Is that a fair understanding?
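For reference, the ASL (Asymmetric Loss) mentioned above can be sketched as below. This is a minimal illustration of the loss from the ASL paper, not the exact recognize-anything implementation; the hyperparameter defaults (`gamma_pos=0`, `gamma_neg=4`, `clip=0.05`) are common settings from the ASL paper and may differ from what RAM/RAM++ actually uses:

```python
import torch

def asl_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    """Sketch of the Asymmetric Loss for multi-label tagging.

    logits:  raw scores, shape (batch, num_tags)
    targets: binary labels, same shape
    """
    p = torch.sigmoid(logits)
    # Probability shifting: hard-discard very easy negatives.
    p_neg = (p - clip).clamp(min=0)
    # Asymmetric focusing: separate exponents for positives and negatives.
    loss_pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=1e-8))
    loss_neg = (1 - targets) * p_neg ** gamma_neg \
        * torch.log((1 - p_neg).clamp(min=1e-8))
    return -(loss_pos + loss_neg).mean()
```

Unlike a captioning-style generation loss, this treats tagging as independent per-tag binary classification, which is what makes the decoder-side architecture simpler.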

tigerzjh avatar Nov 24 '23 02:11 tigerzjh

Yes, your understanding is exactly right.

xinyu1205 avatar Nov 24 '23 07:11 xinyu1205

Many thanks, @xinyu1205!

tigerzjh avatar Nov 24 '23 07:11 tigerzjh