Pretraining data of released source models.
Hi,
Amazing work! I'm just curious about the pretraining dataset used for the experiments on Kinetics-50C. Is the original CAV-MAE model fully fine-tuned on Kinetics-50, or are the CAV-MAE weights (initialized from VGGSound) kept frozen, with "only" the classifier being fine-tuned?
I don't understand this statement in the Appendix (for both datasets): "During the fine-tuning phase, we maintain the visual and audio encoders of the pre-trained model and add one randomly initialized classification head upon them." What are the pre-trained model weights here?
Hi,
We just followed the fine-tuning pipeline of CAV-MAE on the VGGSound dataset to get cav_mae_ks50.pth. The main modification is that we replaced the label weight file (NOTE: not the model weights) with the one from the KS50 dataset. For usage, you can directly use the fine-tuned checkpoint that has been uploaded.
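For reference, loading the released checkpoint looks roughly like this. This is only a minimal sketch: the `CAVMAEFT` class, its `label_dim` argument, and the `module.` prefix handling are assumptions based on the CAV-MAE codebase, so adjust them to your local code if they differ.

```python
# Minimal sketch: load the released fine-tuned KS50 checkpoint for inference.
# Assumptions: the CAV-MAE repo's CAVMAEFT fine-tuning model class is importable
# from your working directory, and KS50 has 50 classes.
import torch
from models import CAVMAEFT  # CAV-MAE fine-tuning model definition (assumed path)

model = CAVMAEFT(label_dim=50)  # 50-class head for Kinetics-50
ckpt = torch.load('cav_mae_ks50.pth', map_location='cpu')

# Checkpoints saved from a DataParallel-wrapped model carry a 'module.' prefix.
state_dict = {k.replace('module.', ''): v for k, v in ckpt.items()}
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing keys:', missing, 'unexpected keys:', unexpected)

model.eval()  # ready for evaluation on KS50 / Kinetics-50C
```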
Therefore, following CAV-MAE, the pre-trained weights for both the VGGSound and Kinetics-50 fine-tuning are inherited from the CAV-MAE pretrained model "cav-mae-scale++". The fine-tuning dataset used for the experiments on Kinetics-50C is detailed in Appendix B of our paper. You can find the JSON file for the dataset at the link.
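The JSON file follows CAV-MAE's data-manifest convention. The sketch below shows what an entry typically looks like; the field names ("video_id", "wav", "video_path", "labels") are assumptions based on CAV-MAE's VGGSound/AudioSet manifests, so please check the linked file for the exact schema used for KS50.

```python
# Sketch of a CAV-MAE-style dataset manifest entry (field names assumed;
# see the linked JSON for the actual KS50 schema and label IDs).
import json

manifest = {
    "data": [
        {
            "video_id": "example_clip_0001",
            "wav": "/path/to/audio/example_clip_0001.wav",
            "video_path": "/path/to/frames/example_clip_0001",
            "labels": "/m/ks50_label_id",  # maps to a row in the label CSV
        }
    ]
}

with open("ks50_train_example.json", "w") as f:
    json.dump(manifest, f, indent=2)
```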