Question regarding reproduction on the VGGSound dataset
Thanks for the great work! However, I'm unable to reproduce the reported VGGSound-C results from the paper. I've noticed two places that may cause this; could the authors help me out?
- Different hyper-parameter settings: mainly, are the mean and std correct? Could the authors release the full command for the VGGSound dataset? I'm running with the following hyper-parameters and getting much worse results (see the command sketch after this list): `dataset='vggsound', json_root='./json_csv_files/VGGSound', label_csv='./json_csv_files/class_labels_indices_vgg.csv', model='cav-mae-ft', dataset_mean=-5.081, dataset_std=4.4849, target_length=1024, lr=0.001, weight_decay=0.001, optim='adam', batch_size=64, num_workers=0, pretrain_path='./pretrained_model/vgg_65.5.pth', gpu='0', testmode='multimodal', tta_method='READ', corruption_modality='audio', severity_start=5, severity_end=5`
- The released VGGSound pretrained model does not have the attention fusion layer; you can tell by comparing the file sizes of cav_mae_ks50.pth and vgg_65.5.pth (see the shell snippet below). Could the authors release the full pre-trained model or explain the difference?
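For reference, here is roughly the invocation I'm using. This is only a sketch: `main_tta.py` is my guess at the repo's entry script, and the flag spellings assume a one-to-one mapping to the argparse settings listed above.

```bash
# Hypothetical entry point; substitute the repo's actual main script.
python main_tta.py \
  --dataset vggsound \
  --json_root ./json_csv_files/VGGSound \
  --label_csv ./json_csv_files/class_labels_indices_vgg.csv \
  --model cav-mae-ft \
  --dataset_mean -5.081 \
  --dataset_std 4.4849 \
  --target_length 1024 \
  --lr 0.001 \
  --weight_decay 0.001 \
  --optim adam \
  --batch_size 64 \
  --num_workers 0 \
  --pretrain_path ./pretrained_model/vgg_65.5.pth \
  --gpu 0 \
  --testmode multimodal \
  --tta_method READ \
  --corruption_modality audio \
  --severity_start 5 \
  --severity_end 5
```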
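And this is how I checked the size gap between the two checkpoints (paths assume both sit under ./pretrained_model/, as in my settings above):

```bash
# Compare the on-disk sizes of the two released checkpoints; the one
# missing the attention fusion layer should be noticeably smaller.
ls -lh ./pretrained_model/vgg_65.5.pth ./pretrained_model/cav_mae_ks50.pth
```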
Thanks for your time!
Hi,
Sorry for the delayed response. I downloaded the repo, ran the experiment on VGGSound with Gaussian noise at severity 5 using the released command, and successfully reproduced the results (~40.3).
I noticed that your posted command changes the learning rate and adds weight decay, which is likely the significant difference.
The model vgg_65.5.pth was taken directly from this link (which is why it does not include the attention fusion layer), while cav_mae_ks50.pth was fine-tuned by the authors, as mentioned in issue #5.
Thanks
Thanks for the reply. I'm getting normal results now.
Though I'm still curious why TTA with a self-supervised training objective can work even when part of the model (the attention fusion layer) is not pre-trained. Do the authors have a good explanation?