Fine-tune CAV-MAE on ESC-50
Hi Yuan, did you fine-tune CAV-MAE on the ESC-50 dataset? Could you advise me on the training pipeline? Thank you very much.
No, we didn't do ESC-50 experiments with CAV-MAE, but I would expect similar or better performance compared with AST.
In general, we cleaned and released all the code for the main manuscript and part of the appendix. It is hard for me to clean up the rest as I have limited time. The ESC-50 experiments are not in the main manuscript or the appendix; we honestly don't have them.
-Yuan
You can refer to the audio-only recipe and the AST ESC-50 recipe and do it yourself.
Hi Yuan, thanks for your suggestions. I tried ESC-50 but only got about 88% accuracy. In the implementation, when I load the checkpoint I get an error about the mismatched dimension of 'module.pos_embed_a'. I know this is caused by the different audio length; for ESC-50, the target length is set to 512. What I did was skip loading the 'module.pos_embed_a' parameter and re-initialize 'module.pos_embed_a' with the new sequence length from scratch during ESC-50 training. I am not sure if this affects the performance.
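A minimal sketch of that skip-and-reinitialize loading (the checkpoint path and `model` are placeholders, not verified repo code):

```python
import torch

# Load the pretrained weights but drop any entry whose shape no longer
# matches the fine-tuning model (e.g. 'module.pos_embed_a' at length 512).
ckpt = torch.load('cav_mae_checkpoint.pth', map_location='cpu')  # placeholder path
model_sd = model.state_dict()

filtered = {k: v for k, v in ckpt.items()
            if k in model_sd and v.shape == model_sd[k].shape}
model.load_state_dict(filtered, strict=False)

# Anything not in `filtered` (here, the audio positional embedding)
# keeps its fresh random initialization and is learned from scratch.
```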
It would be much better to trim module.pos_embed_a to the desired length instead of randomly initializing it.
Another method is just to pad all ESC-50 recordings to 10s; the script should automatically do that.
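Something like the usual AST-style dataloader padding should do it; a sketch, where the file path and the 1024-frame target are assumptions based on the AudioSet recipe rather than verified repo code:

```python
import torch
import torchaudio

waveform, sr = torchaudio.load('esc50_clip.wav')  # placeholder path
# Kaldi-style log-mel filterbank: 10 ms frame shift, 128 mel bins.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

target_length = 1024  # 10 s at a 10 ms frame shift
p = target_length - fbank.shape[0]
if p > 0:
    # Zero-pad the time dimension up to the 10 s target.
    fbank = torch.nn.functional.pad(fbank, (0, 0, 0, p))
elif p < 0:
    fbank = fbank[:target_length, :]
```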
Btw, 88% isn't bad for a model without supervised AudioSet training. For better results, start with an AudioSet supervised pretrained checkpoint, e.g., https://github.com/YuanGongND/cav-mae#cav-mae-pretrainedfinetuned-models.
Hi Yuan, many thanks for your patient response. I tried to trim 'module.pos_embed_a' to match the desired length. First, after loading the pretrained model, 'module.pos_embed_a' has a shape of [1, 512, 768]. Then, it is reshaped into [1, 768, 8, 64]. Because ESC-50 audio is about half the length of AudioSet audio, the desired positional embedding has a shape of [1, 768, 8, 32]. Finally, it is reshaped into [1, 256, 768]. Do you think this is reasonable? In addition, I used a [16, 16] stride instead of [10, 10]; I am not sure whether the stride has an important influence on performance. What's more, I would like to check with you whether the positional embedding is learnable or fixed.
Hi, I apologize, but I don't have time to follow up on issues about a new application/dataset with CAV-MAE, especially for competitive performance. Usually, some tuning is needed, e.g., learning rate, batch size, etc.
> First, after loading the pretrained model, 'module.pos_embed_a' has a shape of [1, 512, 768]. Then, it is reshaped into [1, 768, 8, 64]. Because ESC-50 audio is about half the length of AudioSet audio, the desired positional embedding has a shape of [1, 768, 8, 32]. Finally, it is reshaped into [1, 256, 768].
You need to check the order of the time and frequency dimensions. torch.reshape is different from torch.permute: when you convert [1, 512, 768] to [1, 768, 512], you need torch.permute (or torch.transpose) instead of reshape. There might be other things to take care of.
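A hedged sketch of the trim done this way; it assumes the 512 patches form an 8 x 64 frequency-by-time grid stored frequency-major, and that `ckpt` is the loaded state dict from the earlier sketch, so verify the actual patch order against the repo's patch embedding before relying on it:

```python
import torch

pos = ckpt['module.pos_embed_a']            # [1, 512, 768]
f_dim, t_dim, t_new = 8, 64, 32             # 512 target frames -> 32 time patches

pos = pos.transpose(1, 2)                   # [1, 768, 512] -- transpose, NOT reshape
pos = pos.reshape(1, 768, f_dim, t_dim)     # [1, 768, 8, 64], frequency-major assumed
pos = pos[:, :, :, :t_new]                  # keep the first 32 time positions
pos = pos.reshape(1, 768, f_dim * t_new)    # [1, 768, 256]
pos = pos.transpose(1, 2).contiguous()      # back to [1, 256, 768]
```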
> In addition, I used a [16, 16] stride instead of [10, 10]; I am not sure whether the stride has an important influence on performance.
The code in this repo only supports a [16, 16] stride (no overlap). For other strides, you need to implement it yourself.
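To see why, here is a toy illustration (not repo code) of how the stride changes the patch count, and hence the required positional-embedding length:

```python
import torch
import torch.nn as nn

spec = torch.randn(1, 1, 1024, 128)        # 10 s AudioSet-style spectrogram

# Non-overlapping 16x16 patches: 64 x 8 = 512 patches, matching pos_embed_a.
print(nn.Conv2d(1, 768, kernel_size=16, stride=16)(spec).flatten(2).shape)
# torch.Size([1, 768, 512])

# A [10, 10] stride produces overlapping patches: 101 x 12 = 1212 patches,
# so the positional embedding (and any masking logic) would need reworking.
print(nn.Conv2d(1, 768, kernel_size=16, stride=10)(spec).flatten(2).shape)
# torch.Size([1, 768, 1212])
```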
> What's more, I would like to check with you whether the positional embedding is learnable or fixed.
Please check the code. I cannot recall offhand, but I do recall that performance-wise they are very similar.
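For what it's worth, one quick way to check this yourself; the `model.module.pos_embed_a` attribute path is assumed from the checkpoint keys in this thread:

```python
# A learnable embedding is an nn.Parameter with requires_grad=True;
# a fixed one (e.g. sinusoidal) is a buffer or has requires_grad=False.
pe = model.module.pos_embed_a  # placeholder model, attribute path assumed
print(type(pe).__name__, pe.requires_grad)
```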
I might not be able to follow up on this further.