Yuan Gong

Results: 80 comments by Yuan Gong

I cannot tell the reason either. But there's no RGB concept in the audio spectrogram. It has just a single channel. [128,1024] means 128 frequency bins and 1024 time frames, which looks...

The length depends on `input_tdim`. In your case, you should modify `run.py` to set `input_tdim=250`. `timem` should be smaller than `input_tdim`. Again, I suggest starting from either the speechcommands or...
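As a rough illustration of the `timem` < `input_tdim` constraint, a small sanity check could be run before training. This is only a sketch: `check_time_mask` is a hypothetical helper, not part of the repo, and `timem=48` is an arbitrary value chosen for the example.

```python
def check_time_mask(input_tdim: int, timem: int) -> None:
    """Raise ValueError unless 0 <= timem < input_tdim.

    Hypothetical helper for illustration only; it is not part of the AST code.
    input_tdim: number of time frames fed to the model.
    timem: maximum width (in frames) of the time mask.
    """
    if input_tdim <= 0:
        raise ValueError(f"input_tdim must be positive, got {input_tdim}")
    if timem < 0:
        raise ValueError(f"timem must be non-negative, got {timem}")
    if timem >= input_tdim:
        # A mask as wide as the whole input could blank out every frame.
        raise ValueError(
            f"timem ({timem}) should be smaller than input_tdim ({input_tdim})"
        )


# Example values: input_tdim=250 as suggested above, timem=48 is an assumption.
check_time_mask(input_tdim=250, timem=48)
```

If the check raises, lower `timem` or raise `input_tdim` before launching `run.py`.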

OK, I finally found the reason. This is due to a `torchaudio` issue. We use `torchaudio 0.8.1`, in which the input of the masking can be [freq, time] while the...

You can use the Colab script to find the bug: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/torchaudio_SpecMasking_1_1.ipynb

Hi there, Thanks for your interest. The transformer itself accepts variable-length input, but that requires some engineering (e.g., bucketing sequences of similar length). We didn't implement it in the code,...
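For readers unfamiliar with bucketing, the idea can be sketched in plain Python as below. This is not code from the repo: `bucket_by_length`, `bucket_width`, and the example lengths are all assumptions made for illustration. Clips whose frame counts fall in the same range are grouped, so each batch needs little padding.

```python
from collections import defaultdict
from typing import Dict, List


def bucket_by_length(
    lengths: List[int], bucket_width: int, batch_size: int
) -> List[List[int]]:
    """Group sample indices into batches of similar sequence length.

    Hypothetical helper for illustration only.
    lengths: number of time frames of each sample.
    bucket_width: samples whose lengths fall in the same
        [k * bucket_width, (k + 1) * bucket_width) range share a bucket.
    batch_size: maximum number of samples per batch.
    """
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, n_frames in enumerate(lengths):
        buckets[n_frames // bucket_width].append(idx)

    batches: List[List[int]] = []
    for key in sorted(buckets):
        members = buckets[key]
        # Split each bucket into chunks of at most batch_size samples.
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size])
    return batches


# Made-up frame counts for six clips.
lengths = [100, 980, 120, 1000, 110, 500]
batches = bucket_by_length(lengths, bucket_width=128, batch_size=2)

# Within a batch, pad only up to the longest member of that batch.
pad_targets = [max(lengths[i] for i in batch) for batch in batches]
```

A real training loop would typically also shuffle within and across buckets each epoch; that part is left out here.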

Hi Daniel, There are a few things. > I don't understand what you mean by 'majority voting' in my test set, but I'll just decide on an audio length for...

Hi Daniel, > I have been trying different stuff and indeed in some cases AST outperforms my current model (not really when resampling at 16K and/or using audioset pretrain tho,...

> I just have a question though, does the teacher model need to be already trained when using it through the KD training process? We always use a pretrained teacher because...

To use DEIT initialization, we have to initialize in the same way as DEIT, but as you point out, we average it in the forward pass. Good luck with your...

Thanks for your interest. I think it is not an overfitting issue as you should also see a performance drop in mAP or accuracy on the validation set if the...