ast
Some questions about the details of AST.
I would like to know how to explain why ImageNet-pretrained models can classify audio from spectrograms. As we all know, most of the images in ImageNet are everyday photos of things like cats, dogs, and cars. Are the features of these images and objects correlated with those of audio spectrograms? Why can knowledge learned from natural images be transferred to spectrogram classification?
I would appreciate it if you could answer my questions.
Hi there,
This is an interesting question, but I don't have a clear answer. It is worth noting that using ImageNet (IN) pretraining for audio tasks is not new to AST; it can be traced back to 2014.
-Yuan