
Getting text/audio embeddings (and their gradients) from the pretrained models.

Open hugofloresgarcia opened this issue 2 years ago • 3 comments

Hi!

First of all, awesome work! I'd like to be able to load a pretrained model so I can obtain text and audio embeddings (and their gradients). Could you help me figure out how to do that?

Thanks! :)

hugofloresgarcia avatar Aug 12 '22 15:08 hugofloresgarcia

Also, I noticed that the audio preprocessing is done within kaldi, and not with torch preprocessing. Would that mean that I wouldn't be able to run gradients through the mel spectrogram transform?

hugofloresgarcia avatar Aug 12 '22 16:08 hugofloresgarcia

Hi! I am using the Kaldi-compatible APIs of torchaudio. I think you are right: the transform function does not seem to produce any gradients, so there is no way to run gradients through the fbank / spectrogram transformation.

If you want to get the gradients of audio embeddings output by the audio encoder, maybe try torch.autograd.grad.
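For illustration, here is a minimal sketch of the `torch.autograd.grad` suggestion. `ToyAudioEncoder` is a hypothetical stand-in, not the model in this repo, and the input shape (batch, frames, mel bins) is an assumption; swap in the loaded pretrained audio encoder and its real fbank features.

```python
import torch
import torch.nn as nn


class ToyAudioEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained audio encoder (not the repo's model)."""

    def __init__(self, n_mels=128, embed_dim=16):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, fbank):
        # (batch, frames, n_mels) -> project each frame, then mean-pool over frames
        return self.proj(fbank).mean(dim=1)


torch.manual_seed(0)
encoder = ToyAudioEncoder().eval()

# Precomputed fbank features (assumed shape); requires_grad=True makes them
# a leaf tensor that autograd can differentiate with respect to.
fbank = torch.randn(2, 100, 128, requires_grad=True)
embedding = encoder(fbank)

# autograd.grad needs a scalar output (or explicit grad_outputs), so the
# embedding is reduced to a scalar here; any scalar function of it works.
score = embedding.pow(2).sum()
(grad_fbank,) = torch.autograd.grad(score, fbank)

print(embedding.shape, grad_fbank.shape)
```

Unlike `score.backward()`, `torch.autograd.grad` returns the gradient directly instead of accumulating it into `.grad`, which is convenient when only the gradient with respect to the input features is needed.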

zhaoyanpeng avatar Aug 13 '22 17:08 zhaoyanpeng

If you want to encode your own audio-text data with a pre-trained VA model, you would need to modify this function so that it directly saves the audio and text features.

You might find this AT retrieval script helpful.

Let me know if that works for you!

zhaoyanpeng avatar Aug 13 '22 19:08 zhaoyanpeng