vipant
Getting text/audio embeddings (and their gradients) from the pretrained models.
Hi!
First of all, awesome work! I'd like to be able to load a pretrained model so I can obtain text and audio embeddings (and their gradients). Could you help me figure out how to do that?
Thanks! :)
Also, I noticed that the audio preprocessing is done within kaldi, and not with torch preprocessing. Would that mean that I wouldn't be able to run gradients through the mel spectrogram transform?
Hi! I am using the Kaldi-compatible APIs of torchaudio. I think you are right: the transform function does not seem to produce any gradients, so there is no way to run gradients through the fbank / spectrogram transformation.
If you want to get the gradients of audio embeddings output by the audio encoder, maybe try torch.autograd.grad.
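For illustration, here is a minimal sketch of that suggestion. `ToyAudioEncoder` is a hypothetical stand-in, not this repository's model, and the tensor shapes are made up; with the real pretrained encoder you would substitute it for the toy module. The features are treated as precomputed fbank input, so the gradient is taken with respect to the features rather than the raw waveform.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for the pretrained audio encoder; the real model
# and its loading code are not shown in this thread.
class ToyAudioEncoder(nn.Module):
    def __init__(self, n_mels=64, embed_dim=16):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, fbank):
        # fbank: (batch, frames, n_mels) -> embedding: (batch, embed_dim)
        return self.proj(fbank).mean(dim=1)


torch.manual_seed(0)
encoder = ToyAudioEncoder().eval()

# Precomputed features play the role of the fbank output (assumed shapes).
# requires_grad=True makes this tensor the starting point for gradients.
fbank = torch.randn(2, 100, 64, requires_grad=True)
embedding = encoder(fbank)

# torch.autograd.grad needs a scalar output (or explicit grad_outputs),
# so reduce the embedding to a scalar before differentiating.
score = embedding.pow(2).sum()
(grad_fbank,) = torch.autograd.grad(score, fbank)

print(grad_fbank.shape)
```

The resulting `grad_fbank` has the same shape as the input features. To differentiate a non-scalar embedding directly, `torch.autograd.grad` also accepts a `grad_outputs` tensor of the same shape as the embedding.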