perceiver-pytorch
How to use audio data in this model?
Generally speaking, the shape of audio input data is [batch_size, time_steps, mel_bins]. But model parameters such as input_axis behave differently for audio. When training on audio, how should I set the model parameters and preprocess the audio data?
@bugczw Hello! Do you want to give the latest version a try?
```python
import torch
from perceiver_pytorch import Perceiver

model = Perceiver(
    input_channels = 3,        # number of channels for each token of the input
    input_axis = 1,            # number of input dimensions (1 for sequences such as audio, 2 for images, 3 for video)
    num_freq_bands = 6,        # number of freq bands, with original value (2 * K + 1)
    max_freq = 10.,            # maximum frequency, hyperparameter depending on how fine the data is
    depth = 6,                 # depth of net
    num_latents = 256,         # number of latents, or induced set points, or centroids. different papers give it different names
    cross_dim = 512,           # cross attention dimension
    latent_dim = 512,          # latent dimension
    cross_heads = 1,           # number of heads for cross attention. paper said 1
    latent_heads = 8,          # number of heads for latent self attention, 8
    cross_dim_head = 64,       # dimension per cross attention head
    latent_dim_head = 64,      # dimension per latent self attention head
    num_classes = 1000,        # output number of classes
    attn_dropout = 0.,
    ff_dropout = 0.,
    weight_tie_layers = False  # whether to weight tie layers (optional, as indicated in the diagram)
)

seq = torch.randn(1, 512, 3)  # batch, time, mel bins
model(seq)  # (1, 1000)
```
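The snippet above settles the input shape but not the preprocessing. Here is a minimal sketch of my own (not from the library) for turning a raw waveform into the [batch, time, bins] layout, using a plain STFT magnitude spectrogram as a stand-in for log-mel features:

```python
import torch

# Turn a raw waveform into a [batch, time, freq_bins] tensor, the layout
# the Perceiver snippet above expects (input_axis = 1, channels = bins).
waveform = torch.randn(1, 16000)           # 1 second of 16 kHz audio
spec = torch.stft(
    waveform,
    n_fft = 1024,
    hop_length = 512,
    window = torch.hann_window(1024),
    return_complex = True,
)                                           # (batch, freq, frames), complex
log_spec = spec.abs().clamp(min = 1e-5).log()
seq = log_spec.transpose(1, 2)              # (batch, frames, freq) = (1, 32, 513)
print(seq.shape)                            # torch.Size([1, 32, 513])
```

A Perceiver built with `input_channels = 513` (or a mel filterbank applied first to reduce the bin count) could then consume `seq` directly.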
Thanks a lot! And will the trained weight parameters of this model be published?
@lucidrains Thanks for your suggestion on how to use spectrogram audio. My two cents concern the number of mel bins: it would usually be much more than three, so I'd make a small change to your snippet:
```python
model = Perceiver(
    input_channels = 64,  # number of channels for each token of the input
    ...
)

seq = torch.randn(1, 512, 64)  # batch, time, mel bins
model(seq)  # (1, 1000)
```
P.S. It seems to be working; the model seems to be learning something...
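To make "learning something" concrete, here is a minimal training-step sketch for the 64-mel-bin setup. A tiny mean-pool + linear classifier stands in for the Perceiver so the snippet runs without the library installed; for real use, build the model as in the snippets above and call it the same way:

```python
import torch
import torch.nn as nn

classifier = nn.Linear(64, 1000)       # stand-in for the Perceiver above
optimizer = torch.optim.Adam(classifier.parameters(), lr = 1e-4)
criterion = nn.CrossEntropyLoss()

seq = torch.randn(8, 512, 64)          # batch of 8 log-mel clips
labels = torch.randint(0, 1000, (8,))  # dummy class targets

logits = classifier(seq.mean(dim = 1)) # (8, 1000); with Perceiver: model(seq)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```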
@daisukelab Would you mind preparing some easy Colab for learning audio? It would be nice we would have a collection of different problems solved by one model. Take a look at mine very easy solver for 'object detection' https://colab.research.google.com/drive/1rCZWPpFlgPZC_sqiUtKRSf16rScJi0JW
@batrlatom Hey, sorry to keep you waiting, but I finally made a Colab notebook for you: https://github.com/daisukelab/sound-clf-pytorch/blob/master/advanced/Perceiver_MelSpecAudio_Example_Colab.ipynb
That said, I have to say it isn't finished, because it didn't reach the expected performance. I might update it in the future, but I feel we should follow the original paper's approach of using raw audio.
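For reference, here is a sketch of how a raw-audio layout might map onto this library (this is my assumption, not something confirmed in this thread): each waveform sample becomes a token with a single channel, so `input_axis` stays 1 and `input_channels` would be 1.

```python
import torch

# Hypothetical raw-audio layout: one channel per waveform sample, so the
# model would be built with input_axis = 1, input_channels = 1.
waveform = torch.randn(1, 16000)  # (batch, samples), 1 s at 16 kHz
seq = waveform.unsqueeze(-1)      # (batch, samples, channels) = (1, 16000, 1)
print(seq.shape)                  # torch.Size([1, 16000, 1])
```

The sequence is much longer than the spectrogram version, which is exactly the regime the Perceiver's cross-attention bottleneck is designed to handle.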