
How to use audio data in this model?

bugczw opened this issue 3 years ago • 5 comments

Generally speaking, audio input data has the shape [batch_size, time_steps, mel_bins]. But it's not clear how model parameters like input_axis should be set for audio. When training on audio, how should I set the model parameters and preprocess the audio data?

bugczw avatar Mar 25 '21 07:03 bugczw

@bugczw hello! do you want to give the latest version a try?

import torch
from perceiver_pytorch import Perceiver

model = Perceiver(
    input_channels = 3,          # number of channels for each token of the input
    input_axis = 1,              # number of input axes (1 for sequences like audio, 2 for images, 3 for video)
    num_freq_bands = 6,          # number of freq bands, with original value (2 * K + 1)
    max_freq = 10.,              # maximum frequency, hyperparameter depending on how fine the data is
    depth = 6,                   # depth of net
    num_latents = 256,           # number of latents, or induced set points, or centroids. different papers giving it different names
    cross_dim = 512,             # cross attention dimension
    latent_dim = 512,            # latent dimension
    cross_heads = 1,             # number of heads for cross attention. paper said 1
    latent_heads = 8,            # number of heads for latent self attention, 8
    cross_dim_head = 64,         # dimension per cross attention head
    latent_dim_head = 64,        # dimension per latent self attention head
    num_classes = 1000,          # output number of classes
    attn_dropout = 0.,           # dropout on attention weights
    ff_dropout = 0.,             # dropout in feedforward layers
    weight_tie_layers = False    # whether to weight tie layers (optional, as indicated in the diagram)
)

seq = torch.randn(1, 512, 3) # batch, time, mel bins

model(seq) # (1, 1000)
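
As for the preprocessing half of the question, here is a minimal sketch (assuming torchaudio is available; the clip length and mel settings below are placeholders, with n_mels = 3 chosen only to match input_channels = 3 above) of getting from a raw waveform to that (batch, time, mel bins) layout:

import torch
import torchaudio

waveform = torch.randn(1, 16000)   # hypothetical 1-second mono clip at 16 kHz

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate = 16000,
    n_fft = 1024,
    hop_length = 512,
    n_mels = 3,                    # only to match input_channels = 3 above; usually far larger
)

mel = to_mel(waveform)             # (1, n_mels, time_frames)
seq = mel.transpose(1, 2)          # (1, time_frames, n_mels) = (batch, time, mel bins)

model(seq)                         # (1, 1000)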

lucidrains avatar Mar 25 '21 15:03 lucidrains

Thanks a lot! And will the pretrained weights from training this model be published?

bugczw avatar Mar 26 '21 06:03 bugczw

@lucidrains Thanks for your suggestion for how to use spectrogram audio. My two cents concern the number of mel bins: it would usually be much more than three, so I'd make a small change to your snippet here:

model = Perceiver(
    input_channels = 64,          # number of channels for each token of the input
    # ... (other arguments as in the snippet above)
)

seq = torch.randn(1, 512, 64) # batch, time, mel bins

model(seq) # (1, 1000)

P.S. It seems to be working; the model seems to be learning something...
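
To make "learning something" concrete, here is a hypothetical minimal training step for this 64-mel-bin setup (random batch, fake labels, and Adam with a guessed learning rate; none of this is taken from my notebook):

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)

seq = torch.randn(8, 512, 64)           # (batch, time, mel bins)
labels = torch.randint(0, 1000, (8,))   # fake class targets

logits = model(seq)                     # (8, 1000)
loss = F.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()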

daisukelab avatar Mar 31 '21 15:03 daisukelab

@daisukelab Would you mind preparing an easy Colab for learning on audio? It would be nice if we had a collection of different problems solved by one model. Take a look at my very simple solver for 'object detection': https://colab.research.google.com/drive/1rCZWPpFlgPZC_sqiUtKRSf16rScJi0JW

batrlatom avatar Mar 31 '21 16:03 batrlatom

@batrlatom Hey, sorry to keep you waiting, but I finally made a Colab notebook for you: https://github.com/daisukelab/sound-clf-pytorch/blob/master/advanced/Perceiver_MelSpecAudio_Example_Colab.ipynb

But I have to say this is not finished, because it didn't reach the expected performance. I might update it in the future, but I feel we should follow the original paper's approach of using raw audio.
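
For reference, a rough sketch of what that raw-audio route might look like with this library. This is only my reading of the paper's setup (group raw samples into patches so each token carries several samples, then attend over the patch sequence); the patch size and clip length below are guesses:

import torch
from perceiver_pytorch import Perceiver

PATCH = 128                          # raw samples per token (a guess)

model = Perceiver(
    input_channels = PATCH,          # each token is one patch of raw samples
    input_axis = 1,                  # a single time axis
    num_freq_bands = 6,
    max_freq = 10.,
    depth = 6,
    num_latents = 256,
    cross_dim = 512,
    latent_dim = 512,
    cross_heads = 1,
    latent_heads = 8,
    cross_dim_head = 64,
    latent_dim_head = 64,
    num_classes = 1000,
)

audio = torch.randn(1, 480 * PATCH)  # ~1.28 s of 48 kHz mono audio
seq = audio.view(1, -1, PATCH)       # (batch, num_patches, patch_size)

model(seq)                           # (1, 1000)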

daisukelab avatar May 04 '21 07:05 daisukelab