ImageBind Varying the sound length

Open datovar4 opened this issue 2 years ago • 5 comments

Fantastic work! I have been evaluating the model using sound files of different lengths. For sounds shorter than the 2-second audio clips used in training (500 ms in this example), I get the following warning: WARNING:root:Large gap between audio n_frames(48) and target_length (204). Is the audio_target_length setting correct?

My question is: how do sound clips of varying length affect the embedding output? In other words, can I still use embeddings from shorter clips, or should I repeat shorter sounds until they approximate the 2 seconds expected by the model?
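
For concreteness, here is a minimal sketch of the "duplicate shorter sounds" option, plus a rough estimate of where the 48 in the warning might come from. Everything in it is an assumption on my side rather than ImageBind's own code: `tile_to_duration` and `estimated_frames` are hypothetical helper names, and the 16 kHz sample rate, 25 ms window, and 10 ms frame shift are guesses about the preprocessing. Whether embeddings from a tiled clip are any better than embeddings from the original short clip is exactly the open question.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed sample rate in Hz (not confirmed by the warning)
CLIP_SECONDS = 2.0     # clip length the model is said to expect


def tile_to_duration(waveform, sample_rate=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Repeat a short waveform along its last axis until it spans `seconds`.

    Hypothetical helper: works on (samples,) or (channels, samples) arrays.
    Clips that are already long enough are truncated to the target length.
    """
    target = int(round(sample_rate * seconds))
    n = waveform.shape[-1]
    if n == 0:
        raise ValueError("cannot tile an empty waveform")
    if n >= target:
        return waveform[..., :target]
    repeats = -(-target // n)  # ceiling division
    reps = (1,) * (waveform.ndim - 1) + (repeats,)
    return np.tile(waveform, reps)[..., :target]


def estimated_frames(num_samples, sample_rate=SAMPLE_RATE,
                     window_ms=25.0, shift_ms=10.0):
    """Estimate the spectrogram frame count for a clip with no edge padding.

    The window and shift values are assumptions, not read from the model.
    """
    window = int(sample_rate * window_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift
```

Under these assumptions, a 500 ms clip (8000 samples) gives an estimate of 48 frames, which matches the `n_frames(48)` in the warning, and tiling that clip four times fills a 2-second window exactly. Note that tiling introduces a discontinuity at every repeat boundary, so it is not obviously preferable to using the short clip unchanged.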