ImageBind
Varying the sound length
Fantastic work! I have been evaluating the model using sound files of different lengths. For sounds shorter than the 2-second audio clips used in training (500 ms in this example), I get the following warning: WARNING:root:Large gap between audio n_frames(48) and target_length (204). Is the audio_target_length setting correct?
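For reference, the 48 frames in the warning are what a Kaldi-style filterbank would produce for a 500 ms clip if it used a 25 ms window and a 10 ms frame shift. Those two values are an assumption and are not confirmed anywhere in this thread. The helper name `num_frames` is hypothetical. A minimal sketch of the arithmetic:

```python
def num_frames(duration_ms: float, window_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    """Count frames for framing without edge padding.

    window_ms and shift_ms are assumed values, not taken from the thread.
    """
    if duration_ms < window_ms:
        return 0
    # One frame for the first window, plus one per full shift that still fits.
    return 1 + int((duration_ms - window_ms) // shift_ms)


short_clip_frames = num_frames(500)  # the 500 ms clip from the question
target_length = 204                  # target_length reported in the warning

# Frames that would have to be filled in to reach the target.
print(short_clip_frames, target_length - short_clip_frames)  # prints: 48 156
```

Under these assumptions, roughly three quarters of the target frames would carry no signal from the clip, which is presumably the gap the warning refers to.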
My question is how do sound clips of varying length affect the embedding output? In other words, can I still use embeddings from shorter clips, or should I duplicate shorter sounds to approximate the 2 seconds expected by the model?
Yes, I have the same question. One option might be to pad zero vectors at the end, but I do not know whether that would affect performance.
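The two options raised so far, zero-padding at the end and duplicating the clip, can be sketched as follows. Plain Python lists stand in for the waveform, both helper names are hypothetical, and the 16 kHz sample rate is an assumption. This only shows how the inputs would be built; it says nothing about which one yields better embeddings, which would have to be measured.

```python
from itertools import cycle, islice
from typing import List, Sequence


def pad_with_zeros(samples: Sequence[float], target_len: int) -> List[float]:
    """Append zeros (silence) until the clip has target_len samples."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    return list(samples) + [0.0] * (target_len - len(samples))


def repeat_to_length(samples: Sequence[float], target_len: int) -> List[float]:
    """Loop the clip from its start until it has target_len samples."""
    if not samples:
        raise ValueError("cannot repeat an empty clip")
    return list(islice(cycle(samples), target_len))


sample_rate = 16_000                  # assumed sample rate
clip = [0.5] * (sample_rate // 2)     # stand-in for a 500 ms clip
target_len = 2 * sample_rate          # 2 seconds, as in the question

padded = pad_with_zeros(clip, target_len)
looped = repeat_to_length(clip, target_len)
print(len(padded), len(looped))
```

One way to compare them empirically would be to embed the padded and looped versions of the same clips and check how close each stays to the embedding of a full-length recording of the same sound.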
I have similar questions related to this
Same question
I have the same question.
Has anyone figured out an answer to this, perhaps through some empirical experiments?