Does the model support variable audio lengths?
Hi, thanks for sharing this great work! I have a lot of audio files of different lengths, so I'd like to know whether the model supports variable audio lengths. A related question: some audio events need more context to produce a good embedding for classification. I could remove the silence, but I don't know how to keep the spectrogram smooth; if I cut out silence and create a step in the waveform, I think the spectrogram gets polluted and the embedding may suffer. How should I process these audio files? Thanks, looking forward to your reply.
Hi, Thank you for your interest!
Yes, the model supports variable audio lengths, as this was a requirement for the HEAR challenge. However, our models were trained on 10-second clips, so the trained time-positional encodings are only available for 10 seconds. In the challenge, in order to get scene embeddings for longer audio clips, we used a simple approach: averaging the predictions over 10-second windows (with overlap), as implemented here.
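For reference, here is a minimal sketch of that windowed-averaging idea; the `get_scene_embeddings` callable, sample rate, and hop size are assumptions for illustration, not the exact code linked above:

```python
import torch

def scene_embedding_for_long_clip(audio, model, get_scene_embeddings,
                                  sr=32000, win_s=10.0, hop_s=5.0):
    """Average scene embeddings over overlapping 10-second windows.

    audio: (batch, samples) waveform tensor at sample rate `sr`.
    get_scene_embeddings: HEAR-style function returning (batch, emb_dim).
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    if audio.shape[1] <= win:
        return get_scene_embeddings(audio, model)

    chunks = []
    for start in range(0, audio.shape[1] - win + 1, hop):
        chunks.append(get_scene_embeddings(audio[:, start:start + win], model))
    # One embedding per window -> average into a single scene embedding per clip.
    return torch.stack(chunks, dim=0).mean(dim=0)
```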
For the timestamp embeddings, we submitted a "base" model with a 160 ms window and a "2 Level" model with a larger 800 ms window. More precisely, we concatenated the embeddings, as implemented here. Generally, the 2 Level model performed better in the results, with the exception of FSD50K.
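In other words, per timestamp the two embeddings are concatenated along the feature dimension; a rough sketch (the tensor shapes are assumptions for illustration):

```python
import torch

def two_level_timestamp_embedding(emb_base, emb_long):
    """Concatenate per-timestamp embeddings from the 160 ms and 800 ms models.

    emb_base: (batch, time, d1), emb_long: (batch, time, d2), timestamps aligned.
    Returns (batch, time, d1 + d2).
    """
    assert emb_base.shape[:2] == emb_long.shape[:2], "timestamps must be aligned"
    return torch.cat([emb_base, emb_long], dim=-1)
```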
I'm not sure which preprocessing method would be the best, but I'd guess that silence trimming won't affect the performance to a large extent.
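If you do want to remove silence, trimming only the leading and trailing silence keeps the remaining waveform contiguous; here is a minimal sketch using librosa (the sample rate and top_db threshold are just example values to tune):

```python
import librosa

def load_and_trim(path, sr=32000, top_db=30):
    """Load audio and drop leading/trailing silence quieter than `top_db` dB."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    # librosa.effects.split could also cut internal silence, but re-joining the
    # segments creates the waveform discontinuities mentioned above, so edge
    # trimming is the safer option.
    return y_trimmed
```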
I hope this helps!
Thanks for your reply, I learned a lot. Now I want to retrain the model on my own data and fine-tune the pretrained model for 2 classes. How can I do that? Can you give some examples? Thanks a lot!
Hi! Sure, here I call this function. If you use the argument n_classes=2, you'll get a model with the pretrained embedding and a new classifier, which you can then fine-tune on your task.
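A minimal fine-tuning sketch along those lines; the import path and the model's exact return value depend on the repository, so treat them as placeholders to check against the function linked above:

```python
import torch
import torch.nn as nn
# Assumed import path; use the model-building function linked in the reply above.
from hear21passt.models.passt import get_model

model = get_model(n_classes=2)  # pretrained backbone + fresh 2-class head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

model.train()
for specs, labels in train_loader:  # your own DataLoader of mel spectrograms
    out = model(specs)
    logits = out[0] if isinstance(out, tuple) else out  # some versions return (logits, embedding)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```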
@kkoutini likely related to this:
Is it correct that the default img_size=(128, 998) throws a warning?
/usr/local/lib/python3.7/dist-packages/hear21passt/models/passt.py:260: UserWarning: Input image size (128*1000) doesn't match model (128*998).
warnings.warn(f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]}).")
Yes, unfortunately the 998 was due to a pre-processing bug in the original pre-trained weights. It should have a minimal effect on the output if you input 1000 frames (the last 2 frames will simply be ignored by the model).
> yes, unfortunately the 998 was due to a pre-processing bug in the original pre-trained weights and should have a minimal effect on the output if you input 1000 frames (the last 2 frames will be ignored by the model).
@kkoutini thanks for the pointer. I'm still not sure why this is also raised for inputs with less than 998/1000 frames. Is this due to pos-enc interpolation?
Is there any use in changing parameters like scene_embedding_size? My inputs are 20 seconds long.
> I'm still not sure why this is also raised for inputs with less than 998/1000 frames. Is this due to pos-enc interpolation?
This warning is always shown whenever the input size doesn't match the size the model was trained with, regardless of whether the input is shorter or longer.
> Is there any use in changing parameters like scene_embedding_size? My inputs are 20 seconds long.
Unfortunately, the scene_embedding_size comes from the embedding size (768) plus the logits (527) of the pretrained model. In later experiments, we found that using the embedding only yields better performance on the HEAR tasks. Therefore, one option would be to set scene_embedding_size=768 and mode="embed_only".
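With the hear21passt wrapper, that would look roughly like this (a sketch only; verify the exact load_model keywords against the package):

```python
import torch
from hear21passt.base import load_model, get_scene_embeddings

# mode="embed_only" drops the 527 logits, leaving the 768-dim transformer
# embedding, i.e. scene_embedding_size = 768.
model = load_model(mode="embed_only")  # keyword taken from the discussion above

audio = torch.zeros(1, 32000 * 20)        # e.g. a 20-second clip at 32 kHz
emb = get_scene_embeddings(audio, model)  # expected shape: (1, 768)
```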
@kkoutini thanks! I guess this can be closed then