emotion2vec
Model is very sensitive to tiny changes in the spectrum
Hi, thanks for the great work! I found that the model is highly sensitive to subtle changes in the frequency content of its input, and this greatly affects its performance. For example, when funasr loads the model and processes 22050 Hz audio, it resamples with torchaudio by default. However, the predictions obtained this way differ significantly from those obtained when the audio is resampled with Sox before being passed to the model. I compared the spectrograms from the two resampling methods and found that the high frequencies present in the torchaudio-resampled audio were missing from the Sox-resampled audio. After I used Audition to remove the high-frequency parts, the prediction results were correct. Additionally, I tried deleting all content above 4 kHz and found that the predictions were inaccurate again. It's possible that the model did not undergo much frequency-domain data augmentation during training, leading to over-sensitivity to irrelevant details, which greatly limits its practical use. Is there a new version that addresses this issue?
Here's an example of resampling audio to 16 kHz using torchaudio and Sox:
and the prediction results are:
torchaudio:
rtf_avg: 0.073: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 3.72it/s]
[{'key': 'happy_16k', 'labels': ['生气/angry', '厌恶/disgusted', '恐惧/fearful', '开心/happy', '中立/neutral', '其他/other', '难过/sad', '吃惊/surprised', '
Sox:
rtf_avg: 0.078: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 3.51it/s]
[{'key': 'asta-happy_ref', 'labels': ['生气/angry', '厌恶/disgusted', '恐惧/fearful', '开心/happy', '中立/neutral', '其他/other', '难过/sad', '吃惊/surprised', '
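To put a number on the spectrogram difference described above, one option is to measure how much of a waveform's energy lies above a chosen cutoff. The following is only a minimal sketch using NumPy on synthetic tones: the helper name `high_band_energy_ratio`, the 6 kHz cutoff, and the test signals are assumptions made for illustration, not part of funasr, torchaudio, or Sox, and real recordings would need to be loaded separately.

```python
import numpy as np

def high_band_energy_ratio(wave, sample_rate, cutoff_hz):
    """Fraction of total spectral energy at or above cutoff_hz (hypothetical helper)."""
    wave = np.asarray(wave, dtype=np.float64)
    power = np.abs(np.fft.rfft(wave)) ** 2            # one-sided power spectrum
    freqs = np.fft.rfftfreq(wave.size, d=1.0 / sample_rate)
    total = power.sum()
    if total == 0.0:
        return 0.0                                     # silent input: no energy anywhere
    return float(power[freqs >= cutoff_hz].sum() / total)

# Synthetic stand-ins for the two resampled files (assumed signals, not real data).
sr = 16000
t = np.arange(sr) / sr                                 # one second of samples
low = np.sin(2 * np.pi * 1000 * t)                     # 1 kHz component
high = 0.5 * np.sin(2 * np.pi * 7000 * t)              # 7 kHz component at half amplitude

full_band = low + high                                 # keeps high-frequency content
band_limited = low                                     # high frequencies removed

ratio_full = high_band_energy_ratio(full_band, sr, 6000)
ratio_limited = high_band_energy_ratio(band_limited, sr, 6000)
print(ratio_full, ratio_limited)
```

Running the same measurement on the two resampled versions of one recording would show whether they differ mainly in the upper band, which is what the spectrogram comparison suggests.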
That’s a great question! @mayqinxu
I’m still a bit confused about this too, since the audio for my downstream training is 44.1 kHz. If the source input isn’t 16 kHz, what would be the best method to resample it to 16 kHz? @ddlBoJack
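One reason resamplers can disagree in the high band is that each applies its own low-pass filter before changing the rate, and the filter's cutoff and steepness decide how much content near the new Nyquist frequency survives. The sketch below is a plain windowed-sinc resampler written with NumPy to illustrate that idea. It is not the implementation used by torchaudio, Sox, or the emotion2vec authors, and the names `resample_sinc`, `half_width`, and `rolloff` are hypothetical parameters chosen for this example.

```python
import numpy as np

def resample_sinc(wave, orig_sr, target_sr, half_width=32, rolloff=1.0):
    """Illustrative Hann-windowed sinc resampler (assumed design, not a library API)."""
    wave = np.asarray(wave, dtype=np.float64)
    # Low-pass cutoff as a fraction of the input Nyquist frequency.
    # rolloff < 1.0 moves the cutoff below the target Nyquist and drops more highs.
    cutoff = min(1.0, target_sr / orig_sr) * rolloff
    n_out = wave.size * target_sr // orig_sr
    positions = np.arange(n_out) * orig_sr / target_sr   # output times in input samples
    out = np.zeros(n_out)
    for i, pos in enumerate(positions):
        center = int(pos)
        idx = np.arange(center - half_width + 1, center + half_width + 1)
        idx = idx[(idx >= 0) & (idx < wave.size)]        # stay inside the signal
        d = pos - idx                                     # distance to each input sample
        window = 0.5 * (1.0 + np.cos(np.pi * d / half_width))   # Hann taper
        kernel = cutoff * np.sinc(cutoff * d) * window   # band-limited interpolation kernel
        out[i] = np.dot(wave[idx], kernel)
    return out

# Usage on an assumed 1 kHz test tone sampled at 44.1 kHz for 0.1 s.
orig_sr, target_sr = 44100, 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(4410) / orig_sr)
resampled = resample_sinc(tone, orig_sr, target_sr)

# Compare against the same tone generated directly at 16 kHz, ignoring the edges.
reference = np.sin(2 * np.pi * 1000 * np.arange(resampled.size) / target_sr)
max_err = float(np.max(np.abs(resampled[50:-50] - reference[50:-50])))
print(resampled.size, max_err)
```

Lowering `rolloff` or shortening `half_width` changes how much energy just below 8 kHz survives, which is the kind of difference that could plausibly separate two otherwise reasonable resamplers. Whichever method is chosen, applying the same one at training and inference time keeps the input spectrum consistent.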
@mayqinxu @takipipo Any updates on this topic? I'm about to investigate it myself. It would most likely be best to use the same preprocessing the authors used for their 16 kHz training, but I don't see the code for audio resampling.