emotion2vec

Model is very sensitive to tiny changes in the spectrum

Open mayqinxu opened this issue 11 months ago • 3 comments

Hi, thanks for the great work! However, I found that the model is highly sensitive to subtle changes in the frequency content of the input, which greatly affects its performance. For example, when using funasr to load the model and process 22050 Hz audio, the default is to resample with torchaudio. However, the predictions obtained this way differ significantly from those obtained when the audio is resampled with Sox before being passed to the model. I compared the spectrograms from the two resampling methods and found that the high frequencies present in the torchaudio-resampled audio were missing from the Sox-resampled audio. After I used Audition to remove the high-frequency parts, the predictions were correct. Additionally, I tried deleting all content above 4 kHz and found that the predictions were inaccurate again. It's possible that the model did not see much frequency-domain data augmentation during training, leading to over-sensitivity to irrelevant details, which greatly limits its practical use. Is there a new version that addresses this issue?
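The spectrogram comparison described above could also be put into numbers. The following is only a minimal sketch using the Python standard library: `band_energy_ratio` and `tone` are hypothetical helpers (not part of funasr, torchaudio, or Sox), and the two synthetic tones stand in for real resampled audio. It reports the fraction of spectral energy at or above a cutoff, which should drop when a resampler filters out more of the high band.

```python
import cmath
import math


def band_energy_ratio(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz (naive DFT, O(n^2))."""
    n = len(samples)
    total = 0.0
    high = 0.0
    # Skip the DC bin and walk the one-sided spectrum up to Nyquist.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def tone(freq_hz, sample_rate, n):
    """Synthetic sine wave used here as a stand-in for real audio."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


sr, n = 16000, 320  # 50 Hz per DFT bin, so both tones fall on exact bins
low_only = tone(1000, sr, n)
low_and_high = [a + b for a, b in zip(low_only, tone(7000, sr, n))]

print("low only:    ", band_energy_ratio(low_only, sr, 6000))
print("low and high:", band_energy_ratio(low_and_high, sr, 6000))
```

The naive DFT is only practical for short excerpts; for full files an FFT-based routine would be the usual choice.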

mayqinxu, Dec 19 '24 09:12

Here's an example of resampling audio to 16 kHz using torchaudio and Sox:

[Attachments: happy_torchaudio, happy_sox]

The prediction results are:

torchaudio: rtf_avg: 0.073: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 3.72it/s] [{'key': 'happy_16k', 'labels': ['生气/angry', '厌恶/disgusted', '恐惧/fearful', '开心/happy', '中立/neutral', '其他/other', '难过/sad', '吃惊/surprised', ''], 'scores': [0.00010870184632949531, 5.81611811867333e-06, 3.5695826227311045e-05, 0.29204338788986206, 2.16168409679085e-05, 1.1871205407576468e-11, 4.007801544503309e-05, 0.7077447175979614, 2.7539997192460586e-12]}]

Sox: rtf_avg: 0.078: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 3.51it/s] [{'key': 'asta-happy_ref', 'labels': ['生气/angry', '厌恶/disgusted', '恐惧/fearful', '开心/happy', '中立/neutral', '其他/other', '难过/sad', '吃惊/surprised', ''], 'scores': [0.00015378545504063368, 5.935596163908485e-06, 3.130449113086797e-05, 0.7070081233978271, 1.4909569472365547e-05, 3.165902956459021e-11, 2.4581770048826e-05, 0.2927614450454712, 2.931226442126622e-12]}]
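To make the difference between the two outputs above easier to read, here is a small sketch that picks the highest-scoring label from each result. The labels and scores are copied from the two outputs; `top_label` is a hypothetical helper, not a funasr function.

```python
LABELS = ['生气/angry', '厌恶/disgusted', '恐惧/fearful', '开心/happy',
          '中立/neutral', '其他/other', '难过/sad', '吃惊/surprised', '']

# Scores copied verbatim from the two prediction outputs.
TORCHAUDIO_SCORES = [0.00010870184632949531, 5.81611811867333e-06,
                     3.5695826227311045e-05, 0.29204338788986206,
                     2.16168409679085e-05, 1.1871205407576468e-11,
                     4.007801544503309e-05, 0.7077447175979614,
                     2.7539997192460586e-12]
SOX_SCORES = [0.00015378545504063368, 5.935596163908485e-06,
              3.130449113086797e-05, 0.7070081233978271,
              1.4909569472365547e-05, 3.165902956459021e-11,
              2.4581770048826e-05, 0.2927614450454712,
              2.931226442126622e-12]


def top_label(labels, scores):
    """Return the (label, score) pair with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


for name, scores in [("torchaudio", TORCHAUDIO_SCORES), ("sox", SOX_SCORES)]:
    label, score = top_label(LABELS, scores)
    print(f"{name}: {label} ({score:.3f})")
```

The top label flips from 吃惊/surprised (torchaudio) to 开心/happy (Sox), with the two classes almost exactly swapping their scores.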

mayqinxu, Dec 19 '24 09:12

That’s a great question! @mayqinxu

I’m still a bit confused about this too, since my downstream training audio is 44.1 kHz. If the source input isn’t 16 kHz, what would be the best method to resample it to 16 kHz? @ddlBoJack
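For experimenting with this, the sketch below shows one way to control how much high-frequency content survives torchaudio resampling, via the `rolloff` argument of `torchaudio.functional.resample`. This is not confirmed to be the preprocessing the authors used for training. `lowpass_cutoff_hz` and `resample_with_torchaudio` are hypothetical helpers, and the cutoff formula is an approximation based on the documented meaning of `rolloff` as a fraction of the Nyquist frequency.

```python
def lowpass_cutoff_hz(orig_sr, target_sr, rolloff):
    """Approximate anti-aliasing cutoff: a fraction `rolloff` of the lower Nyquist."""
    return rolloff * min(orig_sr, target_sr) / 2


def resample_with_torchaudio(waveform, orig_sr, target_sr=16000, rolloff=0.99):
    """Resample a (channels, time) float tensor with an explicit roll-off.

    A lower `rolloff` removes more of the highest frequencies.
    """
    # Imported lazily so the cutoff helper works without torch/torchaudio.
    import torchaudio.functional as F

    return F.resample(waveform, orig_freq=orig_sr, new_freq=target_sr, rolloff=rolloff)


# Approximate cutoffs for 44.1 kHz -> 16 kHz at two roll-off settings.
for rolloff in (0.99, 0.5):
    print(rolloff, lowpass_cutoff_hz(44100, 16000, rolloff))
```

Resampling the same file at several `rolloff` values and comparing the predictions would be one way to check how much of the behavior reported in this issue depends on the high band.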

takipipo, Dec 25 '24 08:12

@mayqinxu @takipipo Any updates on this topic? I'm about to investigate it myself. It would most likely be best to use the same preprocessing the authors used for their 16 kHz training, but I can't find the code for audio resampling.

niemiaszek, May 05 '25 16:05