encodec

About audio quality evaluation

Open sh-lee-prml opened this issue 2 years ago • 3 comments

❓ Questions

Thank you for the nice work.

I have some questions about the objective evaluation metrics.

  1. Are these metrics (SI-SNR and ViSQOL) consistent with audio quality perceptually?

I know that it is very difficult to evaluate the audio quality. So I'm so curious how to evaluate the model during ablation studies or during training.

  2. MS-STFT discriminator (complex) vs. MS-STFT discriminator (real)

How is the quality of the model when the MS-STFT discriminator uses only real values? I would appreciate it if you could share any such information.

Thank you!

sh-lee-prml avatar Nov 02 '22 15:11 sh-lee-prml

Hi @sh-lee-prml,

  1. These metrics are consistent with human perception up to a point. They are definitely not perfect, but they can give you a nice proxy for faster model evaluation. In any case, we recommend running full subjective tests in addition to these metrics once you reach a good enough model.
  2. During preliminary experiments we found that the complex MS-STFT was better for us, so we kept developing it. Since we did not do an in-depth comparison, I cannot say much about the exact differences in quality between them.
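For readers who want to track SI-SNR as a quick proxy during training, here is a minimal NumPy sketch of the standard scale-invariant SNR definition. It is not taken from the Encodec code base; the function name `si_snr` and the `eps` default are illustrative assumptions.

```python
import numpy as np

def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; not the Encodec implementation.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Remove the mean so a DC offset does not affect the score.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything in the estimate that is not explained by the reference.
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```

Because the estimate is projected onto the reference, rescaling the decoded audio leaves the score essentially unchanged. That makes the metric convenient for ablations, but it also means it says nothing about loudness and little about perceptual artifacts.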

adiyoss avatar Nov 02 '22 15:11 adiyoss

Thank you for your quick reply!

When I compared models trained with the complex MS-STFT discriminator and the real MS-STFT discriminator, they had similar performance on Mel reconstruction error and PESQ. However, in my own listening, the complex MS-STFT version was slightly better on some samples. That is why I asked whether these metrics are really useful. I will try them and share the results :)

Also, what do you think about the Mel-spectrogram reconstruction error (or MS-STFT error) or PESQ as evaluation metrics?

Thank you!
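As a rough illustration of what a multi-scale STFT error can look like, the sketch below averages the L1 distance between log-magnitude spectrograms over several FFT sizes. It is an assumption-laden example, not the loss used in Encodec and not a Mel-scale metric: the names `stft_magnitude` and `multi_scale_stft_error`, the FFT sizes, the hop of `n_fft // 4`, and `eps` are all hypothetical choices.

```python
import numpy as np

def stft_magnitude(signal, n_fft, hop):
    """Magnitude STFT with a Hann window (frames that do not fit are dropped)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_scale_stft_error(reference, estimate, n_ffts=(512, 1024, 2048), eps=1e-5):
    """Mean L1 distance between log-magnitude STFTs, averaged over FFT sizes.

    Hypothetical helper; both signals must be at least max(n_ffts) samples long.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    total = 0.0
    for n_fft in n_ffts:
        hop = n_fft // 4  # assumed 75% overlap
        ref_mag = stft_magnitude(reference, n_fft, hop)
        est_mag = stft_magnitude(estimate, n_fft, hop)
        total += np.mean(np.abs(np.log(ref_mag + eps) - np.log(est_mag + eps)))
    return total / len(n_ffts)
```

A spectral distance like this ignores phase and weights all frequency bins equally, which is one plausible reason two models can score similarly while sounding different.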

sh-lee-prml avatar Nov 02 '22 15:11 sh-lee-prml

  1. Are these metrics (SI-SNR and ViSQOL) consistent with audio quality perceptually?

I know that it is very difficult to evaluate the audio quality. So I'm so curious how to evaluate the model during ablation studies or during training.

For speech enhancement, DNSMOS did give a better objective evaluation when subjective evaluation wasn't feasible. I'm not sure whether the same applies to this problem (even though it's a non-intrusive evaluation metric), but I would suggest giving DNSMOS a try.

stonelazy avatar Nov 11 '22 01:11 stonelazy

DNSMOS definitely looks interesting! ViSQOL is supposed to correlate with human perception, but it definitely has a lot of shortcomings and fails to capture the presence of artifacts. While we are unlikely to go back to the evaluation stage of Encodec, having already conducted extensive evaluation, we will keep that in mind for future work!

adefossez avatar Nov 17 '22 16:11 adefossez

@sh-lee-prml: is it possible for you to share your training script?

listener17 avatar Apr 05 '23 15:04 listener17