sigsep-mus-eval
add sdr metric from MDX challenge
We want to add the simplified SDR metric from the MDX challenge, but we realized that calling it "SDR" isn't clear. So we are willing to rename the simplified metric:
Please vote:
1️⃣ uSDR as coined by @luo42 in https://arxiv.org/abs/2209.15174
2️⃣ NSDR as coined by @adefossez in https://github.com/facebookresearch/demucs
3️⃣ SDR as used in the Challenge https://www.aicrowd.com/challenges/sound-demixing-challenge-2023
I got some feedback that this new metric is really just an SNR. Do you think SNR would be a good name? Maybe TSNR, for track-level SNR, to emphasize that it is computed over the entire track, not over segments? Because SDR was scale-invariant and the new one is not, reusing a name as close as NSDR could be a bit misleading, even if nothing in "SDR" actually says scale-invariant.
Hi Alexandre, actually only the source version of BSS Eval is scale-invariant; the image version is not (see, e.g., Eq. (2.1) on page 25 of https://theses.hal.science/tel-01684685/document).
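For concreteness, the scale-(in)variance distinction being discussed can be sketched in a few lines of NumPy. This is a minimal sketch, not the challenge's actual implementation, and the function names are mine:

```python
import numpy as np

def snr(reference, estimate, eps=1e-8):
    """Plain SNR over a whole signal: sensitive to rescaling the estimate."""
    noise = reference - estimate
    return 10 * np.log10((np.sum(reference ** 2) + eps)
                         / (np.sum(noise ** 2) + eps))

def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR: project the estimate onto the reference first,
    so multiplying the estimate by any nonzero gain leaves the score unchanged."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    return snr(alpha * reference, estimate, eps)
```

With these definitions, `si_snr(ref, est)` equals `si_snr(ref, 2 * est)`, while `snr` changes under the same rescaling.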
4️⃣ TSNR track-level SNR as proposed by @adefossez
5️⃣ uSNR utterance-level SNR
Actually, let's do a vote on Twitter: https://twitter.com/faroit/status/1620414432395558912?s=20
To me, this is simply SNR, which can be gamed by rescaling the mixture... I'm actually worried about whether this is the right metric, tbh.
@Jonathan-LeRoux but that is the same problem for the "real" SDR, and for music separation we usually can't really use scale invariance given the applications... 🤷‍♂️
We will have a perceptual part in the challenge this time but we need to drop SNR/SDR very soon, I agree.
Following up on what I wrote on Twitter: I looked at the MDX paper and it doesn't look like the final metric uses any median averaging, it's all classical averages. In which case I think anybody reading "SNR" would imagine they'd compute the SNR of a whole song over the 2 channels, then average that over songs, separately for each instrument (and average again to get the final metric).

Regarding the relevance of the metric and the issue with scale invariance: I agree that allowing scale invariance would be odd for music applications. And in most cases, one can hope that the systems are doing a good enough job at removing other sources that there is no significant game to be played by a simple rescaling. But clearly that's not the case for the mixture, and the rescaled mixture is used as a baseline, which I find misleading. One option could be to ensure mixture consistency across the stems, but that could also penalize some methods...
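If I read that scheme correctly (per-song SNR pooled over both channels, a plain mean over songs per instrument, then a mean over instruments, no medians anywhere), it would look roughly like this. This is a sketch with an assumed data layout, not the challenge's actual code:

```python
import numpy as np

def track_snr(reference, estimate, eps=1e-8):
    """SNR of one whole song, pooled over all channels and samples."""
    noise = reference - estimate
    return 10 * np.log10((np.sum(reference ** 2) + eps)
                         / (np.sum(noise ** 2) + eps))

def challenge_score(references, estimates):
    """references/estimates: dict instrument -> list of (channels, samples)
    arrays, one per song. Returns (per-instrument mean SNRs, overall mean)."""
    per_instrument = {
        inst: float(np.mean([track_snr(ref, est)
                             for ref, est in zip(references[inst],
                                                 estimates[inst])]))
        for inst in references
    }
    return per_instrument, float(np.mean(list(per_instrument.values())))
```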