Enhancement: New metric for source separation, separately measuring bleed and fullness in separated audio
Hi,
I've found a simple way to objectively measure bleed and fullness in the context of music source separation. I think it could be useful, as I haven't seen any existing objective metric that does this, while it's a common question from users.
Here is the code for the metric:
```python
import numpy as np
import librosa

def bleed_full(ref, est, sr=44100):
    """Return (bleedness, fullness) scores; 100 is a perfect score for both.
    ref and est must be mono signals of the same length."""
    # STFT parameters
    n_fft = 4096
    hop_length = 1024
    n_mels = 512

    # Compute magnitude STFTs
    D1 = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop_length))
    D2 = np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop_length))

    # Convert to mel spectrograms
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    S1_mel = np.dot(mel_basis, D1)
    S2_mel = np.dot(mel_basis, D2)

    # Convert to decibels
    S1_db = librosa.amplitude_to_db(S1_mel)
    S2_db = librosa.amplitude_to_db(S2_mel)

    # dB difference: positive bins = content added vs. the reference (bleed),
    # negative bins = content missing from the estimate (lost fullness)
    diff = S2_db - S1_db

    # Separate positive and negative differences
    positive_diff = diff[diff > 0]
    negative_diff = diff[diff < 0]

    # Average each side (0 if that side is empty)
    average_positive = np.mean(positive_diff) if len(positive_diff) > 0 else 0
    average_negative = np.mean(negative_diff) if len(negative_diff) > 0 else 0

    # Scale so that 100 is the best score
    bleedness = 100 / (average_positive + 1)
    fullness = 100 / (-average_negative + 1)
    return bleedness, fullness
```
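For reference, a minimal usage sketch (the file names are placeholders of my own; the two signals are trimmed to a common length so the STFT shapes match):

```python
# Hypothetical file names, for illustration only
ref, sr = librosa.load("vocals_reference.wav", sr=44100, mono=True)
est, _ = librosa.load("vocals_estimate.wav", sr=44100, mono=True)

n = min(len(ref), len(est))  # align lengths before comparing spectrograms
bleedness, fullness = bleed_full(ref[:n], est[:n], sr=sr)
print(f"bleedness: {bleedness:.2f} / 100, fullness: {fullness:.2f} / 100")
```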
I guess it could be adapted as a loss, but I'm not a dev/scientist and I lack the knowledge to make it bulletproof; if it's worth doing, you'll know better than me.
The same concept can be used to draw spectrograms, for example with bleed/positive values in red, missing content/negative values in blue, and perfect separation (= 0) in white:
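A minimal sketch of such a plot, reusing the same mel-dB pipeline as `bleed_full` (the diverging `bwr` colormap and the symmetric color range are my own choices for mapping 0 dB to white):

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

def plot_bleed_full(ref, est, sr=44100, n_fft=4096, hop_length=1024, n_mels=512):
    # Same mel-dB pipeline as in bleed_full
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    S1_db = librosa.amplitude_to_db(
        mel_basis @ np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop_length)))
    S2_db = librosa.amplitude_to_db(
        mel_basis @ np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop_length)))
    diff = S2_db - S1_db

    # Symmetric color range so a 0 dB difference maps to white in 'bwr'
    v = np.max(np.abs(diff))
    plt.figure(figsize=(10, 4))
    plt.imshow(diff, aspect="auto", origin="lower", cmap="bwr", vmin=-v, vmax=v)
    plt.colorbar(label="dB difference")
    plt.title("red = bleed, blue = missing content, white = perfect separation")
    plt.xlabel("frames")
    plt.ylabel("mel bins")
    plt.show()
```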
@jarredou I'm curious about this. So basically:
Instead of computing an L1 mel spectral distance, you separate it into two components:
- Bleed = anything ADDED to the target spectrogram
- (Negative) fullness = anything REMOVED from the target spectrogram
I see you do MSS work. I noted in the BS-Roformer paper that the authors wrote: "our model outputs gained more preference from musicians and educators than from music producers in the listening test of SDX23". To my ears, BS-Roformers seem to have less bleed but also less fullness. I'd be curious if you have any numbers to share. (cc @ZFTurbo)
@turian Yeah, that's the simple idea behind the 2 metrics.
About the BS-Roformer quote: it's from the final paper of the SDX/MDX23 contest, https://arxiv.org/pdf/2308.06979
We don't have numbers comparing different neural network models. For now, the metrics were only used to evaluate different fine-tuned versions made on top of Kimberley's Melband-Roformer model; the results are accessible here: https://docs.google.com/spreadsheets/d/1pPEJpu4tZjTkjPh_F5YjtIyHq8v0SxLnBydfUBUNlbI/edit (evaluation made using the mvsep.com multisong eval dataset).
ZFTurbo added a torch version of the metric to his training script a few days ago.
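I haven't checked ZFTurbo's actual implementation, but a torch port could look roughly like this (a sketch of my own using torchaudio; the function name and parameters are my assumptions):

```python
import torch
import torchaudio

# Mel magnitude spectrogram with the same parameters as the numpy version
_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100, n_fft=4096, hop_length=1024, n_mels=512, power=1.0
)

def bleed_full_torch(ref: torch.Tensor, est: torch.Tensor):
    # Hypothetical torch port of bleed_full -- not ZFTurbo's actual code.
    # amplitude_to_DB with multiplier=20 mirrors librosa.amplitude_to_db.
    S1 = torchaudio.functional.amplitude_to_DB(_mel(ref), 20.0, 1e-5, 0.0)
    S2 = torchaudio.functional.amplitude_to_DB(_mel(est), 20.0, 1e-5, 0.0)
    diff = S2 - S1

    pos = diff[diff > 0]
    neg = diff[diff < 0]
    avg_pos = pos.mean() if pos.numel() > 0 else diff.new_zeros(())
    avg_neg = neg.mean() if neg.numel() > 0 else diff.new_zeros(())

    bleedness = 100.0 / (avg_pos + 1.0)
    fullness = 100.0 / (-avg_neg + 1.0)
    return bleedness, fullness
```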
Little update: the metric was used as a loss to emphasize fullness on a vocals model, and it does a great job at that task, especially at extracting the reverb more fully (it also gives more clarity to the vocal consonants in the high-frequency range).
The 1st pic is Kim's original model; the 2nd one is the fine-tuned version emphasizing vocal fullness (at the cost of a slightly noisier separation):
(All these experiments are done inside the Audio Separation Discord community; invite: https://discord.gg/ndC4UmPZwZ)
@jarredou Hi! I'm interested in this metric. Did you use it directly, on its own, as the loss, using the definition you mentioned? When you mention 'emphasize', does that imply it should be combined with other traditional losses like a waveform loss or an STFT loss?
Can we just use an STFT magnitude loss with different weights when predict > target and when predict < target?
Setting a larger weight when predict > target would mean emphasizing bleedlessness, and vice versa.
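To make the idea concrete, here is a minimal sketch of such an asymmetrically weighted STFT magnitude loss (my own illustration; `w_over` and `w_under` are hypothetical knobs, not taken from any existing training script):

```python
import torch

def asymmetric_stft_mag_loss(pred, target, n_fft=4096, hop_length=1024,
                             w_over=2.0, w_under=1.0):
    """L1 magnitude-STFT loss with different weights for over- and
    under-estimation of the target spectrogram."""
    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop_length, window=window,
                   return_complex=True).abs()
    T = torch.stft(target, n_fft, hop_length, window=window,
                   return_complex=True).abs()
    diff = P - T
    # Positive diff: energy the model added (bleed).
    # Negative diff: energy the model removed (lost fullness).
    loss = torch.where(diff > 0, w_over * diff, -w_under * diff)
    return loss.mean()
```

With `w_over > w_under` the model is penalized more for adding energy, i.e. it emphasizes bleedlessness; swapping the weights emphasizes fullness instead, matching the "vice versa" above.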