open-unmix-pytorch
[Question] Ideal/oracle performance of source estimate + mix phase
Hello, I've been interested in running various oracle benchmark methods to check if different types of spectrogram (CQT, etc.) can be useful for source separation. Initially, I was working with the IRM1/2 and IBM1/2 from https://github.com/sigsep/sigsep-mus-oracle
However I noticed that Open-Unmix uses the strategy of "estimate of source magnitude + phase of original mix" (but it has an option to use soft masking instead). Is it valuable to create an "oracle phase-inversion" method?
The soft-mask (IRM1) performance "ceiling", i.e. the known IRM1 oracle mask calculation, looks like this (using the vocals stem as an example):
mix = <load mix> # mixed track
vocals_gt = <load vocals stem> # ground truth
vocals_irm1 = abs(stft(vocals_gt)) / sum(abs(stft(stem)) for stem in <all stems>) # IRM1 normalizes by the sum of source magnitudes
vocals_est = istft(vocals_irm1 * stft(mix)) # estimate after "round trip" through soft mask
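For reference, here is a minimal runnable sketch of the soft-mask round trip above. It assumes the IRM1 denominator is the sum of the source magnitudes (which is how I read the `model` accumulation in sigsep-mus-oracle's IRM.py), uses synthetic signals as hypothetical stand-ins for the stems, and the names `irm1_oracle`, `NPERSEG` and `HOP` are made up for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

NPERSEG, HOP = 4096, 1024  # Open-Unmix-style window and hop (assumed here)


def irm1_oracle(sources, target, eps=np.finfo(float).eps):
    """Oracle IRM1 estimate of sources[target] from the sum of all sources."""
    noverlap = NPERSEG - HOP  # scipy takes the overlap length, not the hop
    mix = sum(sources.values())
    X = stft(mix, nperseg=NPERSEG, noverlap=noverlap)[-1]
    mags = {
        name: np.abs(stft(sig, nperseg=NPERSEG, noverlap=noverlap)[-1])
        for name, sig in sources.items()
    }
    # IRM1 (alpha = 1): target magnitude over the sum of all source magnitudes
    mask = mags[target] / (eps + sum(mags.values()))
    # apply the mask to the complex mix STFT and go back to the time domain
    est = istft(mask * X, nperseg=NPERSEG, noverlap=noverlap)[1]
    return est[: len(mix)], mask


# Synthetic stand-ins for the stems (hypothetical data, not MUSDB18)
rng = np.random.default_rng(0)
t = np.arange(2 * 44100) / 44100.0
sources = {
    "vocals": np.sin(2 * np.pi * 440.0 * t),
    "other": 0.3 * rng.standard_normal(t.size),
}
vocals_est, vocals_mask = irm1_oracle(sources, "vocals")
```

With this definition the mask can never exceed 1, because the target magnitude is one of the terms in the denominator.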
Now, for the phase inversion method, we could do the following:
mix = <load mix> # mixed track
vocals_gt = <load vocals stem> # ground truth
mix_phase = phase(stft(mix))
vocals_gt_magnitude = abs(stft(vocals_gt))
vocals_stft = pol2cart(vocals_gt_magnitude, mix_phase)
vocals_est = istft(vocals_stft) # estimate after "round trip" through phase inversion
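A minimal runnable sketch of the same idea, assuming `np.angle` for `phase` and a polar-to-cartesian step written as `magnitude * exp(1j * phase)`. The signals are synthetic stand-ins for the real stems, and `mix_phase_oracle` is a hypothetical name.

```python
import numpy as np
from scipy.signal import stft, istft


def mix_phase_oracle(target, mix, nperseg=4096, noverlap=3072):
    """Combine the ground-truth target magnitude with the phase of the mix."""
    X = stft(mix, nperseg=nperseg, noverlap=noverlap)[-1]
    S = stft(target, nperseg=nperseg, noverlap=noverlap)[-1]
    # polar -> cartesian: |S| * exp(j * angle(X))
    Y = np.abs(S) * np.exp(1j * np.angle(X))
    est = istft(Y, nperseg=nperseg, noverlap=noverlap)[1]
    return est[: len(target)], Y


# Synthetic stand-ins (hypothetical data, not MUSDB18)
rng = np.random.default_rng(0)
t = np.arange(2 * 44100) / 44100.0
vocals_gt = np.sin(2 * np.pi * 440.0 * t)
mix = vocals_gt + 0.3 * rng.standard_normal(t.size)
vocals_est, vocals_stft = mix_phase_oracle(vocals_gt, mix)
```

The resulting STFT has exactly the ground-truth magnitude, so any error in `vocals_est` comes from the borrowed phase (and from the fact that the modified STFT is generally not a consistent one).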
Does this make sense to do? Has anybody done this before? What could this method be called?
OK, it seems to be working. Here's a piece of code, hacked together from https://github.com/sigsep/sigsep-mus-oracle/blob/master/IRM.py and unmix:
import museval
import numpy as np
from scipy.signal import istft, stft

# seed for the model sum (assumed: machine epsilon, as in sigsep-mus-oracle's IRM.py)
eps = np.finfo(float).eps


def atan2(y, x):
    r"""Element-wise arctangent function of y/x.
    copied from umx, replace torch with np
    (np.arctan2(y, x) should give the same result)
    """
    pi = 2 * np.arcsin(1.0)
    # work on a copy: an in-place `x += ...` would also modify the caller's array
    x = x + ((x == 0) & (y == 0)) * 1.0
    # x == 0 with y != 0 divides by zero here; those bins are overwritten below
    with np.errstate(divide="ignore"):
        out = np.arctan(y / x)
    out += ((y >= 0) & (x < 0)) * pi
    out -= ((y < 0) & (x < 0)) * pi
    out *= 1 - ((y > 0) & (x == 0)) * 1.0
    out += ((y > 0) & (x == 0)) * (pi / 2)
    out *= 1 - ((y < 0) & (x == 0)) * 1.0
    out += ((y < 0) & (x == 0)) * (-pi / 2)
    return out


def ideal_mixphase(track, eval_dir=None):
    """
    ideal performance of magnitude from estimated source + phase of mix
    which is the default umx strategy for separation
    """
    # STFT settings; note that scipy's noverlap is the overlap length, so
    # noverlap=1024 means a hop of 4096 - 1024 = 3072 (a hop of 1024 needs noverlap=3072)
    nperseg, noverlap = 4096, 1024
    N = track.audio.shape[0]  # number of samples, used to trim the istft output

    X = stft(track.audio.T, nperseg=nperseg, noverlap=noverlap)[-1].astype(np.complex64)
    (I, F, T) = X.shape

    # Compute sources spectrograms
    P = {}
    # compute model as the sum of spectrograms
    model = eps

    # parallelize this
    for name, source in track.sources.items():
        # compute spectrogram of target source:
        # magnitude of STFT
        src_coef = stft(source.audio.T, nperseg=nperseg, noverlap=noverlap)[-1].astype(np.complex64)
        P[name] = np.abs(src_coef)
        # store the original, not magnitude, in the mix
        model += src_coef

    # get mix phase/angle (same for every source, so compute it once)
    mix_phase = atan2(model.imag, model.real)

    # now performs separation
    estimates = {}
    accompaniment_source = 0
    for name, source in track.sources.items():
        source_mag = P[name]

        # use source magnitude estimate + mix phase
        Yj = np.multiply(source_mag, np.cos(mix_phase)) + 1j * np.multiply(source_mag, np.sin(mix_phase))

        # invert to time domain
        target_estimate = istft(Yj, nperseg=nperseg, noverlap=noverlap)[1].T[:N, :].astype(np.float32)

        # set this as the source estimate
        estimates[name] = target_estimate

        # accumulate to the accompaniment if this is not vocals
        if name != 'vocals':
            accompaniment_source += target_estimate

    estimates['accompaniment'] = accompaniment_source

    bss_scores = museval.eval_mus_track(
        track,
        estimates,
        output_dir=eval_dir,
    ).scores

    return estimates, bss_scores
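As a sanity check on the phase step, the hand-written quadrant-corrected arctangent can be compared against NumPy's built-ins. The sketch below uses a non-mutating port with the hypothetical name `atan2_manual` and a small grid that includes the axes and the origin; it suggests `np.arctan2` (or `np.angle` on the complex STFT) could be used directly.

```python
import numpy as np


def atan2_manual(y, x):
    """Non-mutating NumPy port of the quadrant-corrected arctangent."""
    x = x + ((x == 0) & (y == 0)) * 1.0  # avoid 0/0 at the origin
    with np.errstate(divide="ignore"):
        out = np.arctan(y / x)
    out += ((y >= 0) & (x < 0)) * np.pi
    out -= ((y < 0) & (x < 0)) * np.pi
    out *= 1 - ((y > 0) & (x == 0)) * 1.0
    out += ((y > 0) & (x == 0)) * (np.pi / 2)
    out *= 1 - ((y < 0) & (x == 0)) * 1.0
    out += ((y < 0) & (x == 0)) * (-np.pi / 2)
    return out


# Grid covering all four quadrants, both axes and the origin
grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
yy, xx = np.meshgrid(grid, grid, indexing="ij")
z = xx + 1j * yy

same_as_arctan2 = np.allclose(atan2_manual(yy, xx), np.arctan2(yy, xx))
same_as_angle = np.allclose(atan2_manual(z.imag, z.real), np.angle(z))
print(same_as_arctan2, same_as_angle)
```

The built-ins also avoid the divide-by-zero warnings and the in-place update of the real part that the manual version needs to work around.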
The maximum SDR of the "oracle mix phase" is lower than soft masking. Is that expected?
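One observation that may be relevant here: if the mix STFT is X = S_1 + ... + S_K, then "source magnitude + mix phase" is the same as multiplying X by the gain |S_j| / |X|, while IRM1 (as I read IRM.py) multiplies X by |S_j| / (|S_1| + ... + |S_K|). By the triangle inequality |X| <= |S_1| + ... + |S_K|, so the mix-phase gain is never smaller than the IRM1 gain and can exceed 1 in bins where the sources partly cancel. The sketch below checks this numerically on synthetic noise signals (hypothetical stand-ins for two stems). It does not by itself show that this accounts for the SDR gap.

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
n = 2 * 44100
# Hypothetical stand-ins for two stems; the mix is their sum
s1 = rng.standard_normal(n)
s2 = rng.standard_normal(n)


def spec(x):
    return stft(x, nperseg=4096, noverlap=3072)[-1]


S1, S2 = spec(s1), spec(s2)
X = S1 + S2  # the STFT is linear, so this is the STFT of s1 + s2
eps = np.finfo(float).eps

irm1_gain = np.abs(S1) / (np.abs(S1) + np.abs(S2) + eps)  # bounded by 1
mpi_gain = np.abs(S1) / (np.abs(X) + eps)                 # |S1| / |X|, unbounded

# The mix-phase estimate written two ways: polar form, and a gain applied to X
Y_polar = np.abs(S1) * np.exp(1j * np.angle(X))
Y_gain = mpi_gain * X

frac_above_one = np.mean(mpi_gain > 1.0)
print(f"share of bins where the mix-phase gain exceeds 1: {frac_above_one:.2f}")
```

So the two oracles differ only in the denominator of the gain, which might be a useful starting point for comparing their scores.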
It's a very interesting idea, I like it.
Could you provide numbers? How does it behave compared to the other oracles?
It's pretty underwhelming. Here is an evaluation of 4 tracks from the MUSDB18-HQ test set, with IRM1, IRM2, IBM1, IBM2, and the new one, "MPI" (Mixed Phase Inversion), with the Open-Unmix STFT settings (window = 4096, hop = 1024):
Open-Unmix is not the first place I've seen the "source magnitude estimate + mix phase" inversion. It's also used in the CDAE source separation algorithm (https://arxiv.org/abs/1703.08019), but I'm still curious why it is preferred to soft masking.
I will upload my code to generate the above results (it mostly just wraps sigsep tools) in a cleanly reproducible separate repo so I can link it here. I might be doing something wrong in my code somewhere.
Here: https://github.com/sevagh/mss-oracle-experiments#oracle-performance-of-mpi-mix-phase-inversion
Apologies if there is a lot of irrelevant code (related to the NSGT), but I hope the specific part of the new "Mixed Phase Inversion" oracle makes sense and is reproducible.
Also, I suppose SDR is not necessarily the king of metrics: we can see dramatically better ISR with mix-phase (but that could be a consequence of its lower SDR/SIR/SAR).
Also maybe mix-phase is more "robust" to worse estimates?