meta-tasnet
Support of stereo signals during separation (throughout the process)
I came across this model rather by accident and have only skimmed it so far, but I noticed that it can apparently only separate mono signals. From a technical point of view, mono separation is of course a much bigger challenge than stereo separation, because all the spatial information is lost.
So I asked myself if it is possible to use this additional information. Would it be very difficult to extend the existing model to do so?
I find the approaches very exciting, because I have already got very good results from other models like Demucs, which also works in the waveform domain.
But this model seems to be much lighter than Demucs, for example, and the results seem to be very comparable.
The model would be much better if it could handle stereo signals. Then it would hardly be inferior to Demucs.
Hi, thank you for your interest! The network could be extended to directly handle a stereo signal by changing the number of input/output channels from 1 to 2 in the encoder and decoder, respectively.
Hi, like FSharpCSharp I have been testing Demucs and the ConvTasnet model implemented by the Demucs group for a while, and I'm very impressed by the results of meta-tasnet and by how fast the model computes the separated stems.
How could I change the input/output channels in the code to run stereo separation in the Google Colab notebook? Thank you.
Hi, thanks! Please take a look at where we separate both channels independently. Let me know if that solves your problem :relaxed:
Thank you!! It took me "a little" while to understand the code, but now I can separate stereo audio in the Colab notebook.
I'm curious, how did you make it work? I am trying to make it work with 44.1 kHz stereo input and matching output, i.e. 4 stems, all stereo at 44.1 kHz.
Hi, the code works with 1.5-minute stereo songs, and the output is four stereo stems at a rate of 32000 Hz.
I have appended these lines to the resample definition from the
mix_left = [s[0:1, :, :] for s in mix]
mix_right = [s[1:2, :, :] for s in mix]
del mix
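The range slices `0:1` and `1:2` matter here: they keep the first axis, so each half still looks like a mono input to the network. A minimal sketch with NumPy arrays standing in for the torch tensors (the `[2, 1, T]` layout and the three sampling rates are assumptions inferred from the slices above; basic slicing behaves the same way on tensors):

```python
import numpy as np

# Hypothetical stand-in for `mix`: one array per sampling rate, with the two
# stereo channels on the first axis, i.e. shape [2, 1, T].
mix = [np.zeros((2, 1, t)) for t in (8000, 16000, 32000)]

# Range slices keep the first axis, so each half has the mono shape [1, 1, T].
mix_left = [s[0:1, :, :] for s in mix]
mix_right = [s[1:2, :, :] for s in mix]

# Integer indexing would drop that axis and give [1, T] instead.
dropped = mix[-1][0, :, :]

print(mix_left[-1].shape, mix_right[-1].shape, dropped.shape)
# (1, 1, 32000) (1, 1, 32000) (1, 32000)
```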
And then I duplicate the code for left and right channel:
with torch.no_grad():
    separationL = network.inference(mix_left, n_chunks=2)[-1] # call the network to obtain the separated audio with shape [1, 4, 1, T']
    separationR = network.inference(mix_right, n_chunks=2)[-1] # call the network to obtain the separated audio with shape [1, 4, 1, T']
# normalize the amplitudes by computing the least squares
# -> we try to scale the separated stems so that their sum is equal to the input mix
aL = separationL[0,:,0,:].cpu().numpy().T # separated stems
aR = separationR[0,:,0,:].cpu().numpy().T # separated stems
bL = mix_left[-1][0,0,:].cpu().numpy() # input mix
bR = mix_right[-1][0,0,:].cpu().numpy() # input mix
solL = np.linalg.lstsq(aL, bL, rcond=None)[0] # scaling coefficients that minimize the MSE
solR = np.linalg.lstsq(aR, bR, rcond=None)[0] # scaling coefficients that minimize the MSE
separationL = aL * solL # scale the separated stems
separationR = aR * solR # scale the separated stems
Finally, concatenate the left and right stems:
separation = np.concatenate((separationL, separationR), axis=1)
estimates = {
    'drums': separation[:, [0, 4]],
    'bass': separation[:, [1, 5]],
    'other': separation[:, [2, 6]],
    'vocals': separation[:, [3, 7]],
}
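The scaling and reassembly steps can be checked on synthetic data. A minimal sketch, assuming four stems per channel and a made-up gain error per stem (`rescale`, the random signals, and `gains` are illustrative names and values, not output of the network):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
names = ['drums', 'bass', 'other', 'vocals']

def rescale(stems, mix):
    """Scale each stem (columns of `stems`, shape [T, 4]) so that their sum best matches `mix`."""
    sol = np.linalg.lstsq(stems, mix, rcond=None)[0]  # one coefficient per stem
    return stems * sol  # broadcasting multiplies column i by sol[i]

# Synthetic "true" stems for each channel, shape [T, 4].
true_l = rng.standard_normal((T, 4))
true_r = rng.standard_normal((T, 4))
gains = np.array([0.5, 2.0, 1.5, 0.8])  # pretend the stem levels came out wrong

sep_l = rescale(true_l * gains, true_l.sum(axis=1))
sep_r = rescale(true_r * gains, true_r.sum(axis=1))

# Columns 0..3 hold the left stems and 4..7 the right stems,
# so stem i is the column pair [i, i + 4].
separation = np.concatenate((sep_l, sep_r), axis=1)
estimates = {name: separation[:, [i, i + 4]] for i, name in enumerate(names)}

print(estimates['vocals'].shape)       # (1000, 2)
print(np.allclose(sep_l, true_l))      # True: the gain error is undone
```

Because each channel is scaled independently, the left and right coefficients for the same stem can differ, which may shift the stereo balance of a stem slightly.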
Hey, I'm happy that it finally works for you :) Here's my gist for separating a stereo signal and resampling it back to the original sampling rate:
I have two questions after separating a bunch of tracks in both mono and stereo:
- The output stems are too loud. How can I deactivate the normalization of the audio stems? This normalization causes very hard clipping in the signal. It occurs in both stereo and mono separation and in all the tracks that I have tried.
- As FSharpCSharp says in another issue, I have also noticed that the signal is cut by about 12 dB above 10 kHz.
Are these problems caused by any parameter in the estimation code?
- This may be caused by the player, which normalizes every audio for some reason. Have you tried downloading the separated signal and playing it in a proper audio player?
- You're right, we're unsure what the cause of this phenomenon is. It seems to be an internal property of the neural network. Please let me know if you find out more about this.
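One way to test whether the clipping comes from the stems themselves rather than from the player is to look at the peak sample value before writing the file. A minimal sketch, assuming float audio where full scale is 1.0 (`peak_dbfs` and `limit_peak` are hypothetical helper names, the 0.99 ceiling is an arbitrary choice, and the sine is a synthetic stand-in for a stem):

```python
import numpy as np

def peak_dbfs(x):
    """Peak level in dB relative to full scale (1.0)."""
    peak = np.max(np.abs(x))
    return -np.inf if peak == 0 else 20 * np.log10(peak)

def limit_peak(x, ceiling=0.99):
    """Attenuate x only if its peak exceeds `ceiling`; never amplify."""
    peak = np.max(np.abs(x))
    return x if peak <= ceiling else x * (ceiling / peak)

# Illustrative stem whose peak lies above full scale.
t = np.arange(32000) / 32000
stem = 1.5 * np.sin(2 * np.pi * 440 * t)

safe = limit_peak(stem)
print(peak_dbfs(stem) > 0)   # True: this stem would clip in a fixed-point file
print(peak_dbfs(safe) <= 0)  # True: the attenuated copy stays within full scale
```

If the peaks of the real stems are already below 0 dBFS, the clipping is more likely introduced at playback than by the separation code.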
- I actually deleted the IPython and YouTube lines from the code because I don't use them. This is the code that calls separate_sample and writes the output files:
audio, rate =
print("separating... ", end='')
estimates = separate_sample(audio, rate)
print("done")
print("downloading audio files to the client side...")
for instrument in ['vocals', 'drums', 'bass', 'other']:
    separation = estimates[instrument]
    print(separation.shape)
    soundfile.write('WM1_5' + instrument + '.wav', separation, 32000)
After that I downloaded the files and opened them in Reaper. I lined up the original WAV file (White Man's World by Jason Isbell) with the Demucs vocal track and the MultiTasnet vocal track.
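The high-frequency loss can also be measured rather than judged by eye in a DAW. A minimal sketch, assuming mono float signals at a known sampling rate (`band_level_db` is a hypothetical helper, and the two test signals are synthetic, not actual separated stems):

```python
import numpy as np

def band_level_db(x, rate, f_lo, f_hi):
    """Energy of x between f_lo and f_hi (in Hz), in dB, from the power spectrum."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / rate)
    band = (freqs >= f_lo) & (freqs < f_hi)
    return 10 * np.log10(spectrum[band].sum() + 1e-20)

rate = 32000
t = np.arange(rate) / rate
low = np.sin(2 * np.pi * 1000 * t)           # 1 kHz component
high = np.sin(2 * np.pi * 12000 * t)         # 12 kHz component
reference = low + high
attenuated = low + high * 10 ** (-12 / 20)   # same signal, 12 kHz part 12 dB lower

drop = (band_level_db(reference, rate, 10000, 16000)
        - band_level_db(attenuated, rate, 10000, 16000))
print(round(drop, 1))  # 12.0
```

Applied to two vocal tracks loaded at the same sampling rate, the same difference gives the level gap above 10 kHz in dB. Note that at a 32000 Hz output rate nothing above 16 kHz can be represented at all.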
I see, that doesn't look good – could you share the exact code and .wav file that you use? Feel free to send it to my email address [email protected]
I ran the notebook again with your stereo separation code instead of my "Frankenstein" version and it works smoothly. This is your code:
with torch.no_grad():
    separation_left = network.inference(mix_left, n_chunks=8)[-1].cpu().squeeze_(2) # shape: (1, 4, T')
    separation_right = network.inference(mix_right, n_chunks=8)[-1].cpu().squeeze_(2) # shape: (1, 4, T')
separation = torch.cat([separation_left, separation_right], 0).numpy() # shape: (2, 4, T')
# resample every stem back to the original rate and trim it to the input length -> shape (T, 2)
estimates = {
    'drums': librosa.core.resample(separation[:, 0, :], orig_sr=32000, target_sr=rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
    'bass': librosa.core.resample(separation[:, 1, :], orig_sr=32000, target_sr=rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
    'other': librosa.core.resample(separation[:, 2, :], orig_sr=32000, target_sr=rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
    'vocals': librosa.core.resample(separation[:, 3, :], orig_sr=32000, target_sr=rate, res_type='kaiser_best', fix=True)[:, :audio.shape[1]].T,
}
a_l = np.array([estimates['drums'][:, 0], estimates['bass'][:, 0], estimates['other'][:, 0], estimates['vocals'][:, 0]]).T
a_r = np.array([estimates['drums'][:, 1], estimates['bass'][:, 1], estimates['other'][:, 1], estimates['vocals'][:, 1]]).T
b_l = audio[0, :]
b_r = audio[1, :]
and I was using the code:
with torch.no_grad():
    # call the network to obtain the separated audio with shape [1, 4, 1, T']
    separationL = network.inference(mix_left, n_chunks=8)[-1]
    separationR = network.inference(mix_right, n_chunks=8)[-1]
# Note: in the original code the chunks parameter is n_chunks=2
# normalize the amplitudes by computing the least squares
# -> we try to scale the separated stems so that their sum is equal to the input mix
aL = separationL[0,:,0,:].cpu().numpy().T # separated stems
aR = separationR[0,:,0,:].cpu().numpy().T # separated stems
bL = mix_left[-1][0,0,:].cpu().numpy() # input mix
bR = mix_right[-1][0,0,:].cpu().numpy() # input mix
Thank you for answering so quickly!
Haha, I'm glad it helped :)