Wavesplit 2021

Open popcornell opened this issue 3 years ago • 13 comments

Should work now with oracle embeddings. I made a separate pull request because it is faster. See also the previous pull request from last year: https://github.com/asteroid-team/asteroid/pull/70. Many thanks to Neil (@lienz) again. Help from anyone is very welcome, as I am currently very GPU-constrained (and time-constrained too).

popcornell avatar Feb 24 '21 15:02 popcornell

I'll review after @JorisCos

mpariente avatar Feb 25 '21 08:02 mpariente

It would be cool if someone could try to run the training with the full system rather than oracle embeddings. You can wait to review until the full system has been trained and performance is decent.

popcornell avatar Feb 25 '21 11:02 popcornell

Just letting you know that I am currently working on the recipe to run some experiments. Hopefully, the results will be as expected and we will finally merge this 🚀

JorisCos avatar Oct 07 '21 09:10 JorisCos

@JorisCos Does that mean there's a more current version of this branch somewhere? Would be nice to be able to take a look if possible.

lminer avatar Oct 15 '21 17:10 lminer

It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends toward zero. Could you please share your results if anyone has tried the complete pipeline? @popcornell @JorisCos
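One quick way to sanity-check whether a jointly trained separation stack is just copying the mixture is to compute SI-SDR of each estimate against the mixture and the references. A minimal NumPy sketch (the signals below are synthetic placeholders, not part of this recipe):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR (dB) between two 1-D signals of equal length."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# If the "separated" output is essentially the mixture, its SI-SDR against
# either source stays near 0 dB for a 2-speaker mixture of similar levels.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(8000), rng.standard_normal(8000)
mix = s1 + s2
print(si_sdr(mix, s1))  # ~0 dB: estimate is just the mixture
print(si_sdr(s1, s1))   # very high: (near) perfect estimate
```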

wangshuo182 avatar Jun 05 '22 14:06 wangshuo182

That's very interesting to know! Unfortunately, all I have is here on GitHub. Maybe Joris has more up-to-date code.

Do you think the degradation is due to overfitting on the training speaker IDs? It may be. In the paper they use things like speaker dropout to mitigate that. WSJ2Mix is small in terms of speaker diversity after all, and for reasonable speaker ID extraction you usually need a lot of diversity, e.g. VoxCeleb.
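For illustration, one plausible reading of such a speaker-dropout-style augmentation is to randomly zero out or perturb the speaker embeddings during training so the separation stack cannot rely on memorized IDs. A rough PyTorch sketch (the function name, shapes, and exact perturbation are assumptions, not the paper's recipe):

```python
import torch

def speaker_embedding_dropout(spk_emb, p_drop=0.2, noise_std=0.1, training=True):
    """Perturb speaker embeddings of shape (batch, n_spk, emb_dim) during training.

    With probability p_drop a speaker's embedding is zeroed; the remaining ones get
    additive Gaussian noise. This is only one plausible reading of "speaker dropout",
    not the paper's exact recipe.
    """
    if not training:
        return spk_emb
    keep = (torch.rand(spk_emb.shape[:2], device=spk_emb.device) > p_drop)
    keep = keep.unsqueeze(-1).to(spk_emb.dtype)
    return keep * (spk_emb + noise_std * torch.randn_like(spk_emb))

# Example: emb = speaker_embedding_dropout(torch.randn(4, 2, 256))
```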

popcornell avatar Jun 05 '22 14:06 popcornell

I tried to run this implementation on a dataset with around 60000 speakers and the speaker stack loss never changed. Could there be a bug somewhere?

lminer avatar Jun 05 '22 15:06 lminer

> It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends toward zero. Could you please share your results if anyone has tried the complete pipeline? @popcornell @JorisCos

Hi, I also tried to run some experiments with Wavesplit (albeit in our own framework) in the past. I think a stagnating training of the speaker stack might result from two things:

  1. Missing shuffling of speaker IDs during training: The target speaker IDs need to be shuffled every time so that the model really needs the embeddings from the speaker stack to solve the permutation problem at the output. I haven't checked all of the code, but at least at first glance, I have not seen this shuffling of the target IDs (see the sketch after this list).
  2. The latent dimension of the speaker stack is too high: Without dynamic mixing (DM), the WSJ2mix dataset has only 101 speakers, while the latent dimension is set to 256 in the corresponding paper. So, without DM (which would yield 285 speakers from WSJ0+WSJ1), there is no information bottleneck that would force the model to generalize to unseen speakers. Even then, the generalization tricks that @popcornell mentioned (centroid dropout / mixup) should still be necessary for the model to really generalize to unseen speakers. Reducing the latent dimension to e.g. 64 and seeing whether it learns something then might work.
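
A minimal PyTorch sketch of the target shuffling described in point 1, assuming batches of shape (batch, n_src, time) for the sources and (batch, n_src) for the speaker IDs (names and shapes are assumptions about the recipe's batch format):

```python
import torch

def shuffle_speaker_targets(sources, spk_ids):
    """Apply the same random per-example permutation to target sources and speaker IDs.

    sources: (batch, n_src, time); spk_ids: (batch, n_src). After shuffling, the output
    order carries no information, so the model has to rely on the speaker embeddings to
    resolve the permutation at the output.
    """
    batch, n_src = spk_ids.shape
    perm = torch.argsort(torch.rand(batch, n_src, device=spk_ids.device), dim=1)
    shuffled_sources = torch.gather(sources, 1, perm.unsqueeze(-1).expand_as(sources))
    shuffled_ids = torch.gather(spk_ids, 1, perm)
    return shuffled_sources, shuffled_ids
```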

But even then, all of this should not prevent the model from at least overfitting to the training set.

TCord avatar Jun 08 '22 08:06 TCord

Do they use shuffling in the paper? It sounds like a very smart thing to do, but they don't seem to use it. There is no shuffling here, and it would be great to add, because it surely prevents the model from being lazy and memorizing the speakers. Your point on the dimension of the embedding is valid. I don't think dynamic mixing as they describe it in the paper also uses WSJ1. If it does, the results in the paper are not comparable with previous works, as it uses additional data (then better to throw in LibriMix, which has more speaker diversity!). If only WSJ0 is used, you always have 101 speakers, even with DM. This exacerbates the problem for sure.
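For reference, dynamic mixing in this context just means generating fresh mixtures on the fly from random single-speaker utterances with random gains. A rough sketch, assuming a dict of per-speaker waveform tensors (names, gain range, and segment handling are assumptions, not the paper's exact recipe):

```python
import random
import torch

def dynamic_mix(utts_by_speaker, seg_len, max_gain_db=2.5):
    """Build a fresh 2-speaker mixture on the fly.

    utts_by_speaker: dict mapping speaker ID -> list of 1-D waveform tensors,
    each at least seg_len samples long.
    """
    spk_a, spk_b = random.sample(list(utts_by_speaker), 2)  # two distinct speakers
    srcs = []
    for spk in (spk_a, spk_b):
        utt = random.choice(utts_by_speaker[spk])
        start = random.randint(0, utt.shape[-1] - seg_len)
        gain = 10 ** (random.uniform(-max_gain_db, max_gain_db) / 20)
        srcs.append(gain * utt[start:start + seg_len])
    sources = torch.stack(srcs)                      # (2, seg_len)
    return sources.sum(0), sources, (spk_a, spk_b)   # mixture, targets, speaker IDs
```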

Most of the code here is also from the first version of the paper, where there were not many augmentations on the speaker stack (no speaker dropout, for example; maybe only Gaussian noise?). I did not implement these augmentations. Maybe Neil Zeghidour still has some other hidden tricks to make the model generalize better. @TCord were you able to successfully replicate it to some decent degree?

popcornell avatar Jun 08 '22 13:06 popcornell

@lminer did you use voxceleb ?

popcornell avatar Jun 08 '22 13:06 popcornell

If I remember correctly, they also used label shuffling in the paper. In my experiments, I did not use the architecture as proposed in the paper, but a Conv-TasNet as the separation stack (i.e. I added an additional encoder/decoder layer) and reduced the total number of layers. Here, I was able to train the model, but it did not improve upon the performance of a Conv-TasNet. My conjecture was that sample-wise resolution, as in the paper, is necessary to obtain good results and provides the most significant improvements. I think the DPRNN paper also showed that choosing a very small window size and frame advance in the encoder further improves separation on anechoic data. By employing an additional speaker stack jointly with a Conv-TasNet, the permutation problem could be solved through the speaker stack, but it did not improve the separation performance over a sole Conv-TasNet. As training the full-size Wavesplit model without any additional encoder/decoder layer takes a massive amount of GPU memory, I dropped these experiments afterwards.
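To illustrate the memory point: with a strided Conv1d encoder, the number of frames the stacks operate on grows as the stride shrinks, and activation memory grows roughly with it. A small illustrative sketch (the encoder settings below are made up for illustration, not Wavesplit's actual configuration):

```python
import torch
import torch.nn as nn

wav = torch.randn(1, 1, 4 * 8000)  # 4 s of 8 kHz audio

# Typical Conv-TasNet-style encoder vs. a (near) sample-wise resolution encoder.
for kernel, stride in [(16, 8), (2, 1)]:
    enc = nn.Conv1d(1, 256, kernel_size=kernel, stride=stride)
    n_frames = enc(wav).shape[-1]
    print(f"kernel={kernel:2d} stride={stride}  frames={n_frames}")
# Activation memory in the separation/speaker stacks scales roughly with the number
# of frames, which is why running them at sample resolution (without extra
# encoder/decoder downsampling) quickly exhausts GPU memory.
```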

TCord avatar Jun 08 '22 13:06 TCord

I have observed the same, actually. Also, according to https://arxiv.org/abs/2202.00733, the use of speaker ID info does in fact not really help.

popcornell avatar Jun 08 '22 13:06 popcornell

@popcornell I used my own private dataset.

lminer avatar Jun 08 '22 17:06 lminer