deep-clustering Model correctly separates sources at any given time but mixes up speakers over utterance

As an experiment, I built a training set with two speakers, one male and the other female, ~2 hrs of speech each, split into 3-second utterances. My goal was to fit the training set, without concern for generalization yet.

After a few hours of training, I mixed two samples from the training set and ran audio_test.py.

The result cleanly separates the sources (no voice overlap) but both speakers can be sequentially heard in each output file.

For instance, if speaker 1 says "the coming elections will be a battle" and speaker 2 says "the weather is glorious in Casablanca", then I get something like:

output file 1: "the coming elections ... is glorious in Casablanca" output file 2: "the weather ... will be a battle".

Any idea how to get only one speaker per output file?

Feb 17 '18 19:02 ericbolo

The model has no module to detect the number of speakers in a mixture, and it also needs another module to decide the order in which the separated chunks of frames are matched and concatenated. Both steps are currently done manually, and neither is described in the original paper.
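To illustrate what such a matching step might look like, here is a minimal sketch that relabels the clusters of each chunk so that cluster i refers to the same speaker throughout. It is not what audio_test.py does. It assumes that each chunk has already been clustered, that one centroid per speaker is available in embedding space, and that a given speaker's centroid stays close from one chunk to the next. The function name `align_chunk_labels` and its arguments are hypothetical.

```python
import itertools

import numpy as np


def align_chunk_labels(centroids_per_chunk, labels_per_chunk):
    """Relabel per-chunk clusters so that label i means the same speaker in every chunk.

    centroids_per_chunk: list of (n_speakers, emb_dim) arrays, one per chunk
    labels_per_chunk:    list of integer arrays of cluster ids, one per chunk
    (Hypothetical inputs: assumed to come from an independent clustering of each chunk.)
    """
    reference = np.asarray(centroids_per_chunk[0], dtype=float)
    n_speakers = reference.shape[0]
    aligned = [np.asarray(labels_per_chunk[0])]

    for centroids, labels in zip(centroids_per_chunk[1:], labels_per_chunk[1:]):
        centroids = np.asarray(centroids, dtype=float)
        labels = np.asarray(labels)

        # Exhaustive search over permutations; cheap for 2 or 3 speakers.
        # perm[ref_id] is the chunk cluster assigned to reference speaker ref_id.
        best_perm = min(
            itertools.permutations(range(n_speakers)),
            key=lambda perm: sum(
                np.linalg.norm(reference[ref_id] - centroids[chunk_id])
                for ref_id, chunk_id in enumerate(perm)
            ),
        )

        # Invert the permutation: chunk cluster id -> reference speaker id.
        mapping = np.empty(n_speakers, dtype=int)
        mapping[list(best_perm)] = np.arange(n_speakers)
        aligned.append(mapping[labels])

        # Track slow drift by using the reordered centroids as the new reference.
        reference = centroids[list(best_perm)]

    return aligned
```

A sketch like this would break down when one speaker is silent for a whole chunk, since the centroid for that speaker is then meaningless; the assumption of stable centroids across chunks would also need to be checked on real embeddings.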

Feb 18 '18 00:02 zhr1201

Thank you for your answer. I now understand the reason behind the "oracle" lists in audio_test.py.

To avoid setting them manually I simply increased FRAMES_PER_SAMPLE to 1000 to capture the entire utterance. The results (on training set data, different gender) are good.
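For anyone trying the same workaround, a rough way to pick the value is to compute how many analysis frames are needed to span an utterance of a given length. The sketch below assumes a standard framed STFT without padding; the helper name `frames_to_cover` is hypothetical, and the sampling rate, frame size and hop size in the example are illustrative assumptions, not values read from this repository's configuration.

```python
import math


def frames_to_cover(duration_s, sample_rate, frame_size, hop_size):
    """Smallest number of frames whose span covers `duration_s` seconds of audio.

    Assumes frames of `frame_size` samples taken every `hop_size` samples,
    with no padding at the edges.
    """
    n_samples = int(round(duration_s * sample_rate))
    if n_samples <= frame_size:
        return 1
    return 1 + math.ceil((n_samples - frame_size) / hop_size)


# Illustrative values only (assumed, not taken from the repo's constants).
n_frames = frames_to_cover(duration_s=3.0, sample_rate=8000, frame_size=256, hop_size=64)
print(n_frames)
```

Setting FRAMES_PER_SAMPLE at or above the number computed for the longest utterance means each file is clustered in a single chunk, so no chunk matching is needed.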

I understand that in a realistic context it is impractical to increase the frames per sample to span the entire audio file.

This paper presents a related method which seems to do away with the post-clustering step (https://arxiv.org/pdf/1707.03634.pdf). Haven't read it in detail though...

Feb 18 '18 10:02 ericbolo

Thanks for the reference. There is a repo implementing that method: https://github.com/khaotik/DaNet-Tensorflow. However, it has some bugs and cannot yet produce reasonable separation, which I am currently working on. It would be great if we could work on that together!

Feb 18 '18 12:02 zhr1201

Great! I'll give that repo a closer look and read the paper carefully, then get back to you.


Feb 20 '18 13:02 ericbolo
