Concurrent_Speakers_Counter icon indicating copy to clipboard operation
Concurrent_Speakers_Counter copied to clipboard

Estimate the number of concurrent speakers from single channel mixtures to crack the "cocktail-party” problem.

Concurrent Speakers Counter

Estimate the number of concurrent speakers from single channel mixtures to crack the "cocktail-party” problem which is based on a Bidirectional Long Short-Term Memory (BLSTM) which takes into account a past and future temporal context.

1. The model of the paper

Layer Layer Name Input Shape Output Shape
the First Layer BLSTM_1 (?, 500, 201) (?, 500, 60)
the Second Layer BLSTM_2 (?, 500, 60) (?, 500, 40)
the Third Layer BLSTM_3 (?, 500, 40) (?, 500, 80)
the Fourth Layer maxpooling1d (?, 500, 80) (?, 250, 80)
the Fifth Layer flatten (?, 250, 80) (?, 20000)
the Sixth Layer dense (?, 20000) (?, 11)
the Seventh Layer activation (?, 11) (?, 11)

"?" represents the number of samples.

2. My Model

3. Dependency Library

  • librosa
  • soundfile
  • Keras (my test version: 2.1.1)
  • Tensorflow (my test version: 1.4.0)
  • Anaconda3 (Contains Python3.5+)

4. Dataset

It is called LibriCount10 0dB Dataset.

  • contains a simulated cocktail party environment of [0..10] speakers
  • mixed with 0dB SNR
  • 5 seconds of recording
  • 16bits, 16kHz, mono
  • 11440 Samples, 832.5 MB

The annotation provides information about the speakers sex, their unique speaker_id, and vocal activity within the mixture recording in samples. The format of json file (3 speakers) is as follows:

[
    {
        "sex": "F",
        "activity": [[0, 51076], [51396, 55400], [56681, 80000]], 
        "speaker_id": 1221
    },
    {
        "sex": "F",
        "activity": [[0, 51877], [56201, 80000]],
        "speaker_id": 3570
    },
    {
        "sex": "M",
        "activity": [[0, 15681], [16161, 68213], [73498, 80000]], 
        "speaker_id": 5105
    }
]

5. Reference Paper

As we all know, it's pretty hard to solve the cocktail-party problem. This is the first study on data-driven speaker count estimation and the first step to crack the problem. Thanks for the author's paper[Paper 2] and code which help me a lot. Their homepage is AudioLabs Erlangen CountNet.

  • Paper 1: Simon Leglaive, Romain Hennequin and Roland Badeau. Singing voice detection with deep recurrent neural networks (ICASSP 2015).
  • Paper 2: Fabian-Robert Stöter, Soumitro Chakrabarty, Bernd Edler and Emanuël A. P. Habets. Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation (ICASSP2018).

6. Recommended links

6. Follow-up Work

I will work on speech separation for a long time. You can fork this repository if interested and pay close attention to my recent study.