EEND
Can't start training
I was testing the setup on mini_librispeech data. This is the log from when I started training:
# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train
# Started at Thu Dec 5 19:24:21 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7ffb7b99c610>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
Traceback (most recent call last):
File "../../../eend/bin/train.py", line 72, in <module>
train(args)
File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
gpuid = use_single_gpu()
File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 56, in use_single_gpu
cvd = get_free_gpus()[0]
File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 40, in get_free_gpus
del gpus[busid]
KeyError: ' 00000000:01:00.0'
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec 5 19:24:22 IST 2019, elapsed time 1 seconds
Can you suggest what's going wrong?
What kind of cluster environment are you using? You may need to change https://github.com/hitachi-speech/EEND/blob/master/egs/mini_librispeech/v1/cmd.sh according to your environment. Check https://kaldi-asr.org/doc/queue.html
@yubouf, I strongly recommend adding more documentation about cmd.sh, and also changing the default to run.pl.
I am using a conda environment on a local machine, so I have changed cmd.sh to use run.pl.
@008karan Thank you for testing EEND. Consider setting CUDA_VISIBLE_DEVICES. The GPU selection failure might come from the CUDA (nvidia-smi) version; I had not tested on CUDA 10.
@sw005320 Thank you for your suggestion. I will change the default to run.pl.
Oh, I see. Can you set CUDA_VISIBLE_DEVICES explicitly then?
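For reference, a rough sketch of picking a free GPU via nvidia-smi's CSV query interface instead of parsing the human-readable table. This is only an illustration, not the repository's get_free_gpus; note the explicit strip(), since the KeyError above shows a bus ID with a leading space.
# Illustrative sketch, not the repository's utils.py: choose the GPU with the
# least used memory and expose only that one via CUDA_VISIBLE_DEVICES.
import os
import subprocess

def pick_free_gpu():
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=index,memory.used',
         '--format=csv,noheader,nounits'], encoding='utf-8')
    gpus = []
    for line in out.strip().splitlines():
        index, mem_used = [field.strip() for field in line.split(',')]
        gpus.append((int(mem_used), index))
    return min(gpus)[1]  # index of the GPU with the least memory in use

if __name__ == '__main__':
    os.environ.setdefault('CUDA_VISIBLE_DEVICES', pick_free_gpu())
    print('CUDA_VISIBLE_DEVICES =', os.environ['CUDA_VISIBLE_DEVICES'])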
After exporting CUDA_VISIBLE_DEVICES=1, here is the log:
# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train
# Started at Thu Dec 5 20:00:36 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7f9d28248910>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
Traceback (most recent call last):
File "../../../eend/bin/train.py", line 72, in <module>
train(args)
File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
gpuid = use_single_gpu()
File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 64, in use_single_gpu
chainer.cuda.get_device_from_id(cvd).use()
File "cupy/cuda/device.pyx", line 135, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 141, in cupy.cuda.device.Device.use
File "cupy/cuda/runtime.pyx", line 193, in cupy.cuda.runtime.setDevice
File "cupy/cuda/runtime.pyx", line 145, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec 5 20:00:37 IST 2019, elapsed time 1 seconds
Can you try CUDA_VISIBLE_DEVICES=0?
Looks like training started but then stopped.
training model at exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train.
bash: line 1: 6217 Aborted (core dumped) ( train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train ) 2>> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log >> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log
log:
[
{
"main/loss": 0.8094631433486938,
"main/speech_scored": 429.4651162790698,
"main/speech_miss": 135.0,
"main/speech_falarm": 20.930232558139537,
"main/speaker_scored": 683.7209302325581,
"main/speaker_miss": 351.7906976744186,
"main/speaker_falarm": 55.25581395348837,
"main/speaker_error": 28.13953488372093,
"main/correct": 221.59302325581396,
"main/diarization_error": 435.1860465116279,
"main/frames": 453.25581395348837,
"validation/main/loss": 0.7502496242523193,
"validation/main/speech_scored": 377.26666666666665,
"validation/main/speech_miss": 97.96666666666667,
"validation/main/speech_falarm": 35.733333333333334,
"validation/main/speaker_scored": 545.8,
"validation/main/speaker_miss": 234.56666666666666,
"validation/main/speaker_falarm": 83.8,
"validation/main/speaker_error": 33.86666666666667,
"validation/main/correct": 224.55,
"validation/main/diarization_error": 352.23333333333335,
"validation/main/frames": 417.6,
"main/DER": 0.6364965986394557,
"validation/main/DER": 0.6453523879320875,
"main/SAD_MR": 0.3143445064168517,
"validation/main/SAD_MR": 0.2596748542145256,
"main/SAD_FR": 0.048735582390209566,
"validation/main/SAD_FR": 0.09471638098603995,
"main/MI": 0.5145238095238095,
"validation/main/MI": 0.42976670331012584,
"main/FA": 0.08081632653061224,
"validation/main/FA": 0.1535360938072554,
"main/CF": 0.04115646258503401,
"validation/main/CF": 0.06204959081470625,
"main/accuracy": 0.4888917393535146,
"validation/main/accuracy": 0.5377155172413793,
"epoch": 1,
"iteration": 43,
"elapsed_time": 107.64393779402599
},
{
"main/loss": 0.6841620802879333,
"main/speech_scored": 429.09302325581393,
"main/speech_miss": 59.44186046511628,
"main/speech_falarm": 22.41860465116279,
"main/speaker_scored": 699.4651162790698,
"main/speaker_miss": 238.53488372093022,
"main/speaker_falarm": 89.3953488372093,
"main/speaker_error": 21.53488372093023,
"main/correct": 267.8953488372093,
"main/diarization_error": 349.4651162790698,
"main/frames": 453.3953488372093,
"validation/main/loss": 0.6442975997924805,
"validation/main/speech_scored": 377.26666666666665,
"validation/main/speech_miss": 17.0,
"validation/main/speech_falarm": 40.2,
"validation/main/speaker_scored": 545.8,
"validation/main/speaker_miss": 92.33333333333333,
"validation/main/speaker_falarm": 159.63333333333333,
"validation/main/speaker_error": 20.066666666666666,
"validation/main/correct": 271.55,
"validation/main/diarization_error": 272.03333333333336,
"validation/main/frames": 417.6,
"main/DER": 0.4996176480367058,
"validation/main/DER": 0.4984121167704899,
"main/SAD_MR": 0.13852907701479594,
"validation/main/SAD_MR": 0.045060964834776465,
"main/SAD_FR": 0.05224649070511084,
"validation/main/SAD_FR": 0.10655592860929494,
"main/MI": 0.3410247032616285,
"validation/main/MI": 0.16917063637474045,
"main/FA": 0.1278052997306912,
"validation/main/FA": 0.2924758763893978,
"main/CF": 0.030787645044386077,
"validation/main/CF": 0.03676560400635154,
"main/accuracy": 0.5908647927780057,
"validation/main/accuracy": 0.6502634099616859,
"epoch": 2,
"iteration": 86,
"elapsed_time": 178.86651986418292
}
]
The losses of the two epochs look good, but I have no idea what caused the core dump.
What does exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log say?
Here it is:
# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train
# Started at Thu Dec 5 20:08:49 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7fd94ec1a850>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
GPU device 0 is used
Prepared model
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.607145 to fit
epoch main/loss validation/main/loss main/diarization_error_rate validation/main/diarization_error_rate elapsed_time
Tcl_AsyncDelete: async handler deleted by the wrong thread
# Accounting: time=187 threads=1
# Ended (code 134) at Thu Dec 5 20:11:56 IST 2019, elapsed time 187 seconds
Where are the hyperparameters of the model? Maybe reducing the batch size would help.
See the conf directory. conf/train.yaml has the hyperparameters.
Thanks for the help. I really appreciate your quick reply. @yubouf
After reducing the batch size, training completed with 29% DER. Now I need to test it on my custom data. I have some questions:
- Is this repo an implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives' or 'End-to-End Neural Speaker Diarization with Self-attention'?
- How do I do inference? I want to see how accurately it separates speakers if I pass audio with two speakers.
- I am getting confused by the directory structure of the repo. I need to test it on my custom data. I have collected some audio data with a single speaker in each file, and I don't have transcripts. From comments I found that there should be segments, reco2dur, wav.scp, utt2spk, and spk2utt files for training: segments means audio with a single speaker saying one utterance, reco2dur is the duration of that audio, wav.scp is the list of audio files, and utt2spk and spk2utt are for the mapping. In the repo these files were only in dev_clean_2, not in train_clean_2. Also, there is diarization_data with mixed audio; what is that for?
I think I am missing something. Can you shed some light on what the dataset format and structure should be for speaker diarization?
- Is this repo an implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives' or 'End-to-End Neural Speaker Diarization with Self-attention'?
Both. The latest network configuration is based on 'End-to-End Neural Speaker Diarization with Self-attention'.
- How do I do inference? I want to see how accurately it separates speakers if I pass audio with two speakers.
"mini_librispeech" model is prepared just for the code integration tests, not related to the papers. It's better to train a model in the "callhome" recipe. But it requires huge data and training time is needed.
I'm afraid the current code is not intended for inference-only use. For inference, see below:
https://github.com/hitachi-speech/EEND/blob/9a0f211ce7e377eaea242490c3d7ec0f6adab8af/egs/mini_librispeech/v1/run.sh#L106-L117
data/simu/data/dev_clean_2_ns2_beta2_500 is the Kaldi-style data directory for inference.
- I am getting confused by the directory structure of the repo. I need to test it on my custom data. I have collected some audio data with a single speaker in each file, and I don't have transcripts. From comments I found that there should be segments, reco2dur, wav.scp, utt2spk, and spk2utt files for training: segments means audio with a single speaker saying one utterance, reco2dur is the duration of that audio, wav.scp is the list of audio files, and utt2spk and spk2utt are for the mapping. In the repo these files were only in dev_clean_2, not in train_clean_2. Also, there is diarization_data with mixed audio; what is that for?
train_clean_2 and dev_clean_2 are not the actual training and test data for our model; they are the mini_librispeech dataset. Our training and test data are generated by simulation:
Training: data/simu/data/train_clean_5_ns2_beta2_500
Test: data/simu/data/dev_clean_2_ns2_beta2_500
- OK, so the training data should contain call recordings of two people; that's what you simulated, right? Can you tell me how much data is needed and how long training takes? Also, it is independent of which person is speaking, right?
- I would like to try both of the papers you have published. Where can I find the implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives'? I am assuming both take the same data input.
- I have gone through data/simu/data/train_clean_5_ns2_beta2_500. As there is no documentation, I am not sure what is in the following files. In rttm:
SPEAKER data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066 1 2.08 15.75 <NA> <NA> 1088-134315 <NA>
In the segments file, for example:
1088-134315_data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066_0000208_0001782 data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066 2.08325 17.82825
In spk2utt: as per my understanding, the audio mixture generated by 1088 and 134315 is audio number 66, 208, 1782:
1088-134315 1088-134315_data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066_0000208_0001782
The same goes for utt2spk. And lastly, wav.scp has the mapping between recording ID and file path:
data_simu_wav_train_clean_5_ns2_beta2_500_100_mix_0000496 /home/gamut/Downloads/EEND/egs/mini_librispeech/v1/data/simu/wav/train_clean_5_ns2_beta2_500/100/mix_0000496.wav
Please elaborate on where I am wrong and what is actually in those files. As of now, I have audio recordings with two speakers in each file. So do I need to label each speaker in them and generate the mappings shown in all the files above?
Thanks!
Explanation of Kaldi's data directory: https://kaldi-asr.org/doc/data_prep.html RTTM: https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf
To know how we generate the simulated training data, see run_prepare_shared.sh together with our paper, particularly Algorithm 1.
Training time was not described in the papers; it depends on the computing environment. In our experiments, training on 100,000 mixtures (generated with beta=2) for 100 epochs took 4-6 days.
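As a rough illustration of the idea in Algorithm 1 (not the actual run_prepare_shared.sh, which also adds noise and room impulse responses), each speaker's utterances are concatenated with pauses drawn from an exponential distribution with mean beta, and the per-speaker tracks are then summed:
# Toy sketch of the simulation idea: per-speaker concatenation with random
# pauses (mean length beta seconds), then summation of the two tracks.
# Utterances here are synthetic arrays; the real recipe mixes real audio.
import numpy as np

def simulate_mixture(utts_per_spk, beta=2.0, sr=8000, seed=0):
    rng = np.random.default_rng(seed)
    tracks = []
    for utts in utts_per_spk:                      # one list of waveforms per speaker
        pieces = []
        for u in utts:
            pieces.append(np.zeros(int(rng.exponential(beta) * sr)))  # silence
            pieces.append(u)                                          # utterance
        tracks.append(np.concatenate(pieces))
    mix = np.zeros(max(len(t) for t in tracks))
    for t in tracks:
        mix[:len(t)] += t                          # overlap arises naturally
    return mix

rng = np.random.default_rng(1)
spk1 = [rng.standard_normal(2 * 8000) for _ in range(2)]
spk2 = [rng.standard_normal(3 * 8000) for _ in range(2)]
print(len(simulate_mixture([spk1, spk2])) / 8000.0, 'seconds')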
I already have audio recordings, so there is no need to simulate, but do I need to get the transcript?
I already have audio recordings, so there is no need to simulate, but do I need to get the transcript?
No. You don’t have to prepare the text file.
Thanks for the links. I have some doubts here. In rttm:
SPEAKER data_simu_wav_train_clean_5_ns2_beta2_500_100_mix_0000500 1 2.82 4.27 <NA> <NA> 1867-154075 <NA>
Are tbeg (2.82) and tdur (4.27) randomly generated here? I couldn't hear a difference in the mixed audio file. The same goes for what I found in the segments file. Are the segments you are passing randomly generated?
Lastly, for spk2utt and utt2spk, which require <utterance-id> <speaker-id>: how do I get these, given that I start from audio recordings?
Cheers!
Yes, the training data is the simulated two-speaker mixture of mini_librispeech utterances with randomly chosen silence intervals. segments and rttm reflect the random simulation result.
Each mini_librispeech utterance can be long, containing several sentences, so the mixture may sound strange. But again, this is just intended for the integration test.
Our actual recipe related to the paper is the "callhome" recipe.
Suppose you already have your two-speaker mixtures as training data: audio recordings (rec1.wav, rec2.wav, ...) and a segmentation of the two speakers for each recording. You should prepare the files below.
wav.scp: the list of <recording> <file>, like
rec1 rec1.wav
rec2 rec2.wav
...
segments: the list of <utterance> <recording> <start_time> <end_time>, like
rec1_Alice_001 rec1 2.0 4.5
rec1_Bob_001 rec1 4.3 8.0
rec1_Alice_002 rec1 10.0 11.5
rec2_Charlie_001 rec2 3.3 4.4
rec2_Charlie_002 rec2 5.5 6.0
rec2_Daisy_001 rec2 7.0 7.5
utt2spk: the list of <utterance> <speaker>, like
rec1_Alice_001 Alice
rec1_Alice_002 Alice
rec1_Bob_001 Bob
rec2_Charlie_001 Charlie
rec2_Charlie_002 Charlie
rec2_Daisy_001 Daisy
...
Then, you can generate spk2utt, reco2dur, and rttm using Kaldi tools:
rttm from steps/segmentation/convert_utt2spk_and_segments_to_rttm.py
reco2dur from utils/data/get_reco2dur.sh
spk2utt from utils/utt2spk_to_spk2utt.pl
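If Kaldi is not at hand, the utt2spk-to-spk2utt conversion is easy to mimic; a minimal Python sketch follows (the Kaldi script above remains the canonical tool):
# Group utterance IDs under their speaker ID: utt2spk -> spk2utt.
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_path, spk2utt_path):
    spk2utt = defaultdict(list)
    with open(utt2spk_path) as f:
        for line in f:
            utt, spk = line.split()
            spk2utt[spk].append(utt)
    with open(spk2utt_path, 'w') as f:
        for spk in sorted(spk2utt):
            f.write('{} {}\n'.format(spk, ' '.join(sorted(spk2utt[spk]))))

# usage: utt2spk_to_spk2utt('data/train/utt2spk', 'data/train/spk2utt')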
Hi, I got all the files and started training, but nothing is happening. There is nothing inside data.data.train except cg.dot and cg.png. Train log:
# train.py -c conf/train.yaml data data exp/diarize/model/data.data.train
# Started at Fri Dec 20 12:27:28 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=16, config=[<yamlargparse.Path object at 0x7fa60e12d550>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/data.data.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data')
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843 chunks
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843 chunks
GPU device 0 is used
Prepared model
The log indicates that train.py is still on hold. If the mini_librispeech recipe worked for you, the difference might be your data preparation.
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
I have no idea about those lines.
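If data preparation is the suspect, Kaldi's utils/validate_data_dir.sh is the standard check; a hand-rolled sketch of the same kind of sanity check (illustrative only, not a repository utility) could look like this:
# Check that every utterance in segments appears in utt2spk, every recording
# referenced by segments appears in wav.scp, and durations are positive.
import os
import sys

def rows(path):
    with open(path) as f:
        return [line.split() for line in f if line.strip()]

def check_data_dir(d):
    recs = {r[0] for r in rows(os.path.join(d, 'wav.scp'))}
    utts = {r[0] for r in rows(os.path.join(d, 'utt2spk'))}
    problems = 0
    for utt, rec, start, end in rows(os.path.join(d, 'segments')):
        if utt not in utts:
            print('utterance missing from utt2spk:', utt); problems += 1
        if rec not in recs:
            print('recording missing from wav.scp:', rec); problems += 1
        if float(end) <= float(start):
            print('non-positive segment duration:', utt); problems += 1
    print('problems found:', problems)

if __name__ == '__main__':
    check_data_dir(sys.argv[1])   # e.g. python check_data_dir.py data/train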
Everything before trainer.run() prints out:
trainer.extend(extensions.dump_graph('main/loss', out_name="cg.dot"))
print('###########################5')
trainer.run()
print('Finished!')
Can you suggest how to debug this further?
When you interrupt the program by Ctrl+C, you will find the stack trace and possible cause of the stop.
I'm afraid it's hard to find the problem because it might be related to the preparation of your data. If you could share the data with me, I could run it for debugging, but I don't want to risk receiving sensitive speech data you own. If you try our code with other publicly available data, we can probably solve your issue.
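Besides interrupting with Ctrl+C, Python's faulthandler module can dump the stack of a hung process without killing it; a small sketch of that debugging aid (not part of the EEND code):
# Dump all thread stack traces on SIGUSR1, plus a periodic watchdog dump,
# so a training run that appears stuck can be inspected in place.
import faulthandler
import signal
import time

faulthandler.register(signal.SIGUSR1)                 # trigger with: kill -USR1 <pid>
faulthandler.dump_traceback_later(600, repeat=True)   # also dump every 10 minutes

def main():
    time.sleep(3600)   # placeholder for trainer.run()

if __name__ == '__main__':
    main()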
I am getting these results. Can you help me with inference?
"main/loss": 0.5038370490074158,
"main/speech_scored": 242.6723163841808,
"main/speech_miss": 149.28248587570621,
"main/speech_falarm": 22.28813559322034,
"main/speaker_scored": 242.6723163841808,
"main/speaker_miss": 149.28248587570621,
"main/speaker_falarm": 22.51412429378531,
"main/speaker_error": 18.163841807909606,
"main/correct": 346.4124293785311,
"main/diarization_error": 189.96045197740114,
"main/frames": 450.47457627118644,
"validation/main/loss": 0.4737112522125244,
"validation/main/speech_scored": 286.44943820224717,
"validation/main/speech_miss": 86.78651685393258,
"validation/main/speech_falarm": 40.17977528089887,
"validation/main/speaker_scored": 286.44943820224717,
"validation/main/speaker_miss": 86.78651685393258,
"validation/main/speaker_falarm": 42.40449438202247,
"validation/main/speaker_error": 40.95505617977528,
"validation/main/correct": 348.02247191011236,
"validation/main/diarization_error": 170.14606741573033,
"validation/main/frames": 453.5730337078652,
"main/DER": 0.7827858356808605,
"validation/main/DER": 0.5939828979367695,
"main/SAD_MR": 0.6151607571066049,
"validation/main/SAD_MR": 0.3029732486075155,
"main/SAD_FR": 0.09184457430214421,
"validation/main/SAD_FR": 0.14026829842315838,
"main/MI": 0.6151607571066049,
"validation/main/MI": 0.3029732486075155,
"main/FA": 0.09277582473866784,
"validation/main/FA": 0.1480348317251118,
"main/CF": 0.07484925383558774,
"validation/main/CF": 0.14297481760414218,
"main/accuracy": 0.7689944064012842,
"validation/main/accuracy": 0.7672909235037653,
"epoch": 10,
"iteration": 1777,
"elapsed_time": 903.8602520569693
Copying my earlier comment:
- How do I do inference? I want to see how accurately it separates speakers if I pass audio with two speakers.
The "mini_librispeech" model is prepared just for code integration tests and is not related to the papers. It's better to train a model with the "callhome" recipe, but that requires a huge amount of data and a long training time.
I'm afraid the current code is not intended for inference-only use. For inference, see below:
https://github.com/hitachi-speech/EEND/blob/9a0f211ce7e377eaea242490c3d7ec0f6adab8af/egs/mini_librispeech/v1/run.sh#L106-L117
data/simu/data/dev_clean_2_ns2_beta2_500 is the Kaldi-style data directory for inference.
"main/DER": 0.7827858356808605,
means the performance is very poor.
"iteration": 1777,
indicates that your training data size is too small.
OK, can you suggest how many hours of data are needed to build a good speaker diarization system? Also, can we do this without timestamps? As you know, getting audio with accurate timestamps is a difficult task. Thanks.
We didn't use manual timestamps for the simulated mixtures or the two-channel recordings. In both cases we have single-speaker recordings, so we can get timestamps via a speech activity detection system. In our papers, I suggested using a simulated training set of 100k recordings, sampled from large-scale telephone recordings which have separate channels. The "callhome" recipe is good for general diarization tasks, but only for two-speaker recordings. Although we observed that the real training set was better than the simulated dataset, we believe that better large-scale simulation with a better model architecture can outperform the smaller real training set.
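As a toy illustration of getting timestamps from a single-speaker recording (the papers use proper SAD systems, so treat this only as a sketch of the idea), a simple energy threshold already yields rough speech segments:
# Minimal energy-based speech activity detection on a single-speaker signal.
import numpy as np

def energy_sad(signal, sr=8000, frame=0.025, hop=0.010, threshold_db=-40.0):
    flen, hlen = int(frame * sr), int(hop * sr)
    n_frames = max(0, 1 + (len(signal) - flen) // hlen)
    energies = np.array([
        10 * np.log10(np.mean(signal[i * hlen:i * hlen + flen] ** 2) + 1e-12)
        for i in range(n_frames)])
    active = energies > threshold_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop, i * hop))
            start = None
    if start is not None:
        segments.append((start * hop, n_frames * hop))
    return segments   # list of (start_sec, end_sec)

# toy check on a synthetic signal: 1 s silence, 1 s "speech", 1 s silence
sr = 8000
sig = np.concatenate([np.zeros(sr), 0.1 * np.random.randn(sr), np.zeros(sr)])
print(energy_sad(sig, sr))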
Explanation of Kaldi's data directory: https://kaldi-asr.org/doc/data_prep.html RTTM: https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf
To know how we generate the simulated training data, see run_prepare_shared.sh together with our paper, particularly Algorithm 1. Training time was not described in the papers; it depends on the computing environment. In our experiments, training on 100,000 mixtures (generated with beta=2) for 100 epochs took 4-6 days.
@yubouf Could you please reveal the GPU you used, so I can roughly estimate the training time in my case?