EEND
Can't start training
I was testing the setup on mini_librispeech data. This is the log from when I started training:
# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train
# Started at Thu Dec 5 19:24:21 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7ffb7b99c610>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
Traceback (most recent call last):
File "../../../eend/bin/train.py", line 72, in <module>
train(args)
File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
gpuid = use_single_gpu()
File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 56, in use_single_gpu
cvd = get_free_gpus()[0]
File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 40, in get_free_gpus
del gpus[busid]
KeyError: ' 00000000:01:00.0'
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec 5 19:24:22 IST 2019, elapsed time 1 seconds
Can you suggest what's going wrong?
What kind of cluster environment are you using? You may need to change https://github.com/hitachi-speech/EEND/blob/master/egs/mini_librispeech/v1/cmd.sh according to your environment. Check https://kaldi-asr.org/doc/queue.html
@yubouf, I strongly recommend adding more documentation about cmd.sh, and also changing the default to run.pl.
I am using a conda environment on a local machine, so I have changed cmd.sh to use run.pl.
@008karan Thank you for testing EEND. Consider setting CUDA_VISIBLE_DEVICES. The GPU selection failure might come from the CUDA (nvidia-smi) version; I had not tested on CUDA 10.
@sw005320 Thank you for your suggestion. I will change the default to run.pl.
Oh, I see. Can you set CUDA_VISIBLE_DEVICES explicitly then?
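For reference, a rough sketch of picking a free GPU via nvidia-smi's CSV query interface instead of parsing the human-readable table. This is only an illustration, not the repository's get_free_gpus; note the explicit strip(), since the KeyError above shows a bus ID with a leading space.
# Illustrative sketch, not the repository's utils.py: choose the GPU with the
# least used memory and expose only that one via CUDA_VISIBLE_DEVICES.
import os
import subprocess

def pick_free_gpu():
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=index,memory.used',
         '--format=csv,noheader,nounits'], encoding='utf-8')
    gpus = []
    for line in out.strip().splitlines():
        index, mem_used = [field.strip() for field in line.split(',')]
        gpus.append((int(mem_used), index))
    return min(gpus)[1]  # index of the GPU with the least memory in use

if __name__ == '__main__':
    os.environ.setdefault('CUDA_VISIBLE_DEVICES', pick_free_gpu())
    print('CUDA_VISIBLE_DEVICES =', os.environ['CUDA_VISIBLE_DEVICES'])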
After exporting CUDA_VISIBLE_DEVICES=1, here is the log:
# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train
# Started at Thu Dec 5 20:00:36 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7f9d28248910>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
Traceback (most recent call last):
File "../../../eend/bin/train.py", line 72, in <module>
train(args)
File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
gpuid = use_single_gpu()
File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 64, in use_single_gpu
chainer.cuda.get_device_from_id(cvd).use()
File "cupy/cuda/device.pyx", line 135, in cupy.cuda.device.Device.use
File "cupy/cuda/device.pyx", line 141, in cupy.cuda.device.Device.use
File "cupy/cuda/runtime.pyx", line 193, in cupy.cuda.runtime.setDevice
File "cupy/cuda/runtime.pyx", line 145, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec 5 20:00:37 IST 2019, elapsed time 1 seconds
Can you try CUDA_VISIBLE_DEVICES=0?
Looks like training started but then stopped.
training model at exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train.
bash: line 1: 6217 Aborted (core dumped) ( train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train ) 2>> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log >> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log
log:
[
{
"main/loss": 0.8094631433486938,
"main/speech_scored": 429.4651162790698,
"main/speech_miss": 135.0,
"main/speech_falarm": 20.930232558139537,
"main/speaker_scored": 683.7209302325581,
"main/speaker_miss": 351.7906976744186,
"main/speaker_falarm": 55.25581395348837,
"main/speaker_error": 28.13953488372093,
"main/correct": 221.59302325581396,
"main/diarization_error": 435.1860465116279,
"main/frames": 453.25581395348837,
"validation/main/loss": 0.7502496242523193,
"validation/main/speech_scored": 377.26666666666665,
"validation/main/speech_miss": 97.96666666666667,
"validation/main/speech_falarm": 35.733333333333334,
"validation/main/speaker_scored": 545.8,
"validation/main/speaker_miss": 234.56666666666666,
"validation/main/speaker_falarm": 83.8,
"validation/main/speaker_error": 33.86666666666667,
"validation/main/correct": 224.55,
"validation/main/diarization_error": 352.23333333333335,
"validation/main/frames": 417.6,
"main/DER": 0.6364965986394557,
"validation/main/DER": 0.6453523879320875,
"main/SAD_MR": 0.3143445064168517,
"validation/main/SAD_MR": 0.2596748542145256,
"main/SAD_FR": 0.048735582390209566,
"validation/main/SAD_FR": 0.09471638098603995,
"main/MI": 0.5145238095238095,
"validation/main/MI": 0.42976670331012584,
"main/FA": 0.08081632653061224,
"validation/main/FA": 0.1535360938072554,
"main/CF": 0.04115646258503401,
"validation/main/CF": 0.06204959081470625,
"main/accuracy": 0.4888917393535146,
"validation/main/accuracy": 0.5377155172413793,
"epoch": 1,
"iteration": 43,
"elapsed_time": 107.64393779402599
},
{
"main/loss": 0.6841620802879333,
"main/speech_scored": 429.09302325581393,
"main/speech_miss": 59.44186046511628,
"main/speech_falarm": 22.41860465116279,
"main/speaker_scored": 699.4651162790698,
"main/speaker_miss": 238.53488372093022,
"main/speaker_falarm": 89.3953488372093,
"main/speaker_error": 21.53488372093023,
"main/correct": 267.8953488372093,
"main/diarization_error": 349.4651162790698,
"main/frames": 453.3953488372093,
"validation/main/loss": 0.6442975997924805,
"validation/main/speech_scored": 377.26666666666665,
"validation/main/speech_miss": 17.0,
"validation/main/speech_falarm": 40.2,
"validation/main/speaker_scored": 545.8,
"validation/main/speaker_miss": 92.33333333333333,
"validation/main/speaker_falarm": 159.63333333333333,
"validation/main/speaker_error": 20.066666666666666,
"validation/main/correct": 271.55,
"validation/main/diarization_error": 272.03333333333336,
"validation/main/frames": 417.6,
"main/DER": 0.4996176480367058,
"validation/main/DER": 0.4984121167704899,
"main/SAD_MR": 0.13852907701479594,
"validation/main/SAD_MR": 0.045060964834776465,
"main/SAD_FR": 0.05224649070511084,
"validation/main/SAD_FR": 0.10655592860929494,
"main/MI": 0.3410247032616285,
"validation/main/MI": 0.16917063637474045,
"main/FA": 0.1278052997306912,
"validation/main/FA": 0.2924758763893978,
"main/CF": 0.030787645044386077,
"validation/main/CF": 0.03676560400635154,
"main/accuracy": 0.5908647927780057,
"validation/main/accuracy": 0.6502634099616859,
"epoch": 2,
"iteration": 86,
"elapsed_time": 178.86651986418292
}
]
The losses of the two epochs look good, but I have no idea what caused the core dump.
What does exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log say?
Here it is:
# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train
# Started at Thu Dec 5 20:08:49 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7fd94ec1a850>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
GPU device 0 is used
Prepared model
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.607145 to fit
epoch main/loss validation/main/loss main/diarization_error_rate validation/main/diarization_error_rate elapsed_time
Tcl_AsyncDelete: async handler deleted by the wrong thread
# Accounting: time=187 threads=1
# Ended (code 134) at Thu Dec 5 20:11:56 IST 2019, elapsed time 187 seconds
Where are the hyperparameters of the model? Maybe reducing the batch size would help.
See the conf directory. conf/train.yaml has the hyperparameters.
Thanks for the help. I really appreciate your quick reply. @yubouf
After reducing the batch size, training completed with 29% DER. Now I need to test it on my custom data. I have some questions:
- Is this repo an implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives' or 'End-to-End Neural Speaker Diarization with Self-attention'?
- How do I do inference? I want to see how accurately it separates speakers if I pass audio with two speakers.
- I am getting confused by the directory structure of the repo. I need to test it on my custom data. I have collected some audio data with a single speaker in each file, and I don't have transcripts. From comments I found that there should be segments, reco2dur, wav.scp, utt2spk, and spk2utt files for training: segments means audio with a single speaker saying one utterance, reco2dur is the duration of that audio, wav.scp is the list of audio files, and utt2spk and spk2utt are for the mapping. In the repo these files were only in dev_clean_2, not in train_clean_2. Also, there is diarization_data with mixed audio; what is that for?
I think I am missing something. Can you shed some light on what the dataset format and structure should be for speaker diarization?
- Is this repo an implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives' or 'End-to-End Neural Speaker Diarization with Self-attention'?
Both. The latest network configuration is based on 'End-to-End Neural Speaker Diarization with Self-attention'.
- How do I do inference? I want to see how accurately it separates speakers if I pass audio with two speakers.
"mini_librispeech" model is prepared just for the code integration tests, not related to the papers. It's better to train a model in the "callhome" recipe. But it requires huge data and training time is needed.
I'm afraid the current code is not intended for inference-only use. For inference, see below:
https://github.com/hitachi-speech/EEND/blob/9a0f211ce7e377eaea242490c3d7ec0f6adab8af/egs/mini_librispeech/v1/run.sh#L106-L117
data/simu/data/dev_clean_2_ns2_beta2_500 is the Kaldi-style data directory for inference.
- I am getting confused by the directory structure of the repo. I need to test it on my custom data. I have collected some audio data with a single speaker in each file, and I don't have transcripts. From comments I found that there should be segments, reco2dur, wav.scp, utt2spk, and spk2utt files for training: segments means audio with a single speaker saying one utterance, reco2dur is the duration of that audio, wav.scp is the list of audio files, and utt2spk and spk2utt are for the mapping. In the repo these files were only in dev_clean_2, not in train_clean_2. Also, there is diarization_data with mixed audio; what is that for?
train_clean_2 and dev_clean_2 are not the actual training and test data for our model; they are the mini_librispeech dataset. Our training and test data are generated by simulation:
Training: data/simu/data/train_clean_5_ns2_beta2_500
Test: data/simu/data/dev_clean_2_ns2_beta2_500
- OK, so the training data should contain call recordings of two people; that's what you simulated, right? Can you tell me how much data is needed and how long training takes? Also, it is independent of which person is speaking, right?
- I would like to try both of the papers you have published. Where can I find the implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives'? I am assuming both take the same data input.
- I have gone through data/simu/data/train_clean_5_ns2_beta2_500. As there is no documentation, I am not sure what is in the following files. In rttm:
SPEAKER data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066 1 2.08 15.75 <NA> <NA> 1088-134315 <NA>
In the segments file, for example:
1088-134315_data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066_0000208_0001782 data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066 2.08325 17.82825
In spk2utt: as per my understanding, the audio mixture generated by 1088 and 134315 is audio number 66, 208, 1782:
1088-134315 1088-134315_data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066_0000208_0001782
The same goes for utt2spk. And lastly, wav.scp has the mapping between recording ID and file path:
data_simu_wav_train_clean_5_ns2_beta2_500_100_mix_0000496 /home/gamut/Downloads/EEND/egs/mini_librispeech/v1/data/simu/wav/train_clean_5_ns2_beta2_500/100/mix_0000496.wav
Please elaborate on where I am wrong and what is actually in those files. As of now, I have audio recordings with two speakers in each file. So do I need to label each speaker in them and generate the mappings shown in all the files above?
Thanks!
Explanation of Kaldi's data directory: https://kaldi-asr.org/doc/data_prep.html RTTM: https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf
To know how we generate the simulated training data, see run_prepare_shared.sh together with our paper, particularly Algorithm 1.
Training time was not described in the papers; it depends on the computing environment. In our experiments, training on 100,000 mixtures (generated with beta=2) for 100 epochs took 4-6 days.
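As a rough illustration of the idea in Algorithm 1 (not the actual run_prepare_shared.sh, which also adds noise and room impulse responses), each speaker's utterances are concatenated with pauses drawn from an exponential distribution with mean beta, and the per-speaker tracks are then summed:
# Toy sketch of the simulation idea: per-speaker concatenation with random
# pauses (mean length beta seconds), then summation of the two tracks.
# Utterances here are synthetic arrays; the real recipe mixes real audio.
import numpy as np

def simulate_mixture(utts_per_spk, beta=2.0, sr=8000, seed=0):
    rng = np.random.default_rng(seed)
    tracks = []
    for utts in utts_per_spk:                      # one list of waveforms per speaker
        pieces = []
        for u in utts:
            pieces.append(np.zeros(int(rng.exponential(beta) * sr)))  # silence
            pieces.append(u)                                          # utterance
        tracks.append(np.concatenate(pieces))
    mix = np.zeros(max(len(t) for t in tracks))
    for t in tracks:
        mix[:len(t)] += t                          # overlap arises naturally
    return mix

rng = np.random.default_rng(1)
spk1 = [rng.standard_normal(2 * 8000) for _ in range(2)]
spk2 = [rng.standard_normal(3 * 8000) for _ in range(2)]
print(len(simulate_mixture([spk1, spk2])) / 8000.0, 'seconds')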
I already have audio recordings, so there is no need to simulate, but do I need to get the transcript?
I already have audio recordings, so there is no need to simulate, but do I need to get the transcript?
No. You don’t have to prepare the text file.
Thanks for the links. I have some doubts here. In rttm:
SPEAKER data_simu_wav_train_clean_5_ns2_beta2_500_100_mix_0000500 1 2.82 4.27 <NA> <NA> 1867-154075 <NA>
Are tbeg (2.82) and tdur (4.27) randomly generated here? I couldn't hear a difference in the mixed audio file. The same goes for what I found in the segments file. Are the segments you are passing randomly generated?
Lastly, for spk2utt and utt2spk, which require <utterance-id> <speaker-id>: how do I get these, given that I start from audio recordings?
Cheers!
Yes, the training data is the simulated two-speaker mixture of mini_librispeech utterances with randomly chosen silence intervals. segments and rttm reflect the random simulation result.
Each mini_librispeech utterance can be long, containing several sentences, so the mixture may sound strange. But again, this is just intended for the integration test.
Our actual recipe related to the paper is the "callhome" recipe.
Suppose you already have your two-speaker mixtures as training data: audio recordings (rec1.wav, rec2.wav, ...) and a segmentation of the two speakers for each recording. You should prepare the files below.
wav.scp: the list of <recording> <file>, like
rec1 rec1.wav
rec2 rec2.wav
...
segments: the list of <utterance> <recording> <start_time> <end_time>, like
rec1_Alice_001 rec1 2.0 4.5
rec1_Bob_001 rec1 4.3 8.0
rec1_Alice_002 rec1 10.0 11.5
rec2_Charlie_001 rec2 3.3 4.4
rec2_Charlie_002 rec2 5.5 6.0
rec2_Daisy_001 rec2 7.0 7.5
utt2spk: the list of <utterance> <speaker>, like
rec1_Alice_001 Alice
rec1_Alice_002 Alice
rec1_Bob_001 Bob
rec2_Charlie_001 Charlie
rec2_Charlie_002 Charlie
rec2_Daisy_001 Daisy
...
Then, you can generate spk2utt, reco2dur, and rttm using Kaldi tools:
rttm from steps/segmentation/convert_utt2spk_and_segments_to_rttm.py
reco2dur from utils/data/get_reco2dur.sh
spk2utt from utils/utt2spk_to_spk2utt.pl
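If Kaldi is not at hand, the utt2spk-to-spk2utt conversion is easy to mimic; a minimal Python sketch follows (the Kaldi script above remains the canonical tool):
# Group utterance IDs under their speaker ID: utt2spk -> spk2utt.
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_path, spk2utt_path):
    spk2utt = defaultdict(list)
    with open(utt2spk_path) as f:
        for line in f:
            utt, spk = line.split()
            spk2utt[spk].append(utt)
    with open(spk2utt_path, 'w') as f:
        for spk in sorted(spk2utt):
            f.write('{} {}\n'.format(spk, ' '.join(sorted(spk2utt[spk]))))

# usage: utt2spk_to_spk2utt('data/train/utt2spk', 'data/train/spk2utt')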
Hi, I got all the files and started training, but nothing is happening. There is nothing inside data.data.train except cg.dot and cg.png. Train log:
# train.py -c conf/train.yaml data data exp/diarize/model/data.data.train
# Started at Fri Dec 20 12:27:28 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=16, config=[<yamlargparse.Path object at 0x7fa60e12d550>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/data.data.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data')
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843 chunks
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843 chunks
GPU device 0 is used
Prepared model
The log indicates that train.py is still on hold. If the mini_librispeech recipe worked for you, the difference might be your data preparation.
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
I have no idea about those lines.
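If data preparation is the suspect, Kaldi's utils/validate_data_dir.sh is the standard check; a hand-rolled sketch of the same kind of sanity check (illustrative only, not a repository utility) could look like this:
# Check that every utterance in segments appears in utt2spk, every recording
# referenced by segments appears in wav.scp, and durations are positive.
import os
import sys

def rows(path):
    with open(path) as f:
        return [line.split() for line in f if line.strip()]

def check_data_dir(d):
    recs = {r[0] for r in rows(os.path.join(d, 'wav.scp'))}
    utts = {r[0] for r in rows(os.path.join(d, 'utt2spk'))}
    problems = 0
    for utt, rec, start, end in rows(os.path.join(d, 'segments')):
        if utt not in utts:
            print('utterance missing from utt2spk:', utt); problems += 1
        if rec not in recs:
            print('recording missing from wav.scp:', rec); problems += 1
        if float(end) <= float(start):
            print('non-positive segment duration:', utt); problems += 1
    print('problems found:', problems)

if __name__ == '__main__':
    check_data_dir(sys.argv[1])   # e.g. python check_data_dir.py data/train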
Everything before trainer.run() prints out:
trainer.extend(extensions.dump_graph('main/loss', out_name="cg.dot"))
print('###########################5')
trainer.run()
print('Finished!')
Can you suggest how to debug this further?
When you interrupt the program by Ctrl+C, you will find the stack trace and possible cause of the stop.
I'm afraid it's hard to find the problem because it might be related to the preparation of your data. If you could share the data with me, I could run it for debugging, but I don't want to risk receiving sensitive speech data you own. If you try our code with other publicly available data, we can probably solve your issue.
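Besides interrupting with Ctrl+C, Python's faulthandler module can dump the stack of a hung process without killing it; a small sketch of that debugging aid (not part of the EEND code):
# Dump all thread stack traces on SIGUSR1, plus a periodic watchdog dump,
# so a training run that appears stuck can be inspected in place.
import faulthandler
import signal
import time

faulthandler.register(signal.SIGUSR1)                 # trigger with: kill -USR1 <pid>
faulthandler.dump_traceback_later(600, repeat=True)   # also dump every 10 minutes

def main():
    time.sleep(3600)   # placeholder for trainer.run()

if __name__ == '__main__':
    main()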
I am getting these results. Can you help me with inference?
"main/loss": 0.5038370490074158,
"main/speech_scored": 242.6723163841808,
"main/speech_miss": 149.28248587570621,
"main/speech_falarm": 22.28813559322034,
"main/speaker_scored": 242.6723163841808,
"main/speaker_miss": 149.28248587570621,
"main/speaker_falarm": 22.51412429378531,
"main/speaker_error": 18.163841807909606,
"main/correct": 346.4124293785311,
"main/diarization_error": 189.96045197740114,
"main/frames": 450.47457627118644,
"validation/main/loss": 0.4737112522125244,
"validation/main/speech_scored": 286.44943820224717,
"validation/main/speech_miss": 86.78651685393258,
"validation/main/speech_falarm": 40.17977528089887,
"validation/main/speaker_scored": 286.44943820224717,
"validation/main/speaker_miss": 86.78651685393258,
"validation/main/speaker_falarm": 42.40449438202247,
"validation/main/speaker_error": 40.95505617977528,
"validation/main/correct": 348.02247191011236,
"validation/main/diarization_error": 170.14606741573033,
"validation/main/frames": 453.5730337078652,
"main/DER": 0.7827858356808605,
"validation/main/DER": 0.5939828979367695,
"main/SAD_MR": 0.6151607571066049,
"validation/main/SAD_MR": 0.3029732486075155,
"main/SAD_FR": 0.09184457430214421,
"validation/main/SAD_FR": 0.14026829842315838,
"main/MI": 0.6151607571066049,
"validation/main/MI": 0.3029732486075155,
"main/FA": 0.09277582473866784,
"validation/main/FA": 0.1480348317251118,
"main/CF": 0.07484925383558774,
"validation/main/CF": 0.14297481760414218,
"main/accuracy": 0.7689944064012842,
"validation/main/accuracy": 0.7672909235037653,
"epoch": 10,
"iteration": 1777,
"elapsed_time": 903.8602520569693
Copying my earlier comment:
- How do I do inference? I want to see how accurately it separates speakers if I pass audio with two speakers.
The "mini_librispeech" model is prepared just for code integration tests and is not related to the papers. It's better to train a model with the "callhome" recipe, but that requires a huge amount of data and a long training time.
I'm afraid the current code is not intended for inference-only use. For inference, see below:
https://github.com/hitachi-speech/EEND/blob/9a0f211ce7e377eaea242490c3d7ec0f6adab8af/egs/mini_librispeech/v1/run.sh#L106-L117
data/simu/data/dev_clean_2_ns2_beta2_500 is the Kaldi-style data directory for inference.
"main/DER": 0.7827858356808605,
means the performance is very poor.
"iteration": 1777,
indicates that your training data size is too small.
OK, can you suggest how many hours of data are needed to build a good speaker diarization system? Also, can we do this without timestamps? As you know, getting audio with accurate timestamps is a difficult task. Thanks.
We didn't use manual timestamps for the simulated mixtures or the two-channel recordings. In both cases we have single-speaker recordings, so we can get timestamps via a speech activity detection system. In our papers, I suggested using a simulated training set of 100k recordings, sampled from large-scale telephone recordings which have separate channels. The "callhome" recipe is good for general diarization tasks, but only for two-speaker recordings. Although we observed that the real training set was better than the simulated dataset, we believe that better large-scale simulation with a better model architecture can outperform the smaller real training set.
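As a toy illustration of getting timestamps from a single-speaker recording (the papers use proper SAD systems, so treat this only as a sketch of the idea), a simple energy threshold already yields rough speech segments:
# Minimal energy-based speech activity detection on a single-speaker signal.
import numpy as np

def energy_sad(signal, sr=8000, frame=0.025, hop=0.010, threshold_db=-40.0):
    flen, hlen = int(frame * sr), int(hop * sr)
    n_frames = max(0, 1 + (len(signal) - flen) // hlen)
    energies = np.array([
        10 * np.log10(np.mean(signal[i * hlen:i * hlen + flen] ** 2) + 1e-12)
        for i in range(n_frames)])
    active = energies > threshold_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop, i * hop))
            start = None
    if start is not None:
        segments.append((start * hop, n_frames * hop))
    return segments   # list of (start_sec, end_sec)

# toy check on a synthetic signal: 1 s silence, 1 s "speech", 1 s silence
sr = 8000
sig = np.concatenate([np.zeros(sr), 0.1 * np.random.randn(sr), np.zeros(sr)])
print(energy_sad(sig, sr))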
Explanation of Kaldi's data directory: https://kaldi-asr.org/doc/data_prep.html RTTM: https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf
To know how we generate the simulated training data, see run_prepare_shared.sh together with our paper, particularly Algorithm 1. Training time was not described in the papers; it depends on the computing environment. In our experiments, training on 100,000 mixtures (generated with beta=2) for 100 epochs took 4-6 days.
@yubouf Could you please reveal the GPU you used, so I can roughly estimate the training time in my case?