kaldi-offline-transcriber Multi-core, multi-threading

An 8-core machine could plow through diarization faster if it were parallelized. What's the biggest complication stopping us from having that?

Sep 27 '17 11:09 lkraav

By far the most time-consuming part of speaker diarization is the last step -- NCLR clustering. I don't know if this algorithm is easily parallelizable or not.

However, current speaker recognition models do not depend heavily on perfectly accurate speaker diarization, so you could actually omit NCLR clustering (and gender identification), and use show.spl.seg instead of show.seg as the diarization result. This would save you about 80% of the time.

Sep 27 '17 14:09 alumae

@alumae thanks. Related to multi-threading ability, I'm also seeing crashes at a later stage:

[982571.380092] nnet3-latgen-fa[9579]: segfault at 0 ip 00007f3469bba1ab sp 00007f34367fbb60 error 6 in libopenblas_openmp_haswellp-r0.2.20.so[7f346996a000+3f5000]

Googling suggests that OpenBLAS may have trouble with multithreading (at least with OpenMP enabled, which I have). Have you seen segfaults like this at any point in the process? I'm testing speech2text.sh with OMP_NUM_THREADS=1, but I'm not very hopeful that it will help. I should probably test with a small sample audio file, too.

Sep 28 '17 09:09 lkraav

No, I haven't seen this. I usually use Intel's MKL rather than OpenBLAS, but of course that might not be possible for you.

Note that you can use parallel decoding (instead of multithreaded) if you set e.g. njobs=4 in Makefile.options, but I think you could then run into problems if your audio file has fewer than njobs speakers.

Sep 28 '17 09:09 alumae

Actually, it's probably OK to use parallel decoding even with fewer than njobs speakers. But if you have fewer than njobs utterances (segments), it could fail.

Sep 28 '17 09:09 alumae
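
One way to guard against the failure described above is to cap the job count at the number of segments before decoding. The following is only a sketch: the helper name `capped_njobs` is hypothetical and not part of kaldi-offline-transcriber, and it assumes a segment file with one segment per line in which blank lines and lines starting with `;;` are not segments.

```shell
# Hypothetical helper (not part of kaldi-offline-transcriber):
# print the smaller of the requested job count and the number of segments.
# Assumption: one segment per line; blank lines and ";;" comment lines
# are not counted as segments.
capped_njobs() {
    requested=$1
    segfile=$2
    # grep -c prints the count even when it is 0 (and then exits non-zero),
    # so "|| true" keeps this safe under "set -e".
    nsegs=$(grep -c -v -e '^;;' -e '^[[:space:]]*$' "$segfile" || true)
    if [ "$nsegs" -lt 1 ]; then
        echo 1
    elif [ "$nsegs" -lt "$requested" ]; then
        echo "$nsegs"
    else
        echo "$requested"
    fi
}
```

The printed value could then be used as njobs in place of a fixed number. How it is passed to the pipeline depends on your setup and is not shown here.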

It seems that removing --nthreads and setting OMP_NUM_THREADS=1 worked. I will now transcribe another file, changing only one of these variables at a time. Perhaps it was --nthreads 8 all along.

Sep 28 '17 10:09 lkraav
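
The OMP_NUM_THREADS=1 part of this workaround can be applied to a single command rather than to the whole shell session. A minimal sketch, assuming a POSIX shell with the standard `env` utility; the helper name `run_single_threaded` is hypothetical and not part of kaldi-offline-transcriber, and the arguments to speech2text.sh are whatever you normally pass.

```shell
# Hypothetical helper (not part of kaldi-offline-transcriber):
# run any command with OpenMP limited to one thread, for that command only.
run_single_threaded() {
    env OMP_NUM_THREADS=1 "$@"
}

# Usage sketch: pass speech2text.sh its usual arguments unchanged.
# run_single_threaded ./speech2text.sh <your usual arguments>
```

This covers only the environment variable. Removing --nthreads, the other change reported above, is a separate step and is not handled by this sketch.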
