whisper.cpp unexpected output with word level timestamps & dual channel (stereo)

Open eschmidbauer opened this issue 2 years ago • 2 comments

Hi, thank you for recently adding stereo (dual-channel) support. It works great with utterance-level timestamps, but when I use -ml 1 to get word-level timestamps, I get some unexpected output.

Please see the attached gist for the output.

Here is the WAV file I used (converted to 16 kHz with ffmpeg: ffmpeg -i commercial_stereo.wav -ar 16000 commercial_stereo_16k.wav)

Nov 28 '22 16:11 eschmidbauer

Hi, can you clarify which part is unexpected? Is it that the diarization is worse than it is without -ml 1? It can definitely be improved; I just didn't have any data to test with. Using this file, I can try to make it a bit better.

Nov 28 '22 16:11 ggerganov

Sorry, I forgot to include that detail. If you look at lines 62, 63, 96, and 97 of the gist, the speaker is not identified.

Nov 28 '22 16:11 eschmidbauer
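
For readers wondering how a word-level segment could end up without a speaker: one plausible way to label speakers in a dual-channel recording is to compare the energy of the two channels over each segment's time span. The Python sketch below is only a hypothetical illustration of that idea, not whisper.cpp's actual code. The function name estimate_speaker, the margin of 1.1, and the "?" label for an undecided segment are all assumptions. With very short segments, such as those produced by -ml 1, the two energies might be too close to separate, which could leave a segment unlabeled.

```python
# Hypothetical sketch: pick a speaker by comparing per-channel energy.
# This is NOT whisper.cpp's implementation; all names and the 1.1
# margin are assumptions made for illustration.

def estimate_speaker(left, right, t0, t1, sample_rate=16000, margin=1.1):
    """Return "0", "1", or "?" for the segment [t0, t1), given in seconds.

    left, right: sequences of float samples, one per stereo channel.
    """
    # Convert the segment boundaries from seconds to sample indices.
    i0 = max(0, int(t0 * sample_rate))
    i1 = min(len(left), len(right), int(t1 * sample_rate))

    # Sum of absolute amplitudes as a simple energy measure.
    e_left = sum(abs(x) for x in left[i0:i1])
    e_right = sum(abs(x) for x in right[i0:i1])

    # Only decide when one channel is clearly louder than the other.
    if e_left > margin * e_right:
        return "0"
    if e_right > margin * e_left:
        return "1"
    return "?"  # energies too close (or empty span): speaker unknown
```

A caller would pass the two de-interleaved channels and each segment's start and end time. A wider margin gives fewer wrong labels but more undecided segments.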
