
Standard captions format for diarization

Open furyus opened this issue 5 months ago • 2 comments

Whisper has been incredibly good at transcription and subtitles, but the diarization feature, which uses https://huggingface.co/pyannote/speaker-diarization-3.0, doesn't seem reliable yet. I've run a few tests and every one of them got the speaker identification wrong. For instance, where there's a clear change in speaker, both subtitles are labeled "SPEAKER_01" when they should be "SPEAKER_01" and then "SPEAKER_02."

But I'm not sure solving that is as important as getting the caption format right. Is there any chance of an option where Whisper simply adds a hyphen and a space at the start of each change in speaker? I believe this is the standard format for captions with more than one speaker.

To illustrate, currently it's trying to do something like this:

1
00:00:00,000 --> 00:00:03,480
SPEAKER_01|Hi, my name is Bob.

2
00:00:03,480 --> 00:00:07,839
SPEAKER_02|And my name is Jerry...

3
00:00:07,839 --> 00:00:08,839
SPEAKER_02|...and I'm happy to be here.

It would be fantastic if it did this instead:

1
00:00:00,000 --> 00:00:03,480
- Hi, my name is Bob.

2
00:00:03,480 --> 00:00:07,839
- And my name is Jerry...

3
00:00:07,839 --> 00:00:08,839
...and I'm happy to be here.
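
As a rough sketch of the conversion I have in mind, here is a small post-processing pass over the SRT text. It assumes the labels always look like `SPEAKER_01|` at the start of a caption line, as in the example above. The function name `speaker_labels_to_dashes` is made up for illustration and is not part of Whisper-WebUI.

```python
import re

# Assumed label format, taken from the example above: "SPEAKER_01|text".
SPEAKER_PREFIX = re.compile(r"^(SPEAKER_\d+)\|(.*)$")


def speaker_labels_to_dashes(srt_text: str) -> str:
    """Replace 'SPEAKER_XX|' prefixes with '- ' at each change in speaker.

    Hypothetical helper: it only rewrites lines that carry a speaker label
    and leaves cue numbers, timestamps, and blank lines untouched.
    """
    previous_speaker = None
    out_lines = []
    for line in srt_text.splitlines():
        match = SPEAKER_PREFIX.match(line)
        if match is None:
            # Cue number, timestamp, blank line, or unlabeled text.
            out_lines.append(line)
            continue
        speaker, text = match.groups()
        if speaker != previous_speaker:
            # New speaker (including the very first one): mark with "- ".
            out_lines.append(f"- {text}")
        else:
            # Same speaker continues: drop the label, no dash.
            out_lines.append(text)
        previous_speaker = speaker
    return "\n".join(out_lines)
```

Since this only looks at whether the label changed, it would still inherit any mistakes the diarization makes, but the output would at least follow the usual caption convention.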

furyus · Oct 03 '24 05:10