whisper.cpp
whisper : mark speakers/voices (diarization)
Hi,
I'm not familiar with the details of whisper or whisper.cpp, and I don't know whether it is even possible with the underlying model, but it would be nice if speakers, or at least speaker/voice changes, could be marked.
This would be very handy when processing interviews, radio/tv shows, films, etc.
Kind regards, abelbabel
I think this is a very hard task to do with good quality. I would recommend using another model for this, but from my research of the field there is currently no really good open-source solution. You can check out pyannote; some people have already combined it with whisper: https://github.com/Majdoddin/nlp
yeah, also saw this
https://github.com/openai/whisper/discussions/264
Seems as if they do it in two runs: one for the spoken text, one for the speakers, and then merge the results.
Personally, I'd be more than happy for whisper to just do speaker detection based on left & right channels on a stereo audio file. But I can achieve this by just running it twice.
@jaybinks This can be added very easily as a built-in option. A naive algorithm would be: for each transcribed segment, measure the signal energy in the two channels during that segment's time interval and predict the speaker based on which one is larger.
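To make that idea concrete, here is a minimal standalone sketch of such a channel-energy heuristic. It is illustrative only, not the code that was later added to whisper.cpp; the 10% margin and all names are made up for the example.

```cpp
// Toy demo: per-segment speaker prediction from stereo channel energy.
#include <cmath>
#include <cstdio>
#include <vector>

// Returns 0 or 1 for a clear winner, -1 when the channels are too close to call
// (e.g. both speakers talking, or silence).
static int predict_speaker(const std::vector<float> & ch0,
                           const std::vector<float> & ch1,
                           size_t i_start, size_t i_end) {
    double e0 = 0.0, e1 = 0.0;
    for (size_t i = i_start; i < i_end && i < ch0.size() && i < ch1.size(); ++i) {
        e0 += ch0[i]*ch0[i];
        e1 += ch1[i]*ch1[i];
    }
    if (e0 > 1.1*e1) return 0; // left channel clearly louder
    if (e1 > 1.1*e0) return 1; // right channel clearly louder
    return -1;                 // ambiguous
}

int main() {
    // Fake 1 s of 16 kHz audio: a tone on channel 0 first, then on channel 1.
    std::vector<float> ch0(16000, 0.0f), ch1(16000, 0.0f);
    for (size_t i = 0;    i <  8000; ++i) ch0[i] = std::sin(0.05f*i);
    for (size_t i = 8000; i < 16000; ++i) ch1[i] = std::sin(0.05f*i);

    std::printf("segment 1 -> speaker %d\n", predict_speaker(ch0, ch1, 0, 8000));
    std::printf("segment 2 -> speaker %d\n", predict_speaker(ch0, ch1, 8000, 16000));
    return 0;
}
```

In a real integration the segment's start/end timestamps would be converted to sample indices first, but the decision rule stays the same.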
One option would be to use pyannote.audio to diarize first and then run whisper on each recognized section @abelbabel
@jaybinks
Added support for stereo-channel diarization - add the `--diarize` argument to `main`.
Not sure if it works, because I don't have any data to test with
Personally, I'd be more than happy for whisper to just do speaker detection based on left & right channels on a stereo audio file. But I can achieve this by just running it twice.
Does this approach assume that there are only two speakers and that each speaker is well separated on its own channel? From my point of view, that is a special case which only applies to special recordings made in an audio studio. Or am I wrong?
This absolutely is a special case, but it's also simple to implement and allows the problem to be broken up.
I'm lucky that in my scenario, I have a separate mic per speaker in the conversation so it's perfectly isolated.
I've done some limited testing and was able to achieve a reasonable split via pyannote. Bolting it all together is a different story though.
Interestingly, in a mono-channel recording with two speakers, where the first speaker says three words and the second speaker repeats those three words, the transcript result is just three words, stretched over the time of both speakers as though a kind of DTW were in operation. Sigh, WAV is an unsupported attachment type, so mp4: https://user-images.githubusercontent.com/2199766/206061513-9afff328-ef22-40a8-9d80-727e65cf6dbc.mp4
WEBVTT
00:00:00.000 --> 00:00:04.000 No ifs ands or
00:00:04.000 --> 00:00:08.000 buts.
The above doesn't use --diarize, of course.
@chris-english
I tried running the original PyTorch implementation with and without beam search, and sometimes it gets the second phrase but sometimes it does not, so I think it is a limitation of the model (or the decoding strategy) and not of whisper.cpp:
Results with OpenAI Whisper
12:04:18 $ time whisper --model base.en --best_of None --beam_size None ~/Downloads/repit_12.wav
[00:00.000 --> 00:08.000] No ifs ands or buts.
real 0m1.713s
user 0m4.271s
sys 0m0.527s
12:04:23 $ time whisper --model base.en ~/Downloads/repit_12.wav
[00:00.000 --> 00:05.000] No ifs ands or buts.
[00:05.000 --> 00:07.000] No ifs ands or buts.
[00:07.000 --> 00:34.000] Okay.
real 0m3.834s
user 0m8.992s
sys 0m3.402s
12:04:32 $ time whisper --model medium.en --best_of None --beam_size None ~/Downloads/repit_12.wav
[00:00.000 --> 00:08.000] No ifs, ands or buts.
real 0m8.247s
user 0m15.943s
sys 0m2.499s
12:04:56 $ time whisper --model medium.en --beam_size None ~/Downloads/repit_12.wav
[00:00.000 --> 00:08.000] No ifs, ands or buts.
real 0m8.280s
user 0m14.941s
sys 0m3.509s
12:05:17 $ time whisper --model medium.en ~/Downloads/repit_12.wav
[00:00.000 --> 00:08.000] No ifs, ands or buts.
real 0m18.790s
user 0m44.693s
sys 0m16.823s
12:05:39 $ time whisper --model large ~/Downloads/repit_12.wav
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:08.000] No ifs, ands or buts.
I'm so sorry this took ages for me to test for you... but the detection seems to work PERFECTLY!
Sorry, I can't comment on the output file formats for multi-speaker (srt, vtt etc.) as I don't know these file formats.
I'm assuming that the speaker is available in the segment callback?
Great to hear! Btw, a failure case has been identified earlier when multiple speakers end up in the same segment: https://github.com/ggerganov/whisper.cpp/issues/216#issuecomment-1335660925
Overall, this is a pretty basic approach and probably not worth investing too much time in. I have some ideas for a more general speaker-detection approach at the audio-embedding level, but I'm not sure I'll get to that anytime soon. Will see.
I've done some limited testing and was able to achieve a reasonable split via pyannote. Bolting it all together is a different story though.
@savchenko Could you give a small how-to on how you used pyannote? By the way: does pyannote require a GPU, or can it be used CPU-only like whisper.cpp?
In my testing pyannote.audio is extremely slow on CPU. Very interested if anyone finds a way to make it work.
@abelbabel: https://gist.github.com/savchenko/f009a01bba39e8cd5c7f53267071130a
@ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this.
Is there a way to show the speaker information in the JSON format?
I am not into the technical specifics, just a user of an AI transcription tool that uses this library. For me it would be perfect if the system could detect different speakers and just label the lines where a new speaker starts, similar to the time stamps. Fingers crossed that it will work sometime soon :-)
Hi @ggerganov (and other maintainers of this awesome project!) - you might be interested in an early prototype that covers @SpusellaLo's use case over at https://github.com/akashmjn/tinydiarize
This was designed with ease of integration into whisper.cpp in mind: the model structure is exactly the same, inference requires no extra dependencies (beyond the original repo), and it adds marginal extra runtime cost.
It can be run as `whisper --model small.en-tdrz AUDIO`; the only change is the `small.en-tdrz` model instead of `small.en`.
Let me know what you think!
Note that this is an early prototype, so while it has quite decent quality, there are still some rough edges. However it should be functionally complete enough to start testing an integration.
@akashmjn
Exciting to see this!
Let me know if there is anything I can help with, for example adding whisper.cpp integration or testing.
this is great! model weights seem to be available here: https://sharedstorage7190.blob.core.windows.net/tinydiarize/whisper/models/53dfb0a7f5393bd3612173f84cad3fa2b347a3106b53c116628ead31641e9a53/small.en-tdrz.pt
Exciting to hear back so soon! 🥳
I'm going to be travelling for the next couple of days, so I will take a closer look after I'm back on Monday and hit you up as I run into things.
For reference, inference code changes are here https://github.com/akashmjn/tinydiarize/pull/4 (minor edits to tokenizer and suppressed tokens during decoding).
@akashmjn Great work!! I converted small.en-tdrz.pt to ggml using the whisper.cpp Python script. I used the newly generated ggml model with whisper.cpp via the -m option, but it doesn't seem to work. Maybe there is something else that I am missing besides converting it to ggml?
Thanks for the effort @pratikmohanty. The `small.en-tdrz` checkpoint has the same structure, so it should convert and decode as normal. However, to surface `<|speakerturn|>` tokens, edits to the inference code are required so they can be appropriately decoded and rendered.
Here's a high-level implementation plan:
- configurable remap of the unused `vocab.solm` token (which has been repurposed for speaker turns) https://github.com/ggerganov/whisper.cpp/blob/57543c169e27312e7546d07ed0d8c6eb806ebc36/whisper.cpp#L382
- update all places where this token is suppressed and add another rule to the timestamp logit filtering https://github.com/akashmjn/tinydiarize/pull/11 https://github.com/ggerganov/whisper.cpp/blob/57543c169e27312e7546d07ed0d8c6eb806ebc36/whisper.cpp#L3548
- update rendering of token ids to text as appropriate https://github.com/ggerganov/whisper.cpp/blob/57543c169e27312e7546d07ed0d8c6eb806ebc36/whisper.cpp#L4539-L4542 (a rough sketch of this step is shown below)
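To illustrate just that last step, here is a tiny self-contained sketch of the idea, assuming the repurposed solm id is surfaced as a speaker-turn marker while other special tokens are still skipped. This is not the actual whisper.cpp patch; the structs, names and ids below are made up for the example.

```cpp
// Toy demo: flag a speaker turn when the repurposed <|speakerturn|> id shows up
// while converting decoded token ids to segment text.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct toy_vocab {
    int32_t token_eot  = 50256; // first special-token id (illustrative values)
    int32_t token_solm = 50361; // repurposed as <|speakerturn|> by tinydiarize
    std::map<int32_t, std::string> id_to_token;
};

struct toy_segment {
    std::string text;
    bool speaker_turn_next = false;
};

static toy_segment render_tokens(const toy_vocab & vocab, const std::vector<int32_t> & ids) {
    toy_segment seg;
    for (int32_t id : ids) {
        if (id == vocab.token_solm) { // speaker-turn marker: set the flag, keep decoding
            seg.speaker_turn_next = true;
            continue;
        }
        if (id >= vocab.token_eot) {  // other special tokens are skipped as before
            continue;
        }
        seg.text += vocab.id_to_token.at(id);
    }
    return seg;
}

int main() {
    toy_vocab vocab;
    vocab.id_to_token = { {1, " no ifs"}, {2, " ands or buts"} };

    const toy_segment seg = render_tokens(vocab, {1, 2, vocab.token_solm});
    std::printf("%s%s\n", seg.text.c_str(), seg.speaker_turn_next ? " [SPEAKER_TURN]" : "");
    return 0;
}
```

In whisper.cpp the check would of course live inside the existing decoding/rendering path, with the flag carried through to the segment output.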
I'm wrapping up some things on my original repo after which I'll have a draft PR open shortly.
In the meantime @ggerganov - how does this sound? Feel free to add any other code pointers in case there's something I've missed!
@akashmjn that looks amazing! Can't wait to see how this performs!
For anyone keen to give it a spin, I have an early hack over at https://github.com/akashmjn/whisper.cpp/tree/tdrz-hack-1
make
./models/download-ggml-model.sh small.en-tdrz
make samples
./main -m models/ggml-small.en-tdrz.bin -f samples/a13.wav
After running the above, you should see this:
(tried to pick a sample in keeping with the historical vibe of the others 😉)
Will open a PR after some cleanup. In the meantime if you have any suggestions - feel free to drop comments directly on the branch!
Awesome stuff! Looked at the branch - seems super clean
@ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this.
Is there a way to show the speaker information in the JSON format?
:+1:, it would be great if the speaker details were present in the JSON output. Currently it's hard to make use of them.
@ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this. Is there a way to show the speaker information in the JSON format?
:+1:, it would be great if the speaker details were present in the JSON output. Currently it's hard to make use of them.
I assume you are referring to the previous comment, pertaining to the `--diarize` flag that currently preserves speaker/channel tags when processing a stereo audio file? If so, I believe it was fixed recently in https://github.com/ggerganov/whisper.cpp/pull/1031.
For tinydiarize (which handles a mono audio file) I'm implementing something similar so that speaker turns are marked in the output file. I'm adding a field to each JSON segment as below.
Example
{
"timestamps": {
"from": "00:00:00,000",
"to": "00:00:03,820"
},
"offsets": {
"from": 0,
"to": 3820
},
"text": " Then these neural nets take on pretty surprising magical",
"speaker_turn_next": true
},
For the rest of the output types (txt/vtt/srt/lrc/wts/csv) it will only be present in the text transcription, as you saw in the Apollo example above. Hope that works.
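For anyone who wants to consume that field once it lands, here is a rough reader-side sketch that splits the transcript into speaker blocks. It assumes nlohmann/json (an external library, not something whisper.cpp provides for consumers) and that the segments sit under a top-level "transcription" array as in the existing JSON output; treat it as illustrative rather than final.

```cpp
// Toy demo: group segments into speaker blocks using "speaker_turn_next".
#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " transcript.json\n";
        return 1;
    }

    std::ifstream ifs(argv[1]);
    nlohmann::json doc = nlohmann::json::parse(ifs);

    int speaker = 0;
    std::cout << "speaker " << speaker << ":";
    for (const auto & seg : doc["transcription"]) {
        std::cout << seg["text"].get<std::string>();
        // The flag marks that the *next* segment starts with a different speaker.
        if (seg.value("speaker_turn_next", false)) {
            std::cout << "\nspeaker " << ++speaker << ":";
        }
    }
    std::cout << "\n";
    return 0;
}
```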
@akashmjn Yes indeed, thanks for the pointer!