whisper.cpp
CSV format export trims spaces
The CSV export trims leading spaces from each text field, which is a problem. The vtt and srt exports don't do this.
The command I use to transcribe an audio file: ./main --model ./models/ggml-large.bin --file audio.wav --output-csv --max-len 1
The comment on this line https://github.com/ggerganov/whisper.cpp/pull/340/files#diff-2d3599a9fad195f2c3c60bd06691bc1815325b3560b5feda41a91fa71194e805R344 says that a leading space should always be removed. That is wrong when a word is split into several chunks. An example of such a split:
8630, 9070, "greatest"
9070, 9230, "Pon"
9230, 9340, "zi"
9340, 9670, "scheme"
9670, 9780, "in"
9780, 10050, "human"
10050, 10480, "history"
Ponzi is a single word.
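For illustration only, here is a minimal sketch of a CSV row formatter that writes the text field verbatim, leading space included. `csv_row` is a hypothetical helper, not the function used in whisper.cpp or in #340; the `start, end, "text"` layout is copied from the output above, and doubling embedded quotes is an assumption borrowed from common CSV practice.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: format one CSV row as `t0, t1, "text"`.
// The text is copied verbatim, so a leading space survives.
std::string csv_row(int64_t t0_ms, int64_t t1_ms, const std::string& text) {
    std::string quoted = "\"";
    for (char ch : text) {
        if (ch == '"') {
            quoted += '"';  // escape an embedded quote by doubling it
        }
        quoted += ch;
    }
    quoted += '"';
    return std::to_string(t0_ms) + ", " + std::to_string(t1_ms) + ", " + quoted;
}
```

With this, `csv_row(9070, 9230, " Pon")` keeps the space inside the quotes, while `csv_row(9230, 9340, "zi")` has none, so a reader of the file can still tell that "zi" continues the previous chunk.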
Here is the same part of the transcription in srt format:
31
00:00:08,630 --> 00:00:09,070
greatest
32
00:00:09,070 --> 00:00:09,230
Pon
33
00:00:09,230 --> 00:00:09,340
zi
34
00:00:09,340 --> 00:00:09,670
scheme
35
00:00:09,670 --> 00:00:09,780
in
36
00:00:09,780 --> 00:00:10,050
human
37
00:00:10,050 --> 00:00:10,480
history
In the srt output, every word/chunk except "zi" has a space before it, so the chunks can be glued back into correct sentences. Unfortunately, the csv format doesn't allow this because the spaces are gone.
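The gluing described above can be sketched as plain concatenation, assuming each chunk still carries its leading space. `join_chunks` is a hypothetical name used only for this example.

```cpp
#include <string>
#include <vector>

// Concatenate chunk texts as-is. A leading space on a chunk marks the start
// of a new word; a chunk without one continues the previous word.
std::string join_chunks(const std::vector<std::string>& chunks) {
    std::string out;
    for (const auto& chunk : chunks) {
        out += chunk;
    }
    // Drop the single leading space contributed by the first chunk, if any.
    if (!out.empty() && out.front() == ' ') {
        out.erase(0, 1);
    }
    return out;
}
```

Feeding it `{" greatest", " Pon", "zi", " scheme"}` gives "greatest Ponzi scheme". Feeding it the trimmed chunks `{"greatest", "Pon", "zi", "scheme"}` gives "greatestPonzischeme", and there is no way to tell where the word boundaries were.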
This issue is a follow-up to #340. cc @NielsMayer
@ggerganov BTW, is it OK that whisper.cpp splits "Ponzi" into "Pon" and "zi"? With --max-len 1 I get tons of such chunks in the transcription results.
I'm not a C++ developer, but here is my attempt at a fix: #444. Please take a look.
@alex-bacart
--max-len 1 means that at most 1 token is output per text segment.
The word " Ponzi" consists of 2 tokens, Pon and zi, and is therefore split.
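If word-level output is needed, token-level segments can be merged back into words after transcription, provided the leading spaces are preserved. The following is a rough sketch under that assumption; `Segment` and `merge_tokens_into_words` are hypothetical names, not part of the whisper.cpp API, and timestamps are taken to be in milliseconds as in the CSV output.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container for one output segment.
struct Segment {
    int64_t t0;        // start time, ms
    int64_t t1;        // end time, ms
    std::string text;  // token text, with its leading space if it has one
};

// Merge token-level segments into word-level segments.
// A token that starts with a space opens a new word; any other token is
// appended to the current word and extends its end time.
std::vector<Segment> merge_tokens_into_words(const std::vector<Segment>& tokens) {
    std::vector<Segment> words;
    for (const auto& tok : tokens) {
        const bool starts_word =
            words.empty() || (!tok.text.empty() && tok.text.front() == ' ');
        if (starts_word) {
            words.push_back(tok);
        } else {
            words.back().t1 = tok.t1;
            words.back().text += tok.text;
        }
    }
    return words;
}
```

Applied to the tokens " Pon" (9070 to 9230) and "zi" (9230 to 9340), this yields a single segment " Ponzi" spanning 9070 to 9340. The same merge is impossible on the trimmed CSV text, because "Pon" and "zi" look like two separate words there.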