whisper.cpp

CSV format export trims spaces

Open alex-bacart opened this issue 1 year ago • 2 comments

The CSV export trims leading spaces, which is a problem. The VTT and SRT formats don't do this.

The command I use to transcribe the audio file: ./main --model ./models/ggml-large.bin --file audio.wav --output-csv --max-len 1

The comment on this line https://github.com/ggerganov/whisper.cpp/pull/340/files#diff-2d3599a9fad195f2c3c60bd06691bc1815325b3560b5feda41a91fa71194e805R344 says that every time we get a space we should remove it. That is wrong when a word is split into several chunks. An example of such a split:

8630, 9070, "greatest"
9070, 9230, "Pon"
9230, 9340, "zi"
9340, 9670, "scheme"
9670, 9780, "in"
9780, 10050, "human"
10050, 10480, "history"

"Ponzi" is a single word. Here is the same part of the transcription in SRT format:

31
00:00:08,630 --> 00:00:09,070
 greatest

32
00:00:09,070 --> 00:00:09,230
 Pon

33
00:00:09,230 --> 00:00:09,340
zi

34
00:00:09,340 --> 00:00:09,670
 scheme

35
00:00:09,670 --> 00:00:09,780
 in

36
00:00:09,780 --> 00:00:10,050
 human

37
00:00:10,050 --> 00:00:10,480
 history

Every word/chunk except "zi" has a space before it, so the chunks can be glued back into correct sentences. Unfortunately, the CSV format doesn't allow this, because the leading spaces are gone.
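To illustrate why the leading space matters, here is a minimal Python sketch (not part of whisper.cpp; the function name `glue_chunks` is hypothetical) that joins the chunks from the SRT output above. It assumes a leading space marks the start of a new word and that a chunk without one continues the previous word.

```python
def glue_chunks(chunks):
    """Join transcript chunks, using the leading space as the word boundary."""
    # Plain concatenation is enough: each new word brings its own space.
    return "".join(chunks).strip()


# Chunk texts as they appear in the SRT output (leading spaces preserved).
srt_chunks = [" greatest", " Pon", "zi", " scheme", " in", " human", " history"]
print(glue_chunks(srt_chunks))  # greatest Ponzi scheme in human history

# The same chunks with leading spaces trimmed, as in the CSV output.
csv_chunks = [c.strip() for c in srt_chunks]
print("".join(csv_chunks))   # greatestPonzischemeinhumanhistory
print(" ".join(csv_chunks))  # greatest Pon zi scheme in human history
```

With the trimmed chunks, neither joining strategy recovers the original sentence, because nothing tells "zi" apart from the start of a new word.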

This issue is a follow-up to #340. cc @NielsMayer

alex-bacart, Jan 23 '23 21:01

@ggerganov BTW, is it expected that whisper.cpp splits "Ponzi" into "Pon" and "zi"? Using --max-len 1, I get tons of such chunks in the transcription results.

alex-bacart, Jan 23 '23 21:01

I'm not a C++ developer, but here is my attempt at a fix: #444. Please take a look.

alex-bacart, Jan 24 '23 17:01

@alex-bacart --max-len 1 means that at most 1 token is output per text segment. The word " Ponzi" consists of 2 tokens, Pon and zi, and is therefore split.
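For anyone post-processing such output, token-level rows can be merged back into word-level rows as long as the leading spaces are preserved. The following Python sketch is only an illustration under that assumption; `merge_tokens` and the `(start_ms, end_ms, text)` row layout are hypothetical and not part of whisper.cpp.

```python
def merge_tokens(rows):
    """Merge token rows (start_ms, end_ms, text) into word rows.

    Assumes a token that starts with a space begins a new word and any
    other token continues the previous word.
    """
    words = []
    for start, end, text in rows:
        if words and not text.startswith(" "):
            # Continuation token: extend the previous word and its end time.
            prev_start, _, prev_text = words[-1]
            words[-1] = (prev_start, end, prev_text + text)
        else:
            # New word: drop the boundary space once it has served its purpose.
            words.append((start, end, text.lstrip()))
    return words


# Timestamps taken from the example in this issue, with leading spaces kept.
rows = [
    (8630, 9070, " greatest"),
    (9070, 9230, " Pon"),
    (9230, 9340, "zi"),
    (9340, 9670, " scheme"),
]
for word in merge_tokens(rows):
    print(word)
```

The merged word takes the start time of its first token and the end time of its last token, so "Ponzi" spans 9070 to 9340.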

ggerganov, Feb 04 '23 06:02