whisper.cpp icon indicating copy to clipboard operation
whisper.cpp copied to clipboard

Prevent word splitting when using max-len option

Open mightymatth opened this issue 1 year ago • 1 comments

Description

In the Croatian language, we often get short sub-word tokens. When we choose max-len option, words often get split, and we don't want that.

It looks like this:

[00:00:00.000 --> 00:00:01.050]   Zagreb se nalazi u
[00:00:01.050 --> 00:00:02.310]   kontinentalnoj Sred
[00:00:02.310 --> 00:00:03.740]  išnjoj Hrvatskoj,
[00:00:03.740 --> 00:00:04.640]   na južnim obronc
[00:00:04.640 --> 00:00:05.960]  ima Medvednice te na
[00:00:05.960 --> 00:00:07.500]   obalama rijeke Save
[00:00:07.500 --> 00:00:07.500]  .

To fix that, the idea is to split the segment only if we reach the limit set with max-len AND when the next token starts with a delimiter (currently a blank space). We don't want to split it before reaching the max length because we can get stuck if our wanted max length is lower than a single word.

Now it looks like this:

[00:00:00.000 --> 00:00:01.050]  Zagreb se nalazi u
[00:00:01.050 --> 00:00:02.810]  kontinentalnoj Središnjoj
[00:00:02.810 --> 00:00:04.220]  Hrvatskoj, na južnim
[00:00:04.220 --> 00:00:05.610]  obroncima Medvednice
[00:00:05.610 --> 00:00:07.130]  te na obalama rijeke
[00:00:07.130 --> 00:00:07.500]  Save.

I also trimmed the output line.

The file is located here, and the command to run it is:

 ./main -m ./models/ggml-large.bin -l hr -ml 20 -f ./croatian-sample-short.wav

Please, modify and refactor it as needed, this is just the idea.

mightymatth avatar Jan 27 '23 01:01 mightymatth

@mightymatth Thanks for this contribution - I think this is very useful! Although it is OK to merge like this, I will likely change it to have a bool flag in the params that specifies whether to do token or word splitting in order to keep both functionalities. If you feel like it, you can make the change yourself. If not, I will do it at some point the in the following days :)

ggerganov avatar Feb 04 '23 07:02 ggerganov

I've added --split-on-word (--sow) flag to turn this on (it's turned off by default). You can check these commits. You are free to modify this solution in terms of naming, positioning, and coding style.

mightymatth avatar Feb 04 '23 22:02 mightymatth

When --split-on-word is false, the space at the beginning of the text output is necessary to know whether or not the current output is a continuation of the previous word or a new word.

Because that space is now trimmed with this PR, the --split-on-word false case is now broken.

Before:

1
00:00:01,580 --> 00:00:01,650
 N

2
00:00:01,650 --> 00:00:01,670
erv

3
00:00:01,670 --> 00:00:02,110
ous

4
00:00:02,110 --> 00:00:02,390
,

5
00:00:02,390 --> 00:00:02,410
 Ch

6
00:00:02,410 --> 00:00:02,560
az

7
00:00:02,560 --> 00:00:03,120
 said

Now:

1
00:00:01,580 --> 00:00:01,650
N

2
00:00:01,650 --> 00:00:01,670
erv

3
00:00:01,670 --> 00:00:02,110
ous

4
00:00:02,110 --> 00:00:02,390
,

5
00:00:02,390 --> 00:00:02,410
Ch

6
00:00:02,410 --> 00:00:02,560
az

7
00:00:02,560 --> 00:00:03,120
said

Note how the spaces delineate the start of the words "Nervous" and "Chaz".

boolemancer avatar Feb 06 '23 08:02 boolemancer

...the space at the beginning of the text output is necessary to know whether or not the current output is a continuation of the previous word or a new word.

I didn't know that the leading space has a practical meaning, so I proposed to have it trimmed. However, if it is the case, should we revert the trimming of the output at all? or have it also under the flag?

mightymatth avatar Feb 06 '23 09:02 mightymatth

However, if it is the case, should we revert the trimming of the output at all? or have it also under the flag?

You're not the only one that was confused by the space. Issue #397 also questions why they're there. :)

It doesn't seem like the leading space is necessary if you actually do only split on word boundaries, so I think it's probably fine if it's also under the flag.

I went ahead and created a PR #476.

boolemancer avatar Feb 06 '23 09:02 boolemancer

I support both solutions; the one proposed in your PR and having it behind the flag. However, having it behind the flag might be redundant because you won't need that info (+ there are already many flags, not sure if we need another one). We can see how others see it.

mightymatth avatar Feb 06 '23 09:02 mightymatth