whisper-ctranslate2 More natural line-wrapping when using --max_line_width

By default, Whisper produces subtitles (SRT/VTT) with often quite long lines. For some uses these can be too long for viewers to read comfortably (a common recommendation is that subtitle lines should be at most ~50 characters long). For example, testing with "The Expert":


1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and enhance intangible assets.

3
00:00:08,080 --> 00:00:13,660
In pursuit of these objectives, we've started a new project for which we require seven red lines.

If I want them shorter, I can use something like --max_line_count 2 --max_line_width 50, which does result in very consistent, short lines. However, the current line-wrapping implementation produces subtitles that are quite unnatural to read, because line and subtitle breaks do not fall on sentence or sub-sentence boundaries:


1
00:00:00,000 --> 00:00:05,800
Our company has a new strategic initiative to
increase market penetration, maximise brand

2
00:00:05,800 --> 00:00:11,700
loyalty and enhance intangible assets. In pursuit
of these objectives, we've started a new project

3
00:00:11,700 --> 00:00:16,480
for which we require seven red lines. I understand
your company can help us in this matter. Of

This PR changes this by wrapping lines in a more natural way: it splits them on periods or commas if possible, and otherwise on the longest gap around the middle of the too-long line. The result is text that is more natural to read while staying within the set --max_line_width constraint:

1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic
initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and
enhance intangible assets.

3
00:00:08,080 --> 00:00:12,060
In pursuit of these objectives,
we've started a new project for which

4
00:00:12,060 --> 00:00:13,660
we require seven red lines.
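
The wrapping rule described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `Word` class and the `choose_split` and `wrap` helpers are hypothetical names, "longest gap" is assumed to mean the longest pause between two words according to the word timestamps, and "around the middle" is assumed to mean within a quarter of the line length from its midpoint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical container for one word with its timestamps (in seconds).
    text: str
    start: float
    end: float


# Only the two marks named in the description; an assumption of this sketch.
PUNCTUATION = (".", ",")


def line_text(words: List[Word]) -> str:
    return " ".join(w.text for w in words)


def choose_split(words: List[Word]) -> int:
    """Return i such that the line breaks between words[i - 1] and words[i]."""
    total = len(line_text(words))
    middle = total / 2

    def offset(i: int) -> float:
        # Distance (in characters) of break position i from the line's midpoint.
        return abs(len(line_text(words[:i])) - middle)

    candidates = range(1, len(words))

    # 1. Prefer a break right after a period or comma, closest to the middle.
    punct = [i for i in candidates if words[i - 1].text.endswith(PUNCTUATION)]
    if punct:
        return min(punct, key=offset)

    # 2. Otherwise take the longest pause among breaks near the middle.
    window = [i for i in candidates if offset(i) <= total / 4] or list(candidates)
    return max(window, key=lambda i: words[i].start - words[i - 1].end)


def wrap(words: List[Word], max_width: int) -> List[str]:
    """Split words into lines of at most max_width characters where possible."""
    if len(line_text(words)) <= max_width or len(words) < 2:
        return [line_text(words)]
    i = choose_split(words)
    return wrap(words[:i], max_width) + wrap(words[i:], max_width)
```

Grouping the resulting lines into subtitles of at most --max_line_count lines, and assigning each subtitle the start and end time of its first and last word, is left out of the sketch.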

I've tested that:

Diarization output is the same
Works regardless of language
The JSON output is not changed

I'm not super familiar with Python, so this code is probably not the nicest. Any feedback is appreciated!

Jan 17 '24 16:01 JonasCz

Does this work with --highlight_words?

Feb 12 '24 14:02 Purfview

Yes, testing with --highlight_words True results in "karaoke style" underlined words as expected.

Feb 13 '24 17:02 JonasCz

Did you mean underlined and with "more natural line-wrapping"?

Feb 13 '24 17:02 Purfview

Yes, both together work, i.e. --word_timestamps True --highlight_words True --max_line_count 2 --max_line_width 50 gives underlines and natural line wraps as shown above.

Feb 13 '24 18:02 JonasCz

Thx, then maybe I'll borrow your PR for my repo to work with "highlight_words" as my implementation of "max_line_width/max_line_count" is not compatible with "highlight_words".

Feb 13 '24 18:02 Purfview

@JonasCz, nice extension! Does it detect sentence endings besides period, like '?', '!' and even '-' ?

Anyway, it seems that your fork fails to run when --max_line_width is not given but --word_timestamps is set to True. It can be checked by running the following in the base folder of the repo: whisper-ctranslate2 --model medium --language Catalan --output_format srt --word_timestamps True ./e2e-tests/gossos.mp3

It is also worth running the tests and modifying them if needed (right now they fail, unfortunately): make run-tests (the following packages need to be installed first: pip install torch pyannote.audio)

Feb 16 '24 08:02 Lycoan
