More natural line-wrapping when using --max_line_width
By default, Whisper produces subtitles (SRT/VTT) whose lines are often quite long. For some uses these are too long for viewers to read comfortably (a common recommendation is that subtitle lines should be about 50 characters at most). For example, testing with "The Expert":
1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic initiative to increase market penetration,
2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and enhance intangible assets.
3
00:00:08,080 --> 00:00:13,660
In pursuit of these objectives, we've started a new project for which we require seven red lines.
If I want them shorter, I can use something like --max_line_count 2 --max_line_width 50. This does produce consistently short lines, but the current line-wrapping implementation yields subtitles that are quite unnatural to read, because line breaks and subtitle breaks do not fall on sentence or clause boundaries:
1
00:00:00,000 --> 00:00:05,800
Our company has a new strategic initiative to
increase market penetration, maximise brand
2
00:00:05,800 --> 00:00:11,700
loyalty and enhance intangible assets. In pursuit
of these objectives, we've started a new project
3
00:00:11,700 --> 00:00:16,480
for which we require seven red lines. I understand
your company can help us in this matter. Of
This PR changes that by wrapping lines in a more natural way: it splits them on periods or commas where possible, and otherwise on the longest gap around the middle of the too-long line. The resulting text reads more naturally while staying within the --max_line_width constraint:
1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic
initiative to increase market penetration,
2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and
enhance intangible assets.
3
00:00:08,080 --> 00:00:12,060
In pursuit of these objectives,
we've started a new project for which
4
00:00:12,060 --> 00:00:13,660
we require seven red lines.
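To make the wrapping rule concrete, here is a minimal sketch of the idea. It is not the actual code in this PR. The `Word` class, `choose_split`, `wrap` and `BREAK_CHARS` are hypothetical names used only for illustration, and it assumes that "longest gap" means the longest pause between consecutive word timestamps. It only handles splitting text into lines and ignores --max_line_count and subtitle timing.

```python
from dataclasses import dataclass
from typing import List

# Assumption: only periods and commas count as natural break points.
BREAK_CHARS = (".", ",")


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def line_text(words: List[Word]) -> str:
    return " ".join(w.text for w in words)


def choose_split(words: List[Word], max_width: int) -> int:
    """Return i so that words[:i] becomes the next line."""
    # Split points whose first part still fits within max_width.
    candidates = [
        i for i in range(1, len(words))
        if len(line_text(words[:i])) <= max_width
    ]
    if not candidates:
        return 1  # a single over-long word; nothing better to do
    middle = len(words) / 2
    # 1) Prefer a break right after a period or comma, nearest the middle.
    punct = [i for i in candidates if words[i - 1].text.endswith(BREAK_CHARS)]
    if punct:
        return min(punct, key=lambda i: abs(i - middle))
    # 2) Otherwise take the longest pause among candidates near the middle.
    window = [i for i in candidates if abs(i - middle) <= len(words) / 4]
    window = window or candidates
    return max(window, key=lambda i: words[i].start - words[i - 1].end)


def wrap(words: List[Word], max_width: int) -> List[str]:
    """Split words into lines no longer than max_width where possible."""
    lines: List[str] = []
    rest = list(words)
    while rest and len(line_text(rest)) > max_width:
        i = choose_split(rest, max_width)
        lines.append(line_text(rest[:i]))
        rest = rest[i:]
    if rest:
        lines.append(line_text(rest))
    return lines


# Example with made-up, evenly spaced timestamps.
sentence = "In pursuit of these objectives, we've started a new project"
demo_words = [
    Word(t, start=0.4 * n, end=0.4 * n + 0.4)
    for n, t in enumerate(sentence.split())
]
demo_lines = wrap(demo_words, max_width=50)
```

With a 50-character limit, the comma after "objectives," is chosen as the break, so the sentence is split there rather than at the last word that happens to fit.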
I've tested that:
- Diarization output is the same
- Works regardless of language
- The JSON output is not changed
I'm not super familiar with Python, so this code is probably not the nicest. Any feedback is appreciated!
Does this work with --highlight_words?
Yes, testing with --highlight_words True results in "karaoke style" underlined words as expected.
Did you mean underlined and with "more natural line-wrapping"?
Yes, both together work, i.e. --word_timestamps True --highlight_words True --max_line_count 2 --max_line_width 50 gives underlines and natural line wraps as shown above.
Thx, then maybe I'll borrow your PR for my repo to work with "highlight_words" as my implementation of "max_line_width/max_line_count" is not compatible with "highlight_words".
@JonasCz, nice extension! Does it detect sentence endings besides period, like '?', '!' and even '-' ?
Anyway, it seems that your fork fails to run when --max_line_width is not given, but --word_timestamps is set to True.
It can be checked by running the following in the base folder of the repo:
whisper-ctranslate2 --model medium --language Catalan --output_format srt --word_timestamps True ./e2e-tests/gossos.mp3
It is also worth running the tests and modifying them if needed (right now they fail, unfortunately):
make run-tests
(the following packages need to be installed first: pip install torch pyannote.audio)