whisper-ctranslate2

More natural line-wrapping when using --max_line_width

Open JonasCz opened this issue 1 year ago • 6 comments

By default, Whisper produces subtitles (SRT/VTT) whose lines are often quite long. For some uses these can be too long for viewers to read comfortably (a common recommendation is that subtitle lines should be at most ~50 characters long). For example, testing with "The Expert":


1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and enhance intangible assets.

3
00:00:08,080 --> 00:00:13,660
In pursuit of these objectives, we've started a new project for which we require seven red lines.

If I want them shorter, I can use something like --max_line_count 2 --max_line_width 50, which does result in very consistent, short lines. However, the current line-wrapping implementation produces subtitles that are quite unnatural to read, because line and subtitle breaks do not fall on sentence or clause boundaries:


1
00:00:00,000 --> 00:00:05,800
Our company has a new strategic initiative to
increase market penetration, maximise brand

2
00:00:05,800 --> 00:00:11,700
loyalty and enhance intangible assets. In pursuit
of these objectives, we've started a new project

3
00:00:11,700 --> 00:00:16,480
for which we require seven red lines. I understand
your company can help us in this matter. Of

This PR changes that by wrapping lines in a more natural way: it splits them on periods or commas if possible, and otherwise on the longest gap around the middle of the too-long line. The result is text that reads more naturally while staying within the set --max_line_width constraint:

1
00:00:00,000 --> 00:00:04,440
Our company has a new strategic
initiative to increase market penetration,

2
00:00:05,120 --> 00:00:07,720
maximise brand loyalty and
enhance intangible assets.

3
00:00:08,080 --> 00:00:12,060
In pursuit of these objectives,
we've started a new project for which

4
00:00:12,060 --> 00:00:13,660
we require seven red lines.
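
To make the wrapping rule described above concrete, here is a minimal sketch. It is not the code from this PR: the names Word, choose_break and wrap are hypothetical, "gap" is assumed to mean the pause between consecutive word timestamps, and the 25% window used for "around the middle" is an arbitrary choice. It only splits text into lines and does not handle moving overflow into a new subtitle cue.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str     # token text, including any trailing punctuation
    start: float  # start time in seconds
    end: float    # end time in seconds


def join(words: list[Word]) -> str:
    return " ".join(w.text for w in words)


def choose_break(words: list[Word], window: float = 0.25) -> int:
    """Return index i so that the split is words[:i] / words[i:]."""
    total = len(join(words))
    middle = total / 2
    # Character length of the first part for every possible break index.
    offsets = {i: len(join(words[:i])) for i in range(1, len(words))}

    # 1. Prefer a break right after a period or comma, nearest the middle.
    punct = [i for i in offsets if words[i - 1].text.endswith((".", ","))]
    if punct:
        return min(punct, key=lambda i: abs(offsets[i] - middle))

    # 2. Otherwise take the longest pause among breaks near the middle
    #    (assumed window: within 25% of the line length from the centre).
    near = [i for i in offsets if abs(offsets[i] - middle) <= window * total]
    pool = near or list(offsets)
    return max(
        pool,
        key=lambda i: (words[i].start - words[i - 1].end,
                       -abs(offsets[i] - middle)),
    )


def wrap(words: list[Word], max_width: int) -> list[str]:
    """Recursively split words into lines no wider than max_width."""
    if not words:
        return []
    if len(words) == 1 or len(join(words)) <= max_width:
        return [join(words)]  # a single over-long word is left as-is
    i = choose_break(words)
    return wrap(words[:i], max_width) + wrap(words[i:], max_width)
```

Calling wrap(words, 50) on the word list of one segment returns the wrapped lines; a real implementation would additionally need to respect --max_line_count and re-time the cues.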

I've tested that:

  • Diarization output is the same
  • Works regardless of language
  • The JSON output is not changed

I'm not super familiar with Python, so this code is probably not the nicest. Any feedback is appreciated!

JonasCz avatar Jan 17 '24 16:01 JonasCz

Does this work with --highlight_words?

Purfview avatar Feb 12 '24 14:02 Purfview

Yes, testing with --highlight_words True results in "karaoke style" underlined words as expected.

JonasCz avatar Feb 13 '24 17:02 JonasCz

Did you mean underlined and with "more natural line-wrapping"?

Purfview avatar Feb 13 '24 17:02 Purfview

Yes, both work together, i.e. --word_timestamps True --highlight_words True --max_line_count 2 --max_line_width 50 gives underlines and natural line wraps as shown above.

JonasCz avatar Feb 13 '24 18:02 JonasCz

Thx, then maybe I'll borrow your PR for my repo to work with "highlight_words" as my implementation of "max_line_width/max_line_count" is not compatible with "highlight_words".

Purfview avatar Feb 13 '24 18:02 Purfview

@JonasCz, nice extension! Does it detect sentence endings besides period, like '?', '!' and even '-' ?

Anyway, it seems that your fork fails to run when --max_line_width is not given but --word_timestamps is set to True. This can be checked by running the following in the base folder of the repo: whisper-ctranslate2 --model medium --language Catalan --output_format srt --word_timestamps True ./e2e-tests/gossos.mp3

It is also worth running the tests and modifying them if needed (unfortunately, they fail right now): make run-tests (the following packages need to be installed first: pip install torch pyannote.audio)

Lycoan avatar Feb 16 '24 08:02 Lycoan