Feature Request: split on punctuation when max_len is provided

Open markni opened this issue 1 year ago • 2 comments

Background

#455 added split_on_word for use when max_len is set. This prevents a single word from being split in half when the max-len setting is used.

I think it would be beneficial to have another option that splits on punctuation (specifically ,.!?: ) instead. This would greatly improve the quality of machine translation for subtitles, because each subtitle line would then carry a complete unit of meaning. It would also bring raw subtitle quality closer to that of human-authored subtitles.

Whisper occasionally produces very long subtitle lines containing multiple sentences, especially when using the large model. This option would be very useful in those cases.

Proposal

When max_len and split_on_punctuation are both provided, the split happens once max_len has been reached and a punctuation mark (,.!?:) is encountered.

Raw

[00:00:00.000 --> 00:00:11.050] The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used.

max_len and split_on_word set

[00:00:00.000 --> 00:00:03.050] The tensor operators are optimized heavily
[00:00:03.050 --> 00:00:06.050] for Apple silicon CPUs. Depending on the
[00:00:06.050 --> 00:00:09.050] computation size, Arm Neon SIMD intrinsics
[00:00:09.050 --> 00:00:11.050] or CBLAS Accelerate framework routines are used.

max_len and split_on_punctuation set

[00:00:00.000 --> 00:00:04.050] The tensor operators are optimized heavily for Apple silicon CPUs.
[00:00:04.050 --> 00:00:06.050] Depending on the computation size,
[00:00:06.050 --> 00:00:11.050] Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used.
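The splitting rule illustrated above can be sketched on plain text as follows. This is only a rough illustration in Python, not whisper.cpp code: the function name `split_on_punctuation` is hypothetical, and it assumes that "after max len" means "at the first word ending in one of ,.!?: once the line is at least max_len characters long". The value max_len = 30 is chosen purely so that the sample sentence splits the way the example shows.

```python
PUNCTUATION = ",.!?:"


def split_on_punctuation(text, max_len):
    """Break text into lines, splitting only after a word that ends in
    punctuation once the current line has reached max_len characters.
    (Hypothetical helper; the exact rule is an assumption.)"""
    lines = []
    current = ""
    for word in text.split():
        # Grow the current line one word at a time.
        current = f"{current} {word}" if current else word
        # Split only when both conditions hold: long enough, ends in punctuation.
        if len(current) >= max_len and current[-1] in PUNCTUATION:
            lines.append(current)
            current = ""
    if current:
        # Whatever remains becomes the final line, even without punctuation.
        lines.append(current)
    return lines


raw = (
    "The tensor operators are optimized heavily for Apple silicon CPUs. "
    "Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS "
    "Accelerate framework routines are used."
)
for line in split_on_punctuation(raw, max_len=30):
    print(line)
```

Under this interpretation a line can grow well past max_len when no punctuation appears, so a real implementation might still want a hard upper bound or a fallback to split_on_word.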

Implementation

split_on_word works by checking for a space delimiter after max_len is reached; this could be re-implemented with ,.!?: as the delimiters. However, it is not clear to me whether the timestamps need additional adjustment.
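On the timestamp question, one possible approach (an assumption, not something confirmed for whisper.cpp) is to let each new line inherit the start time of its first token and the end time of its last token. The sketch below is in Python rather than whisper.cpp's C++, and the `Token`, `Segment`, and `segment_tokens` names and the integer time unit are all hypothetical.

```python
from dataclasses import dataclass

PUNCTUATION = ",.!?:"


@dataclass
class Token:
    text: str  # token text, including any leading space
    t0: int    # start time (hypothetical integer unit)
    t1: int    # end time


@dataclass
class Segment:
    text: str
    t0: int
    t1: int


def segment_tokens(tokens, max_len):
    """Group tokens into segments, closing a segment once it is at least
    max_len characters long and ends in punctuation. Each segment takes
    t0 from its first token and t1 from its last token."""
    segments = []
    pending = []
    for tok in tokens:
        pending.append(tok)
        text = "".join(t.text for t in pending).strip()
        if len(text) >= max_len and text and text[-1] in PUNCTUATION:
            segments.append(Segment(text, pending[0].t0, pending[-1].t1))
            pending = []
    if pending:
        # Flush the trailing tokens as a final segment.
        text = "".join(t.text for t in pending).strip()
        segments.append(Segment(text, pending[0].t0, pending[-1].t1))
    return segments


tokens = [
    Token("Hello", 0, 50),
    Token(",", 50, 55),
    Token(" world", 55, 100),
    Token(".", 100, 105),
]
segments = segment_tokens(tokens, max_len=5)
```

With this scheme no extra adjustment is needed as long as per-token times are available and reasonably accurate; if punctuation tokens carry unreliable times, the boundaries would inherit that error.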

Mar 26 '23 18:03 markni
