whisper.cpp
Feature Request: split on punctuation when max_len is provided
Background
#455 added split_on_word, which takes effect when max_len is set. It prevents a single word from being split in half when the max-len setting is used.
I think it would be beneficial to have another option that splits on punctuation (specifically ,.!?:) instead. This would greatly improve the quality of machine translation for subtitles, since each subtitle line would then carry a complete meaning. It would also bring raw subtitle quality closer to that of human-authored subtitles.
Whisper occasionally produces a very long subtitle line containing multiple sentences, especially with the large model. This option would be very useful in those cases.
Proposal
When max_len and split_on_punctuation are both provided, a split happens once max_len has been reached and a punctuation mark (one of ,.!?:) is encountered.
Raw
[00:00:00.000 --> 00:00:11.050] The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used.
max_len and split_on_word set
[00:00:00.000 --> 00:00:03.050] The tensor operators are optimized heavily
[00:00:03.050 --> 00:00:06.050] for Apple silicon CPUs. Depending on the
[00:00:06.050 --> 00:00:09.050] computation size, Arm Neon SIMD intrinsics
[00:00:09.050 --> 00:00:11.050] or CBLAS Accelerate framework routines are used.
max_len and split_on_punctuation set
[00:00:00.000 --> 00:00:04.050] The tensor operators are optimized heavily for Apple silicon CPUs.
[00:00:04.050 --> 00:00:06.050] Depending on the computation size,
[00:00:06.050 --> 00:00:11.050] Arm Neon SIMD intrinsics or CBLAS Accelerate framework routines are used.
Implementation
split_on_word works by checking for a space delimiter once max_len has been reached. The same logic could be re-implemented with ,.!?: as the delimiters.
However, it's not clear to me whether the timecodes need additional adjustment.
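To make the delimiter idea concrete, here is a rough standalone sketch of the splitting rule on plain text. It is not whisper.cpp's actual code: the function name split_on_punct and its parameters are hypothetical, it works on characters rather than on whisper tokens, and it does not touch timecodes at all. The extra check that a punctuation mark is followed by a space (or ends the text) is my own assumption, added so that something like "3.14" is not broken apart.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of whisper.cpp.
// Splits `text` into lines, breaking right after the first punctuation mark
// (one of ,.!?:) that is seen once the current line has reached `max_len`
// characters. Timecodes are deliberately ignored in this sketch.
std::vector<std::string> split_on_punct(const std::string & text, std::size_t max_len) {
    const std::string delims = ",.!?:";
    std::vector<std::string> lines;
    std::string cur;

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];

        // Drop spaces that would otherwise start a new line.
        if (cur.empty() && c == ' ') {
            continue;
        }
        cur.push_back(c);

        const bool is_delim = delims.find(c) != std::string::npos;
        // Assumption: only break when the mark ends a word, i.e. it is
        // followed by a space or by the end of the text.
        const bool at_boundary = (i + 1 == text.size()) || text[i + 1] == ' ';

        if (is_delim && at_boundary && cur.size() >= max_len) {
            lines.push_back(cur);
            cur.clear();
        }
    }

    // Keep whatever is left when the text does not end on a split point.
    if (!cur.empty()) {
        lines.push_back(cur);
    }
    return lines;
}
```

Called on the raw sentence from the example with max_len set to 30 (a value picked only for illustration), this produces the same three lines of text as the split_on_punctuation example in the proposal. Text shorter than max_len is returned as a single line.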