whisper.cpp

Word-level timestamp method `--max-len 1` works badly for CJK languages

Open HaujetZhao opened this issue 2 years ago • 0 comments

In the README, the current method for getting word-level timestamps is the `--max-len 1` option.

For CJK languages (and probably other languages that use non-ASCII characters and do not separate words with spaces), `--max-len 1` does not work well: it generates many unreadable characters.
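A likely explanation, stated here as an assumption rather than something confirmed in this issue: Whisper's tokenizer is byte-level, so one CJK character (three bytes in UTF-8) can be spread over two tokens, and `--max-len 1` then cuts the segment between them. The sketch below only illustrates that effect on three characters that go missing in the output further down (课, 录, 决); the helper names are hypothetical and the split position (after the second byte) is a guess, not read from the model's vocabulary.

```python
# Illustration only: what happens when a 3-byte UTF-8 character is cut in two.
# split_after() and is_valid_utf8() are hypothetical helpers, not whisper.cpp APIs.

def split_after(text: str, n: int) -> tuple[bytes, bytes]:
    """Encode text as UTF-8 and cut the byte string after n bytes."""
    data = text.encode("utf-8")
    return data[:n], data[n:]


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as complete UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for ch in "课录决":
        head, tail = split_after(ch, 2)  # assumed split point
        # Neither half is valid UTF-8 on its own, so a viewer shows nothing
        # or falls back to a legacy code page for the stray byte.
        print(ch, head.hex(), tail.hex(),
              is_valid_utf8(head), is_valid_utf8(tail),
              tail.decode("cp1252"))
```

Under this assumption the trailing bytes are 0xBE, 0x95 and 0xB3, which a Windows-1252 fallback renders as ¾, • and ³. Those are the same stray characters that appear in the `.srt` output below, which is consistent with the characters being cut mid-sequence.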


Result without `--max-len`:

>main -m medium.bin -l zh -osrt input.wav
whisper_init_from_file: loading model from 'medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: processing 'input.wav' (111997 samples, 7.0 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:06.000]  趁你现在上课,我现在录一段,刚才我找到整个解决过程

output_srt: saving output to 'input.wav.srt'

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =  1323.14 ms
whisper_print_timings:      mel time =    37.24 ms
whisper_print_timings:   sample time =    20.84 ms /    29 runs (    0.72 ms per run)
whisper_print_timings:   encode time = 30418.15 ms /     1 runs (30418.15 ms per run)
whisper_print_timings:   decode time =  1589.05 ms /    29 runs (   54.79 ms per run)
whisper_print_timings:    total time = 33447.42 ms

Contents of 'input.wav.srt':

1
00:00:00,000 --> 00:00:06,000
趁你现在上课,我现在录一段,刚才我找到整个解决过程


Result with `--max-len 1`:

>main -m medium.bin -l zh -osrt --max-len 1 input.wav
whisper_init_from_file: loading model from 'medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: processing 'input.wav' (111997 samples, 7.0 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:00.090]
[00:00:00.090 --> 00:00:00.220]
[00:00:00.220 --> 00:00:00.350]
[00:00:00.350 --> 00:00:00.930]  你
[00:00:00.930 --> 00:00:01.330]  现在
[00:00:01.330 --> 00:00:01.650]  上
[00:00:01.650 --> 00:00:01.870]
[00:00:01.870 --> 00:00:02.000]
[00:00:02.000 --> 00:00:02.100]  ,
[00:00:02.100 --> 00:00:02.250]  我
[00:00:02.250 --> 00:00:02.550]  现在
[00:00:02.550 --> 00:00:02.650]
[00:00:02.650 --> 00:00:02.690]
[00:00:02.690 --> 00:00:02.850]  一
[00:00:02.850 --> 00:00:03.060]  段
[00:00:03.060 --> 00:00:03.140]  ,
[00:00:03.140 --> 00:00:03.350]  刚
[00:00:03.350 --> 00:00:03.560]  才
[00:00:03.560 --> 00:00:03.770]  我
[00:00:03.770 --> 00:00:03.980]  找
[00:00:03.980 --> 00:00:04.210]  到
[00:00:04.210 --> 00:00:04.440]  整
[00:00:04.440 --> 00:00:04.610]  个
[00:00:04.610 --> 00:00:04.820]  解
[00:00:04.820 --> 00:00:04.970]
[00:00:04.970 --> 00:00:05.080]
[00:00:05.080 --> 00:00:05.540]  过
[00:00:05.540 --> 00:00:06.000]  程

output_srt: saving output to 'input.wav.srt'

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =  1286.20 ms
whisper_print_timings:      mel time =    37.94 ms
whisper_print_timings:   sample time =    22.24 ms /    29 runs (    0.77 ms per run)
whisper_print_timings:   encode time = 30423.32 ms /     1 runs (30423.32 ms per run)
whisper_print_timings:   decode time =  1604.10 ms /    29 runs (   55.31 ms per run)
whisper_print_timings:    total time = 33465.68 ms

Contents of 'input.wav.srt':

1
00:00:00,000 --> 00:00:00,090


2
00:00:00,090 --> 00:00:00,220
趍

3
00:00:00,220 --> 00:00:00,350


4
00:00:00,350 --> 00:00:00,930
你

5
00:00:00,930 --> 00:00:01,330
现在

6
00:00:01,330 --> 00:00:01,650
上

7
00:00:01,650 --> 00:00:01,870
词

8
00:00:01,870 --> 00:00:02,000
¾

9
00:00:02,000 --> 00:00:02,100
,

10
00:00:02,100 --> 00:00:02,250
我

11
00:00:02,250 --> 00:00:02,550
现在

12
00:00:02,550 --> 00:00:02,650
彍

13
00:00:02,650 --> 00:00:02,690
•

14
00:00:02,690 --> 00:00:02,850
一

15
00:00:02,850 --> 00:00:03,060
段

16
00:00:03,060 --> 00:00:03,140
,

17
00:00:03,140 --> 00:00:03,350
刚

18
00:00:03,350 --> 00:00:03,560
才

19
00:00:03,560 --> 00:00:03,770
我

20
00:00:03,770 --> 00:00:03,980
找

21
00:00:03,980 --> 00:00:04,210
到

22
00:00:04,210 --> 00:00:04,440
整

23
00:00:04,440 --> 00:00:04,610
个

24
00:00:04,610 --> 00:00:04,820
解

25
00:00:04,820 --> 00:00:04,970
再

26
00:00:04,970 --> 00:00:05,080
³

27
00:00:05,080 --> 00:00:05,540
过

28
00:00:05,540 --> 00:00:06,000
程

So, if possible, I hope you can implement a true word-level timestamp feature (mentioned in #375).
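Until such a feature exists, one possible workaround is to post-process the per-token segments and merge neighbours until their bytes form complete UTF-8. The sketch below is not the requested feature and not part of whisper.cpp. It assumes the raw bytes of every segment are available as `(start, end, bytes)` tuples, a hypothetical layout. The sample timestamps are taken from the log above, while the byte values for the two empty segments are assumed.

```python
# Workaround sketch: merge adjacent segments until the text is valid UTF-8.
# The Segment layout and merge_partial_utf8() are hypothetical, for illustration.
from typing import List, Tuple

Segment = Tuple[float, float, bytes]   # (start_s, end_s, raw UTF-8 bytes)
Merged = Tuple[float, float, str]      # (start_s, end_s, decoded text)


def merge_partial_utf8(segments: List[Segment]) -> List[Merged]:
    merged: List[Merged] = []
    buf = b""
    buf_start = None
    for start, end, raw in segments:
        if buf_start is None:
            buf_start = start          # first piece of a new character
        buf += raw
        try:
            text = buf.decode("utf-8")
        except UnicodeDecodeError:
            continue                   # still incomplete, keep accumulating
        merged.append((buf_start, end, text))
        buf, buf_start = b"", None
    if buf:                            # leftover bytes that never completed
        merged.append((buf_start, segments[-1][1],
                       buf.decode("utf-8", errors="replace")))
    return merged


if __name__ == "__main__":
    sample = [
        (1.33, 1.65, "上".encode("utf-8")),
        (1.65, 1.87, b"\xe8\xaf"),     # assumed first two bytes of 课
        (1.87, 2.00, b"\xbe"),         # assumed last byte of 课
        (2.00, 2.10, b","),
    ]
    for seg in merge_partial_utf8(sample):
        print(seg)
```

With the assumed bytes, the two empty segments collapse into one segment for 课 spanning 1.65 to 2.00 seconds. Note that the sketch keeps accumulating on any decode error, so a genuinely corrupt byte would swallow everything after it; a real implementation would need a limit on how many segments it merges.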

HaujetZhao, Apr 14 '23 14:04