whisper.cpp
Word-level timestamp method `--max-len 1` works poorly for CJK languages
The README currently recommends the `--max-len 1` option for getting word-level timestamps.
For CJK languages (and probably other languages that use non-ASCII characters and do not separate words with spaces), `--max-len 1` does not work well: it generates many unreadable characters.
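One likely explanation, which is my assumption and not something I have confirmed in the whisper.cpp source, is that a single CJK character can be split across several byte-level tokens, so cutting the output at every token leaves fragments that are not valid UTF-8 on their own. A minimal Python sketch of that effect follows; the split points are hypothetical and are not taken from the model's real vocabulary.

```python
# Illustration only: the byte split below is hypothetical, chosen to show
# what happens when one character is cut into per-token fragments.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

raw = "趁".encode("utf-8")              # one character, three bytes in UTF-8
pieces = [raw[:1], raw[1:2], raw[2:]]   # pretend each byte came from its own token

for piece in pieces:
    # Each fragment alone is not a complete UTF-8 sequence.
    print(piece.hex(), is_valid_utf8(piece))

# Only the concatenation of all fragments decodes back to the character.
print(is_valid_utf8(b"".join(pieces)))
```

If this is what happens, it would match the output below, where the first character of the sentence seems to be spread over the first three timestamp entries.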
Result without `--max-len`:
>main -m medium.bin -l zh -osrt input.wav
whisper_init_from_file: loading model from 'medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: processing 'input.wav' (111997 samples, 7.0 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:06.000] 趁你现在上课,我现在录一段,刚才我找到整个解决过程
output_srt: saving output to 'input.wav.srt'
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 1323.14 ms
whisper_print_timings: mel time = 37.24 ms
whisper_print_timings: sample time = 20.84 ms / 29 runs ( 0.72 ms per run)
whisper_print_timings: encode time = 30418.15 ms / 1 runs (30418.15 ms per run)
whisper_print_timings: decode time = 1589.05 ms / 29 runs ( 54.79 ms per run)
whisper_print_timings: total time = 33447.42 ms
1
00:00:00,000 --> 00:00:06,000
趁你现在上课,我现在录一段,刚才我找到整个解决过程
And with the `--max-len 1` option:
>main -m medium.bin -l zh -osrt --max-len 1 input.wav
whisper_init_from_file: loading model from 'medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: processing 'input.wav' (111997 samples, 7.0 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.090]
[00:00:00.090 --> 00:00:00.220]
[00:00:00.220 --> 00:00:00.350]
[00:00:00.350 --> 00:00:00.930] 你
[00:00:00.930 --> 00:00:01.330] 现在
[00:00:01.330 --> 00:00:01.650] 上
[00:00:01.650 --> 00:00:01.870]
[00:00:01.870 --> 00:00:02.000]
[00:00:02.000 --> 00:00:02.100] ,
[00:00:02.100 --> 00:00:02.250] 我
[00:00:02.250 --> 00:00:02.550] 现在
[00:00:02.550 --> 00:00:02.650]
[00:00:02.650 --> 00:00:02.690]
[00:00:02.690 --> 00:00:02.850] 一
[00:00:02.850 --> 00:00:03.060] 段
[00:00:03.060 --> 00:00:03.140] ,
[00:00:03.140 --> 00:00:03.350] 刚
[00:00:03.350 --> 00:00:03.560] 才
[00:00:03.560 --> 00:00:03.770] 我
[00:00:03.770 --> 00:00:03.980] 找
[00:00:03.980 --> 00:00:04.210] 到
[00:00:04.210 --> 00:00:04.440] 整
[00:00:04.440 --> 00:00:04.610] 个
[00:00:04.610 --> 00:00:04.820] 解
[00:00:04.820 --> 00:00:04.970]
[00:00:04.970 --> 00:00:05.080]
[00:00:05.080 --> 00:00:05.540] 过
[00:00:05.540 --> 00:00:06.000] 程
output_srt: saving output to 'input.wav.srt'
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 1286.20 ms
whisper_print_timings: mel time = 37.94 ms
whisper_print_timings: sample time = 22.24 ms / 29 runs ( 0.77 ms per run)
whisper_print_timings: encode time = 30423.32 ms / 1 runs (30423.32 ms per run)
whisper_print_timings: decode time = 1604.10 ms / 29 runs ( 55.31 ms per run)
whisper_print_timings: total time = 33465.68 ms
1
00:00:00,000 --> 00:00:00,090
2
00:00:00,090 --> 00:00:00,220
趍
3
00:00:00,220 --> 00:00:00,350
4
00:00:00,350 --> 00:00:00,930
你
5
00:00:00,930 --> 00:00:01,330
现在
6
00:00:01,330 --> 00:00:01,650
上
7
00:00:01,650 --> 00:00:01,870
词
8
00:00:01,870 --> 00:00:02,000
¾
9
00:00:02,000 --> 00:00:02,100
,
10
00:00:02,100 --> 00:00:02,250
我
11
00:00:02,250 --> 00:00:02,550
现在
12
00:00:02,550 --> 00:00:02,650
彍
13
00:00:02,650 --> 00:00:02,690
14
00:00:02,690 --> 00:00:02,850
一
15
00:00:02,850 --> 00:00:03,060
段
16
00:00:03,060 --> 00:00:03,140
,
17
00:00:03,140 --> 00:00:03,350
刚
18
00:00:03,350 --> 00:00:03,560
才
19
00:00:03,560 --> 00:00:03,770
我
20
00:00:03,770 --> 00:00:03,980
找
21
00:00:03,980 --> 00:00:04,210
到
22
00:00:04,210 --> 00:00:04,440
整
23
00:00:04,440 --> 00:00:04,610
个
24
00:00:04,610 --> 00:00:04,820
解
25
00:00:04,820 --> 00:00:04,970
再
26
00:00:04,970 --> 00:00:05,080
³
27
00:00:05,080 --> 00:00:05,540
过
28
00:00:05,540 --> 00:00:06,000
程
So, if possible, I hope you can implement true word-level timestamps (mentioned in #375).
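As a rough idea of what a fix or post-processing step could look like, here is a Python sketch that merges consecutive fragments until the accumulated bytes decode as valid UTF-8, keeping the start time of the first fragment and the end time of the last. This is only a sketch under assumptions: the `(start_ms, end_ms, raw_bytes)` tuple layout and the `merge_pieces` name are hypothetical and do not correspond to any existing whisper.cpp API, and it yields per-character rather than true per-word timestamps.

```python
from typing import List, Optional, Tuple

Piece = Tuple[int, int, bytes]  # (start_ms, end_ms, raw token bytes), hypothetical layout
Unit = Tuple[int, int, str]     # (start_ms, end_ms, decoded text)

def merge_pieces(pieces: List[Piece]) -> List[Unit]:
    """Merge adjacent fragments until the buffered bytes form valid UTF-8."""
    merged: List[Unit] = []
    buf = b""
    start: Optional[int] = None
    end = 0
    for t0, t1, data in pieces:
        if start is None:
            start = t0          # remember where the current unit begins
        buf += data
        end = t1
        try:
            text = buf.decode("utf-8")
        except UnicodeDecodeError:
            continue            # not decodable yet, keep accumulating
        merged.append((start, end, text))
        buf, start = b"", None
    if buf and start is not None:
        # Leftover bytes that never became valid: emit them with replacement marks.
        merged.append((start, end, buf.decode("utf-8", errors="replace")))
    return merged

# Timestamps taken from the first four entries of the log above;
# the one-byte-per-entry split of the first character is an assumption.
raw = "趁".encode("utf-8")
example = [
    (0, 90, raw[:1]),
    (90, 220, raw[1:2]),
    (220, 350, raw[2:]),
    (350, 930, "你".encode("utf-8")),
]
print(merge_pieces(example))  # [(0, 350, '趁'), (350, 930, '你')]
```

Grouping the merged characters into real words would still need a separate segmentation step, which is outside this sketch.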