whisper.cpp
UTF8 issue with command line parameters in Windows version
If I pass the file "Chinese audio (中文).mp3" to the Windows command-line version, it exits with an error:
rem Here main.exe has been renamed to whisper.exe
C:\...\whisp>whisper.exe --model models\ggml-tiny.bin --language chinese "Chinese file (中文).mp3"
whisper_init_from_file: loading model from 'models\ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 127.00 MB (+ 3.00 MB per decoder)
whisper_model_load: kv self size = 2.62 MB
whisper_model_load: kv cross size = 8.79 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 73.58 MB
whisper_model_load: model size = 73.54 MB
error: failed to open 'Chinese file (??).mp3' as WAV file
error: failed to read WAV file 'Chinese file (??).mp3'
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 398.52 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 399.54 ms
It runs fine when I rename the file to omit the Chinese characters.
I've also tried setting the code page to UTF-8 with chcp 65001, with no luck.
(The macOS version works fine.)
whisper.cpp does not support .mp3 files. The input has to be 16 kHz WAV.
Yes, apologies, I copied the wrong output example.
This one is with a 16 kHz WAV file.
IMHO the problem is the Unicode characters in the file name:
D:\Tools\WhisperGUI>bin\whisper.exe --model models\ggml-tiny.en.bin --language en ..\file_中文.wav
whisper_init_from_file: loading model from 'models\ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 127.00 MB (+ 3.00 MB per decoder)
whisper_model_load: kv self size = 2.62 MB
whisper_model_load: kv cross size = 8.79 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 73.58 MB
whisper_model_load: model size = 73.54 MB
error: failed to open '..\file_??.wav' as WAV file
error: failed to read WAV file '..\file_??.wav'
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 494.26 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 495.03 ms
D:\Tools\WhisperGUI>copy ..\file_中文.wav ..\file.wav
1 file(s) copied.
D:\Tools\WhisperGUI>bin\whisper.exe --model models\ggml-tiny.en.bin --language en ..\file.wav
whisper_init_from_file: loading model from 'models\ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 127.00 MB (+ 3.00 MB per decoder)
whisper_model_load: kv self size = 2.62 MB
whisper_model_load: kv cross size = 8.79 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 73.58 MB
whisper_model_load: model size = 73.54 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing '..\file.wav' (63793110 samples, 3987.1 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:04.000] [speaking in foreign language]
[00:00:04.000 --> 00:00:08.000] [speaking in foreign language]
^C
This happens with the Windows version only.
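The `??` in the error messages suggests the file name had already lost its non-ASCII characters before the program tried to open it. For reference, here is a minimal sketch of opening a file from a UTF-8 path without going through the ANSI code page. It assumes C++17 and that the path string really is UTF-8; `open_utf8` and `round_trip_check` are hypothetical names for illustration, not whisper.cpp functions, and this is not necessarily how the project handles it.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper: open a file whose name is given as UTF-8 bytes.
// std::filesystem::u8path converts UTF-8 to the native path encoding
// (UTF-16 on Windows), so the name is not squeezed through the ANSI code page.
// Note: u8path is C++17; it is deprecated (but still available) in C++20.
inline std::ifstream open_utf8(const std::string & utf8_path) {
    return std::ifstream(std::filesystem::u8path(utf8_path), std::ios::binary);
}

// Small round trip: create a file with a non-ASCII name in the current
// directory, read it back through open_utf8, then delete it.
inline void round_trip_check() {
    namespace fs = std::filesystem;
    // "utf8_check_中文.tmp" spelled out as UTF-8 bytes.
    const std::string name = "utf8_check_\xE4\xB8\xAD\xE6\x96\x87.tmp";
    {
        std::ofstream out(fs::u8path(name), std::ios::binary);
        out << "RIFF";
    }
    std::ifstream in = open_utf8(name);
    assert(in.is_open());
    std::string magic(4, '\0');
    in.read(&magic[0], 4);
    assert(magic == "RIFF");
    in.close();
    fs::remove(fs::u8path(name));
}
```

This only helps if the string handed to `open_utf8` is still intact UTF-8. If the narrow `argv` has already replaced the characters with `?`, the path has to be recovered from the wide command line first.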
The output in the CMD terminal is correct if you run chcp 65001, but the bug is also present with --output-file (maybe just in parsing argv?):
D:\Tools\WhisperGUI>bin\main.exe --language chinese --model "models\ggml-tiny.bin" --output-file "D:\Lavori\result (中文)" -osrt -ovtt -otxt "D:\Lavori\5min.wav"
whisper_init_from_file: loading model from 'models\ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem required = 127.00 MB (+ 3.00 MB per decoder)
whisper_model_load: kv self size = 2.62 MB
whisper_model_load: kv cross size = 8.79 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 73.58 MB
whisper_model_load: model size = 73.54 MB
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'D:\Lavori\5min.wav' (4709599 samples, 294.3 sec), 4 threads, 1 processors, lang = chinese, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:02.000] 我現在回想
[00:00:02.000 --> 00:00:04.000] 我做信用電話
[00:00:04.000 --> 00:00:06.000] 事大美說是畢業以後
[00:00:06.000 --> 00:00:07.000] 我也錯過
[00:00:07.000 --> 00:00:10.000] 可是那個時候很自然人
- snip -
[00:04:44.080 --> 00:04:47.080] 这么多年来原门酒店天下
[00:04:47.080 --> 00:04:49.080] 加统夫罚但都问
[00:04:49.080 --> 00:04:51.080] 这么漂亮的字是谁写的
[00:04:51.080 --> 00:04:53.080] 是
[00:04:53.080 --> 00:04:54.280] 312歲的
output_txt: failed to open 'D:\Lavori\result (??).txt' for writing
output_vtt: failed to open 'D:\Lavori\result (??).vtt' for writing
output_srt: failed to open 'D:\Lavori\result (??).srt' for writing
whisper_print_timings: fallbacks = 17 p / 33 h
whisper_print_timings: load time = 486.94 ms
whisper_print_timings: mel time = 2188.15 ms
whisper_print_timings: sample time = 15614.09 ms / 7737 runs ( 2.02 ms per run)
whisper_print_timings: encode time = 14130.27 ms / 14 runs ( 1009.31 ms per run)
whisper_print_timings: decode time = 40749.71 ms / 7685 runs ( 5.30 ms per run)
whisper_print_timings: total time = 73315.52 ms
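On the "parsing argv" guess: on Windows the narrow `argv` passed to `main` is converted from the UTF-16 command line using the process ANSI code page, which `chcp` (the console code page) does not change, so characters outside that code page typically arrive as `?`. Below is a hedged sketch of one common workaround: read the wide command line with `GetCommandLineW` and `CommandLineToArgvW` and re-encode each argument as UTF-8. The helper names `utf16_to_utf8`, `utf8_args`, and `encoding_check` are made up for this example and are not part of whisper.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

#ifdef _WIN32
#include <cwchar>      // wcslen
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW
#endif

// Encode UTF-16 as UTF-8. Surrogate pairs are combined; an unpaired
// surrogate is replaced with U+FFFD.
inline std::string utf16_to_utf8(const std::u16string & in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            const bool paired = i + 1 < in.size() &&
                                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
            cp = paired ? 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00)
                        : 0xFFFD;
            if (paired) ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Return the program arguments as UTF-8 strings.
inline std::vector<std::string> utf8_args(int argc, char ** argv) {
    std::vector<std::string> args;
#ifdef _WIN32
    // Bypass the narrow argv and re-read the original UTF-16 command line.
    int wargc = 0;
    LPWSTR * wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv != nullptr) {
        for (int i = 0; i < wargc; ++i) {
            const wchar_t * w = wargv[i];
            args.push_back(utf16_to_utf8(std::u16string(w, w + wcslen(w))));
        }
        LocalFree(wargv);
        return args;
    }
#endif
    // Elsewhere (or if the wide lookup failed) pass argv through unchanged.
    args.assign(argv, argv + argc);
    return args;
}

// Quick check of the encoder on the file name from this thread and on a
// character outside the Basic Multilingual Plane.
inline void encoding_check() {
    assert(utf16_to_utf8(u"file_\u4E2D\u6587.wav") == "file_\xE4\xB8\xAD\xE6\x96\x87.wav");
    assert(utf16_to_utf8(u"\U0001F600") == "\xF0\x9F\x98\x80");
}
```

With arguments in UTF-8, both the input path and the `--output-file` prefix would still need to be opened through a wide-character or `std::filesystem` API on Windows, since a plain narrow `fopen` interprets its argument in the ANSI code page by default.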
Hmm, even with chcp 65001 it is still broken:
shenjack whisper-bin-x64 ➜ ( master) ♥ 23:26 .\main.exe -m .\ggml-small.bin -f .\001.wav -t 12 -ocsv -of 001-small-cn -l auto -pp
whisper_init_from_file_no_state: loading model from '.\ggml-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem required = 608.00 MB (+ 16.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 464.56 MB
whisper_model_load: model size = 464.44 MB
whisper_init_state: kv self size = 15.75 MB
whisper_init_state: kv cross size = 52.73 MB
system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | COREML = 0 |
main: processing '.\001.wav' (870741 samples, 54.4 sec), 12 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...
whisper_full_with_state: auto-detected language: zh (p = 0.964870)
[00:00:00.000 --> 00:00:01.800] 涓夋閮芥槸鍦ㄥ鏍′腑涓
[00:00:01.800 --> 00:00:06.000] 鎵嶆渻鏈変簡灏嶆柤瀛哥敓楂瀷绠$悊鐨勯€欏€嬫浠剁殑鏀
[00:00:06.000 --> 00:00:08.400] 灏辨槸鍦ㄤ笂涓婂眴鍓涚暍妤殑
[00:00:08.400 --> 00:00:10.600] 浠栧€戞妸鐢风敓鐨勯鍨嬫婧
[00:00:10.600 --> 00:00:13.000] 鐢变笁鍏垎瑾垮埌浜嗗叚鍏垎
[00:00:13.000 --> 00:00:16.000] 瑕佷笉鐒跺氨鎴戝啀娆℃斁涓€缍插ソ澶氫汉璁€鏂囬潻
[00:00:16.000 --> 00:00:17.600] 鍗充娇鐝惧湪鏄畝鏂囬潻涔熶笉鍚堟牸
[00:00:17.600 --> 00:00:19.000] 鐒℃剰寰屽洖浜嗙Ξ鎷
[00:00:19.000 --> 00:00:21.000] 閭f垜瀹e竷涓€涓嬮€欏牬姣旇辰绲愭灉
[00:00:21.000 --> 00:00:26.600] 鐛插緱鏈€浣宠畩鎴愮殑鏄鏂逛簩璁婂叏绁
whisper_full_with_state: progress = 5%
whisper_full_with_state: progress = 10%
whisper_full_with_state: progress = 15%
whisper_full_with_state: progress = 20%
whisper_full_with_state: progress = 25%
whisper_full_with_state: progress = 30%
whisper_full_with_state: progress = 35%
whisper_full_with_state: progress = 40%
whisper_full_with_state: progress = 45%
[00:00:26.600 --> 00:00:31.600] (鎺岃伈)
[00:00:31.600 --> 00:00:43.000] 鏈€寰岀嵅鍕濈殑鏄鏂
[00:00:43.000 --> 00:00:50.600] (鎺岃伈)
[00:00:50.600 --> 00:00:52.200] 閭g従鍦ㄦ槸涓嶆槸鏂囬潻
[00:00:52.200 --> 00:00:53.400] 鏂囬潻
[00:00:53.400 --> 00:00:54.400] 閭e挶鍊戞墦
whisper_full_with_state: progress = 50%
whisper_full_with_state: progress = 55%
whisper_full_with_state: progress = 60%
whisper_full_with_state: progress = 65%
whisper_full_with_state: progress = 70%
whisper_full_with_state: progress = 75%
whisper_full_with_state: progress = 80%
whisper_full_with_state: progress = 85%
whisper_full_with_state: progress = 90%
whisper_full_with_state: progress = 95%
output_csv: saving output to '001-small-cn.csv'
whisper_print_timings: load time = 360.42 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 132.76 ms
whisper_print_timings: sample time = 107.02 ms / 186 runs ( 0.58 ms per run)
whisper_print_timings: encode time = 8239.76 ms / 3 runs ( 2746.59 ms per run)
whisper_print_timings: decode time = 7407.09 ms / 187 runs ( 39.61 ms per run)
whisper_print_timings: total time = 16274.10 ms
shenjack whisper-bin-x64 ➜ ( master) ♥ 23:28 chcp
Active code page: 65001
I'm using the main.exe from the 1.3.0 win-x64 release.
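The garbled transcript above looks like UTF-8 bytes being displayed through a legacy Chinese code page, although that is only a guess from the character pattern. Under that assumption, one thing a program can do on its side is ask the console to treat its output as UTF-8 at startup, rather than relying on the user running `chcp`. A minimal sketch, where `use_utf8_console_output` is a hypothetical helper name and not an existing whisper.cpp function:

```cpp
#include <cassert>
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switch the attached Windows console to UTF-8 output.
// Returns true on success. On other platforms it does nothing and returns
// true, on the assumption that the terminal already expects UTF-8.
inline bool use_utf8_console_output() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// Usage sketch: call once before the first line of output is printed.
inline void init_console() {
    if (!use_utf8_console_output()) {
        std::fprintf(stderr, "warning: could not switch console output to UTF-8\n");
    }
}
```

This may not help when the output is piped or captured by a shell that decodes it with its own encoding setting, which could be what is happening here given that `chcp` already reports 65001.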
Using the base model, I get the same issue.
#1151