whisper.cpp UTF8 issue with command line parameters in Windows version

If I pass the file "Chinese file (中文).mp3" to the Windows command-line version, it exits with an error:

rem Here main.exe has been renamed to whisper.exe
C:\...\whisp>whisper.exe --model models\ggml-tiny.bin --language chinese "Chinese file (中文).mp3"
whisper_init_from_file: loading model from 'models\ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: kv self size  =    2.62 MB
whisper_model_load: kv cross size =    8.79 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
error: failed to open 'Chinese file (??).mp3' as WAV file
error: failed to read WAV file 'Chinese file (??).mp3'

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   398.52 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   399.54 ms

It runs fine when I rename the file to omit the Chinese characters. I've also tried setting the code page to UTF-8 with chcp 65001, with no luck.

(The macOS version works fine.)

Mar 02 '23 12:03 bilo1967

whisper.cpp does not support .mp3 files. The input has to be 16 kHz WAV

Mar 06 '23 17:03 ggerganov

whisper.cpp does not support .mp3 files. The input has to be 16 kHz WAV

Yes, apologies, I copied the wrong output example.

This one is with a 16 kHz WAV file.

IMHO the problem is the Unicode characters in the file name:

D:\Tools\WhisperGUI>bin\whisper.exe --model models\ggml-tiny.en.bin --language en ..\file_中文.wav
whisper_init_from_file: loading model from 'models\ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: kv self size  =    2.62 MB
whisper_model_load: kv cross size =    8.79 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
error: failed to open '..\file_??.wav' as WAV file
error: failed to read WAV file '..\file_??.wav'

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   494.26 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   495.03 ms

D:\Tools\WhisperGUI>copy ..\file_中文.wav ..\file.wav
        1 file(s) copied.

D:\Tools\WhisperGUI>bin\whisper.exe --model models\ggml-tiny.en.bin --language en ..\file.wav
whisper_init_from_file: loading model from 'models\ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: kv self size  =    2.62 MB
whisper_model_load: kv cross size =    8.79 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing '..\file.wav' (63793110 samples, 3987.1 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:04.000]   [speaking in foreign language]
[00:00:04.000 --> 00:00:08.000]   [speaking in foreign language]

^C

This happens with the Windows version only.
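
The copy-to-an-ASCII-name workaround shown in the session above can be scripted. The following is only a minimal sketch, not part of whisper.cpp: the helper names `ascii_safe_copy` and `build_command` are hypothetical, and the only command-line flags used are the ones that already appear in this thread (`--model`, `--language`). It also assumes the temporary directory path itself is ASCII-only, which is not guaranteed on every machine.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional


def ascii_safe_copy(src: str, work_dir: Optional[str] = None) -> Path:
    """Copy `src` to a new file whose *name* contains only ASCII characters.

    Hypothetical helper: the parent directory comes from tempfile.mkdtemp,
    so the full path is ASCII only if that location is ASCII too.
    """
    src_path = Path(src)
    # A fresh directory means the fixed file name below cannot collide.
    tmp_dir = Path(tempfile.mkdtemp(prefix="whisper_", dir=work_dir))
    # Keep the extension only when it is plain ASCII (e.g. ".wav").
    suffix = src_path.suffix if src_path.suffix.isascii() else ""
    dst = tmp_dir / ("input" + suffix)
    shutil.copyfile(src_path, dst)
    return dst


def build_command(exe: str, model: str, language: str, wav: Path) -> List[str]:
    """Assemble an argument list using only flags seen in this thread."""
    return [exe, "--model", model, "--language", language, str(wav)]
```

The resulting list could be handed to `subprocess.run`; the transcript would then have to be renamed back to the original, non-ASCII name afterwards.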

Mar 06 '23 20:03 bilo1967

With chcp 65001 the transcription printed in the CMD terminal is correct, but the bug (maybe just in parsing argv?) also affects --output-file:

D:\Tools\WhisperGUI>bin\main.exe  --language chinese --model "models\ggml-tiny.bin" --output-file "D:\Lavori\result (中文)" -osrt -ovtt -otxt "D:\Lavori\5min.wav" 
whisper_init_from_file: loading model from 'models\ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: kv self size  =    2.62 MB
whisper_model_load: kv cross size =    8.79 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'D:\Lavori\5min.wav' (4709599 samples, 294.3 sec), 4 threads, 1 processors, lang = chinese, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:02.000]  我現在回想
[00:00:02.000 --> 00:00:04.000]  我做信用電話
[00:00:04.000 --> 00:00:06.000]  事大美說是畢業以後
[00:00:06.000 --> 00:00:07.000]  我也錯過
[00:00:07.000 --> 00:00:10.000]  可是那個時候很自然人
- snip -
[00:04:44.080 --> 00:04:47.080]  这么多年来原门酒店天下
[00:04:47.080 --> 00:04:49.080]  加统夫罚但都问
[00:04:49.080 --> 00:04:51.080]  这么漂亮的字是谁写的
[00:04:51.080 --> 00:04:53.080]  是
[00:04:53.080 --> 00:04:54.280]  312歲的

output_txt: failed to open 'D:\Lavori\result (??).txt' for writing
output_vtt: failed to open 'D:\Lavori\result (??).vtt' for writing
output_srt: failed to open 'D:\Lavori\result (??).srt' for writing

whisper_print_timings:     fallbacks =  17 p /  33 h
whisper_print_timings:     load time =   486.94 ms
whisper_print_timings:      mel time =  2188.15 ms
whisper_print_timings:   sample time = 15614.09 ms /  7737 runs (    2.02 ms per run)
whisper_print_timings:   encode time = 14130.27 ms /    14 runs ( 1009.31 ms per run)
whisper_print_timings:   decode time = 40749.71 ms /  7685 runs (    5.30 ms per run)
whisper_print_timings:    total time = 73315.52 ms
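
One plausible explanation for the `??` in these error messages, offered here as an assumption rather than a confirmed diagnosis: a Windows program with a narrow `main(argc, argv)` receives its arguments converted to the process's ANSI code page, and `chcp` changes only the console code page. Characters that the ANSI code page lacks are replaced with `?`. The sketch below imitates that lossy conversion in Python; `narrow_argv_view` is a hypothetical name, `cp1252` is an assumed code page (Western European Windows), and Windows' own conversion can differ in detail because it applies best-fit mappings.

```python
def narrow_argv_view(arg: str, ansi_codepage: str = "cp1252") -> str:
    """Approximate how `arg` looks after a lossy trip through a legacy code page."""
    # errors="replace" substitutes "?" for each character the code page lacks,
    # which matches the '??' seen in the error messages in this thread.
    return arg.encode(ansi_codepage, errors="replace").decode(ansi_codepage)


if __name__ == "__main__":
    for name in ["..\\file_中文.wav", "D:\\Lavori\\result (中文).txt", "..\\file.wav"]:
        print(f"{name!r} -> {narrow_argv_view(name)!r}")
```

If this is what happens, the file name is already damaged before the program tries to open it, which would explain why both reading the input and writing `--output-file` fail while ASCII-only names work.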

Mar 13 '23 11:03 bilo1967

Hmm, even with chcp 65001 the output is still garbled:

shenjack  whisper-bin-x64  ➜ ( master)  ♥ 23:26  .\main.exe -m .\ggml-small.bin -f .\001.wav -t 12 -ocsv -of 001-small-cn -l auto -pp
whisper_init_from_file_no_state: loading model from '.\ggml-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem required  =  608.00 MB (+   16.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.56 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB

system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | COREML = 0 |

main: processing '.\001.wav' (870741 samples, 54.4 sec), 12 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: zh (p = 0.964870)

[00:00:00.000 --> 00:00:01.800]  涓夋閮芥槸鍦ㄥ鏍′腑涓
[00:00:01.800 --> 00:00:06.000]  鎵嶆渻鏈変簡灏嶆柤瀛哥敓楂瀷绠＄悊鐨勯€欏€嬫浠剁殑鏀
[00:00:06.000 --> 00:00:08.400]  灏辨槸鍦ㄤ笂涓婂眴鍓涚暍妤殑
[00:00:08.400 --> 00:00:10.600]  浠栧€戞妸鐢风敓鐨勯鍨嬫婧
[00:00:10.600 --> 00:00:13.000]  鐢变笁鍏垎瑾垮埌浜嗗叚鍏垎
[00:00:13.000 --> 00:00:16.000]  瑕佷笉鐒跺氨鎴戝啀娆℃斁涓€缍插ソ澶氫汉璁€鏂囬潻
[00:00:16.000 --> 00:00:17.600]  鍗充娇鐝惧湪鏄畝鏂囬潻涔熶笉鍚堟牸
[00:00:17.600 --> 00:00:19.000]  鐒℃剰寰屽洖浜嗙Ξ鎷
[00:00:19.000 --> 00:00:21.000]  閭ｆ垜瀹ｅ竷涓€涓嬮€欏牬姣旇辰绲愭灉
[00:00:21.000 --> 00:00:26.600]  鐛插緱鏈€浣宠畩鎴愮殑鏄鏂逛簩璁婂叏绁
whisper_full_with_state: progress =   5%
whisper_full_with_state: progress =  10%
whisper_full_with_state: progress =  15%
whisper_full_with_state: progress =  20%
whisper_full_with_state: progress =  25%
whisper_full_with_state: progress =  30%
whisper_full_with_state: progress =  35%
whisper_full_with_state: progress =  40%
whisper_full_with_state: progress =  45%
[00:00:26.600 --> 00:00:31.600]  (鎺岃伈)
[00:00:31.600 --> 00:00:43.000]  鏈€寰岀嵅鍕濈殑鏄鏂
[00:00:43.000 --> 00:00:50.600]  (鎺岃伈)
[00:00:50.600 --> 00:00:52.200]  閭ｇ従鍦ㄦ槸涓嶆槸鏂囬潻
[00:00:52.200 --> 00:00:53.400]  鏂囬潻
[00:00:53.400 --> 00:00:54.400]  閭ｅ挶鍊戞墦
whisper_full_with_state: progress =  50%
whisper_full_with_state: progress =  55%
whisper_full_with_state: progress =  60%
whisper_full_with_state: progress =  65%
whisper_full_with_state: progress =  70%
whisper_full_with_state: progress =  75%
whisper_full_with_state: progress =  80%
whisper_full_with_state: progress =  85%
whisper_full_with_state: progress =  90%
whisper_full_with_state: progress =  95%

output_csv: saving output to '001-small-cn.csv'

whisper_print_timings:     load time =   360.42 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   132.76 ms
whisper_print_timings:   sample time =   107.02 ms /   186 runs (    0.58 ms per run)
whisper_print_timings:   encode time =  8239.76 ms /     3 runs ( 2746.59 ms per run)
whisper_print_timings:   decode time =  7407.09 ms /   187 runs (   39.61 ms per run)
whisper_print_timings:    total time = 16274.10 ms
shenjack  whisper-bin-x64  ➜ ( master)  ♥ 23:28  chcp
Active code page: 65001
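
The garbled transcript lines above look consistent with UTF-8 bytes being displayed as GBK (code page 936), although the thread does not confirm this. The sketch below reproduces that kind of mojibake and tries to reverse it; `garble` and `ungarble` are hypothetical helper names. Recovery from text copied out of a terminal may be only partial, because bytes that did not form a valid GBK pair are already lost.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then mis-decode the bytes as GBK (simulated mojibake)."""
    return text.encode("utf-8").decode("gbk", errors="replace")


def ungarble(garbled: str) -> str:
    """Best-effort reversal: re-encode as GBK and decode the bytes as UTF-8."""
    return garbled.encode("gbk", errors="replace").decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(garble("文件"))       # a common example of this mojibake pattern
    print(ungarble("鏂囬潻"))   # a short line copied from the log above
```

If `ungarble` turns the logged lines back into readable Chinese, the transcription itself is fine and only the way the terminal decodes the program's output is wrong.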

Apr 18 '23 15:04 shenjackyuanjie

I'm using main.exe from release 1.3.0 (win-x64).

Apr 18 '23 15:04 shenjackyuanjie

I used the base model and got the same issue.

Apr 20 '23 07:04 DoodleBears

#1151

Aug 05 '23 12:08 bobqianic
