Change default settings for "true|false" - Proper incantation to --output-txt with timestamps

Open chris-english opened this issue 2 years ago • 3 comments

Consider this command, likely wrong in many ways:

~/whisper.cpp/main -m ~/whisper.cpp/models/ggml-base.en.bin -f track1.wav -ml 1 -nt false --output-txt trk1.txt

whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'track1.wav' (68267 samples, 4.3 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 0 ...

Apple, Penny, Table

output_txt: saving output to 'track1.wav.txt'

cat track1.wav.txt
Apple, Penny, Table

whereas:

~/whisper.cpp/main -m ~/whisper.cpp/models/ggml-base.en.bin -f track5.wav -ml 1

whisper_model_load: loading model from '/home/chris/whisper.cpp/models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'track5.wav' (437589 samples, 27.3 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.700]
[00:00:00.700 --> 00:00:01.610] Apple
[00:00:01.610 --> 00:00:02.220] penny
[00:00:02.220 --> 00:00:03.010] table
[00:00:03.010 --> 00:00:04.310] .
[00:00:04.310 --> 00:00:05.230] Apple
[00:00:05.230 --> 00:00:06.150] penny
[00:00:06.150 --> 00:00:07.480] table
[00:00:07.480 --> 00:00:08.170] .
[00:00:08.170 --> 00:00:10.140] Apple
[00:00:10.140 --> 00:00:12.020] penny
[00:00:12.020 --> 00:00:13.980] table
[00:00:13.980 --> 00:00:14.830] .
[00:00:14.830 --> 00:00:15.620] Apple
[00:00:15.620 --> 00:00:16.050] penny
[00:00:16.050 --> 00:00:17.340] table
[00:00:17.340 --> 00:00:17.440] .
[00:00:17.440 --> 00:00:18.490] Apple
[00:00:18.490 --> 00:00:19.390] penny
[00:00:19.390 --> 00:00:19.870] table
[00:00:19.870 --> 00:00:20.000] .
[00:00:20.000 --> 00:00:20.000]
[00:00:20.000 --> 00:00:22.650] Apple
[00:00:22.650 --> 00:00:23.340] penny
[00:00:23.340 --> 00:00:24.980] table
[00:00:24.980 --> 00:00:26.000] .
[00:00:26.000 --> 00:00:26.000] Apple
[00:00:26.000 --> 00:00:26.000] penny
[00:00:26.000 --> 00:00:26.000] table
[00:00:26.000 --> 00:00:26.000] .
[00:00:26.000 --> 00:00:26.000]
[00:00:26.000 --> 00:00:27.340] Apple
[00:00:27.340 --> 00:00:27.340] penny
[00:00:27.340 --> 00:00:27.340] table
[00:00:27.340 --> 00:00:32.000] .

whisper_print_timings: load time = 1024.15 ms
whisper_print_timings: mel time = 772.90 ms
whisper_print_timings: sample time = 46.30 ms
whisper_print_timings: encode time = 749550.56 ms / 124925.09 ms per layer
whisper_print_timings: decode time = 34917.91 ms / 5819.65 ms per layer
whisper_print_timings: total time = 786430.44 ms
chris@jacie:~/MMSE_audio$

How can I get both timestamps and text with --output-txt?

Thanks,

chris-english avatar Dec 01 '22 20:12 chris-english

Try:

~/whisper.cpp/main -m ~/whisper.cpp/models/ggml-base.en.bin -f track1.wav -ml 1 -otxt

Text will be in track1.wav.txt

ggerganov avatar Dec 04 '22 07:12 ggerganov

I should have asked, more generally, "How do I change default settings for true|false options?", since -h gives no guidance on this (except for numeric options, where the value is the number that follows the flag). In this case I wish to override -nt true to -nt false: I want the timestamps, which I can get in .vtt but would like to get in -otxt. A mention of how to achieve this might make its way into the help. Sorry for being unclear previously.

chris-english avatar Dec 04 '22 21:12 chris-english

The default value of -nt is false:

https://github.com/ggerganov/whisper.cpp/blob/363a2dadec242723206e33e67629e691b556c6c5/examples/main/main.cpp#L71

If you add the argument -nt then its value becomes true.

Note that -nt false, -nt true, -nt 0, -nt 1 are all invalid. You have to use only -nt by itself and this will make it true. If you don't use it, then it is false.
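To illustrate the presence-only behaviour described above, here is a minimal C++ sketch of how a switch such as -nt is typically parsed next to a value-taking option such as -ml. It is not the actual code from main.cpp: the `Params` struct, the `parse_args` function, and the choice to reject unknown tokens are assumptions made for illustration, and the real program may treat a stray token such as `false` differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical parameter struct for illustration; not the whisper.cpp definition.
struct Params {
    bool no_timestamps = false; // stays false unless -nt is present
    bool output_txt    = false; // stays false unless -otxt / --output-txt is present
    int  max_len       = 0;     // value-taking option: -ml N
};

// Parses the arguments that follow the program name.
// Returns false on an unknown token or a missing/invalid value.
bool parse_args(const std::vector<std::string>& args, Params& p) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        if (a == "-nt") {
            // A switch: its presence alone sets the value; nothing is consumed after it.
            p.no_timestamps = true;
        } else if (a == "-otxt" || a == "--output-txt") {
            p.output_txt = true;
        } else if (a == "-ml") {
            // A value option: consumes the next token as its number.
            if (i + 1 >= args.size()) return false;
            try {
                p.max_len = std::stoi(args[++i]);
            } catch (...) {
                return false;
            }
        } else {
            // In this sketch, the "false" in "-nt false" ends up here.
            return false;
        }
    }
    return true;
}
```

With this sketch, the arguments `-ml 1 -otxt` leave `no_timestamps` at its default of false, so timestamps stay enabled, while `-nt false` sets `no_timestamps` to true and then fails on the unrecognized token `false`.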

ggerganov avatar Dec 05 '22 15:12 ggerganov

This is all finally clear. Thank you and sorry to trouble.

chris-english avatar Dec 06 '22 14:12 chris-english

... I want the time stamps, which I can get in .vtt, but would like to get in -otxt. Mention of how to achieve this might make it's way help. Sorry for being previously unclear.

@chris-english how are you getting timestamps in the .txt output? I'm only able to get them in the other two styles.

resistor4u avatar Mar 13 '23 16:03 resistor4u
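One possible workaround, since the thread notes that timestamps are available in the .vtt output, is to post-process that file into timestamped plain text. The sketch below is not a whisper.cpp feature. It assumes a simple WebVTT layout (a `WEBVTT` header, then cues made of a `start --> end` timing line followed by text lines and a blank line), and the function name `vtt_to_timestamped_text` is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Converts simple WebVTT input into lines of the form "[start --> end] text".
// Assumes each cue is a timing line containing "-->" followed by one or more
// text lines and terminated by a blank line or end of input.
std::string vtt_to_timestamped_text(std::istream& in) {
    std::ostringstream out;
    std::string line;
    std::string cue_timing; // timing line of the cue currently being read

    while (std::getline(in, line)) {
        // Tolerate CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();

        if (line.find("-->") != std::string::npos) {
            cue_timing = line;          // start of a new cue
        } else if (line.empty()) {
            cue_timing.clear();         // blank line ends the cue
        } else if (!cue_timing.empty()) {
            out << '[' << cue_timing << "] " << line << '\n';
        }
        // Lines outside any cue (such as the "WEBVTT" header) are skipped.
    }
    return out.str();
}
```

To use it, open the .vtt file with `std::ifstream`, pass the stream to the function, and write the returned string to a .txt file. Any cue settings that follow the end time on a timing line would be copied into the brackets unchanged.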