
Fix the decoding issues

Open · bobqianic opened this issue 5 months ago • 33 comments

  • [x] Basic functionality
  • [x] Rewrite whisper_wrap_segment
  • [x] Rewrite L5717-L5805
  • [x] ~~Remove print_realtime~~ This is too tricky
  • [x] Remove hallucination by using token_nosp
  • [x] Heuristic hallucination detection (Basic implementation)
  • [x] Disable beam search when $temperature>0$
  • [x] Fix tokenizer
  • [x] Fix audio feature seeking mechanism
  • [x] ~~Use compression ratio instead of entropy~~ Will be addressed in separate PRs
  • [ ] Code cleanup

bobqianic avatar Jan 14 '24 15:01 bobqianic

I wonder why the CJK tokens in the JSON output of openai/whisper aren't split. Is it because openai/whisper does some post-processing of the tokens?

HaujetZhao avatar Jan 24 '24 14:01 HaujetZhao

I wonder why the CJK tokens in the JSON output of openai/whisper aren't split. Is it because openai/whisper does some post-processing of the tokens?

Yes. See this
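
In short, with a byte-level BPE vocabulary a single CJK character is usually split across several tokens whose bytes are not individually valid UTF-8, so the decoded pieces have to be merged back together before they are printable. A minimal sketch of that kind of post-processing (illustrative names, not openai/whisper's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal structural UTF-8 check (ignores overlong encodings).
static bool is_valid_utf8(const std::string & s) {
    for (size_t i = 0; i < s.size(); ) {
        const uint8_t c = (uint8_t) s[i];
        size_t len;
        if      (c < 0x80)         len = 1; // ASCII
        else if ((c >> 5) == 0x6)  len = 2; // 110xxxxx
        else if ((c >> 4) == 0xE)  len = 3; // 1110xxxx
        else if ((c >> 3) == 0x1E) len = 4; // 11110xxx
        else return false;                  // stray continuation byte
        if (i + len > s.size()) return false;
        for (size_t k = 1; k < len; ++k) {
            if (((uint8_t) s[i + k] >> 6) != 0x2) return false; // want 10xxxxxx
        }
        i += len;
    }
    return true;
}

// Merge raw token byte strings until they form whole characters.
static std::vector<std::string> merge_to_valid_utf8(const std::vector<std::string> & tokens) {
    std::vector<std::string> out;
    std::string pending;
    for (const auto & t : tokens) {
        pending += t;
        if (is_valid_utf8(pending)) { // complete character(s) accumulated
            out.push_back(pending);
            pending.clear();
        }
    }
    if (!pending.empty()) out.push_back(pending); // leftover partial bytes
    return out;
}
```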

bobqianic avatar Jan 24 '24 14:01 bobqianic

I'm really busy this week, but I should have some free time next week. I can probably finish and merge this PR into the master branch by the end of next week.

bobqianic avatar Jan 24 '24 14:01 bobqianic

I'm really busy this week, but I should have some free time next week. I can probably finish and merge this PR into the master branch by the end of next week.

Could this PR fix the hallucination or repetition produced by large-v3?

AeneasZhu avatar Jan 24 '24 15:01 AeneasZhu

Could this PR fix the hallucination or repetition produced by large-v3?

While this Heuristic hallucination detection might reduce hallucinations and repetitions to some extent, it's unlikely to eliminate them entirely when using large-v3.

bobqianic avatar Jan 24 '24 15:01 bobqianic

Could this PR fix the hallucination or repetition produced by large-v3?

While this Heuristic hallucination detection might reduce hallucinations and repetitions to some extent, it's unlikely to eliminate them entirely when using large-v3.

Do you have any idea why large-v3 produces much more hallucination than its previous versions? And could there be any solution to this problem by fine-tuning parameters like beam size or max context (mc)?

AeneasZhu avatar Jan 24 '24 15:01 AeneasZhu

Do you have any idea why large-v3 produces much more hallucination than its previous versions? And could there be any solution to this problem by fine-tuning parameters like beam size or max context (mc)?

I'm continuing my investigation into the cause, but so far, I haven't been able to uncover any clues.

See this https://deepgram.com/learn/whisper-v3-results

bobqianic avatar Jan 24 '24 15:01 bobqianic

Due to the need to integrate zlib for calculating the compression ratio, the "Use compression ratio instead of entropy" and "Heuristic hallucination detection" tasks will be addressed in separate PRs. This approach will allow for more thorough discussion. See https://github.com/ggerganov/whisper.cpp/discussions/1461#discussioncomment-8340428

bobqianic avatar Feb 02 '24 21:02 bobqianic

Due to the need to integrate zlib for calculating the compression ratio, the "Use compression ratio instead of entropy" and "Heuristic hallucination detection" tasks will be addressed in separate PRs. This approach will allow for more thorough discussion. See #1461 (reply in thread)

Is this PR almost finished? I can't wait to see the outcome. By the way, shall I just download whisper.cpp from master once this PR is finished? Will you merge it into master?

AeneasZhu avatar Feb 03 '24 03:02 AeneasZhu

Changelog:

whisper.cpp:

  1. The GPT2-style Byte Pair Encoding (BPE) tokenizer has been updated with a custom regex implementation to enhance accuracy, replacing the previous std::regex usage. This change aims to ensure more reliable tokenization, as detailed in a recent GitHub issue.

  2. Improvements have also been made to the audio feature seeking mechanism, aligning it with OpenAI's methodology. Notably, when encountering a single timestamp token at the end of an audio segment, the system now automatically progresses to the next segment boundary. See here.

  3. A new feature, token_nosp, has been introduced to identify silent audio segments. Segments where token_nosp's probability (non_speech_probs) exceeds a certain threshold and the average log probability (avg_logprobs) falls below another threshold are considered silent and skipped, moving directly to the following segment boundary (a sketch of this condition follows the list).

  4. The beam search settings have undergone a thorough review, with no issues identified at this time.

  5. Three new APIs, whisper_full_get_segment_no_speech_probs, whisper_full_get_segment_no_speech_probs_from_state, and whisper_utf8_is_valid, have been introduced.

  6. The whisper_wrap_segment function has been refined to guarantee that each segment contains valid UTF-8 text, even when the maximum length parameter (max_len) is set to 1.

  7. The code related to the segment callback has been rewritten to make it easier to understand.

  8. The split_on_word option has been removed from whisper_full_params because we now always split on words.

  9. The suppress_non_speech_tokens option is now set to true by default.
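
A minimal sketch of the silence-skip condition from point 3 (the helper name is illustrative; the default thresholds below match the no_speech_thold = 0.60f and logprob_thold = -1.00f values that appear in the server diff later in this thread):

```cpp
// Illustrative helper, not the PR's actual code: a decoded segment is treated
// as silence and skipped when the no-speech token is likely AND the decoder
// has low confidence in the text it produced.
static bool is_silent_segment(float no_speech_prob, float avg_logprob,
                              float no_speech_thold = 0.60f,
                              float logprob_thold   = -1.00f) {
    return no_speech_prob > no_speech_thold && avg_logprob < logprob_thold;
}
```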

main.cpp:

  1. We've introduced two new options, --nospeech-thold and --suppress-nst, while removing the --split-on-word option.

  2. By default, we're now enabling the UTF-8 code page on Windows, achieved through the introduction of console.h. This change ensures that CJK and other non-ASCII text is printed correctly in the Windows console.

  3. To address issues with the Windows API, where UTF-8 encoding can lead to corruption of CJK pathnames, we've implemented a solution. Inputs are now accepted as wchar_t UTF-16 strings, which are then converted to char UTF-8 strings on Windows platforms. This approach effectively circumvents the aforementioned problem.

  4. When opening ofstream, we convert UTF-8 char strings to UTF-16 wchar_t strings. This conversion is another step towards resolving compatibility issues with the Windows API, ensuring that CJK pathnames are handled correctly and without errors.
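
To make points 3 and 4 concrete, the UTF-16/UTF-8 round trip can be sketched with the Win32 conversion APIs. This is an illustrative helper, not the PR's exact code:

```cpp
// Illustrative helpers for the UTF-16 <-> UTF-8 round trip described in
// points 3 and 4 above. Windows-only.
#include <string>
#include <windows.h>

static std::string utf16_to_utf8(const std::wstring & ws) {
    if (ws.empty()) return {};
    const int n = WideCharToMultiByte(CP_UTF8, 0, ws.data(), (int) ws.size(),
                                      nullptr, 0, nullptr, nullptr);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.data(), (int) ws.size(),
                        &s[0], n, nullptr, nullptr);
    return s;
}

static std::wstring utf8_to_utf16(const std::string & s) {
    if (s.empty()) return {};
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int) s.size(),
                                      nullptr, 0);
    std::wstring ws(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int) s.size(), &ws[0], n);
    return ws;
}

// With a wide entry point, CJK pathnames survive intact:
//   int wmain(int argc, wchar_t ** argv) {
//       const std::string fname = utf16_to_utf8(argv[1]); // safe to pass around
//       ...
//   }
```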

stream.cpp:

  1. Similar to 2, 3, 4 in main.cpp.

talk.cpp:

  1. Similar to 2, 3 in main.cpp.

talk-llama.cpp:

  1. Similar to 2, 3 in main.cpp.

command.cpp:

  1. Similar to 2, 3 in main.cpp.

common.cpp:

  1. Similar to 2, 4 in main.cpp.

server.cpp:

  1. We've removed the --split-on-word option.

@ggerganov

bobqianic avatar Feb 05 '24 17:02 bobqianic

Test results


Before:

Filename: 01-03(轻松学中文+第二版+课本2).wav Language: zh

C:\Users\qianp\Downloads\whisper-cublas-11.8.0-bin-x64>C:\Users\qianp\Downloads\whisper-cublas-11.8.0-bin-x64\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.49 MB (3 buffers)
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   33.91 MB
whisper_init_state: compute buffer (encode) =  233.50 MB
whisper_init_state: compute buffer (cross)  =   10.15 MB
whisper_init_state: compute buffer (decode) =  108.99 MB
error: failed to open 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(?????+???+??2).wav' as WAV file
error: failed to read WAV file 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(?????+???+??2).wav'

whisper_print_timings:     load time =  6602.47 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6608.45 ms

Filename: chinese.wav Language: zh

C:\Users\qianp\Downloads\whisper-cublas-11.8.0-bin-x64>C:\Users\qianp\Downloads\whisper-cublas-11.8.0-bin-x64\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\chinese.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.49 MB (3 buffers)
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   33.91 MB
whisper_init_state: compute buffer (encode) =  233.50 MB
whisper_init_state: compute buffer (cross)  =   10.15 MB
whisper_init_state: compute buffer (decode) =  108.99 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

main: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\chinese.wav' (713721 samples, 44.6 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: zh (p = 0.994848)

[00:00:00.000 --> 00:00:04.800]  你爷爷奶奶住在哪儿?
[00:00:04.800 --> 00:00:07.800]  他们住在南京。
[00:00:07.800 --> 00:00:12.800]  他们跟我叔叔和婶婶一起住。
[00:00:12.800 --> 00:00:16.800]  你叔叔是哪一年结婚的?
[00:00:16.800 --> 00:00:19.800]  他是去年结婚的。
[00:00:19.800 --> 00:00:23.800]  你叔叔做什么工作?
[00:00:23.800 --> 00:00:25.800]  在哪儿工作?
[00:00:25.800 --> 00:00:27.800]  我叔叔是老师。
[00:00:28.000 --> 00:00:31.800]  他在同名中学工作。
[00:00:31.800 --> 00:00:37.800]  你常常跟你爸爸家的亲戚见面吗?
[00:00:37.800 --> 00:00:40.800]  我常常跟他们见面。
[00:00:40.800 --> 00:00:42.800]  見面
[00:00:42.800 --> 00:00:44.800]  我只想知道你到底是谁啊?


whisper_print_timings:     load time =  4116.39 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =    34.72 ms
whisper_print_timings:   sample time =   334.78 ms /   744 runs (    0.45 ms per run)
whisper_print_timings:   encode time =  1144.36 ms /     5 runs (  228.87 ms per run)
whisper_print_timings:   decode time =   226.45 ms /     5 runs (   45.29 ms per run)
whisper_print_timings:   batchd time =  3613.09 ms /   727 runs (    4.97 ms per run)
whisper_print_timings:   prompt time =   186.71 ms /    88 runs (    2.12 ms per run)
whisper_print_timings:    total time =  9686.97 ms

Filename: chinese.wav Language: zh + chcp 65001

C:\Users\qianp\Downloads\whisper-cublas-11.8.0-bin-x64>C:\Users\qianp\Downloads\whisper-cublas-11.8.0-bin-x64\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\chinese.wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.49 MB (3 buffers)
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   33.91 MB
whisper_init_state: compute buffer (encode) =  233.50 MB
whisper_init_state: compute buffer (cross)  =   10.15 MB
whisper_init_state: compute buffer (decode) =  108.99 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

main: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\chinese.wav' (713721 samples, 44.6 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: zh (p = 0.994848)

[00:00:00.000 --> 00:00:04.800]  你爷爷奶奶住在哪儿?
[00:00:04.800 --> 00:00:07.800]  他们住在南京。
[00:00:07.800 --> 00:00:12.800]  他们跟我叔叔和婶婶一起住。
[00:00:12.800 --> 00:00:16.800]  你叔叔是哪一年结婚的?
[00:00:16.800 --> 00:00:19.800]  他是去年结婚的。
[00:00:19.800 --> 00:00:23.800]  你叔叔做什么工作?
[00:00:23.800 --> 00:00:25.800]  在哪儿工作?
[00:00:25.800 --> 00:00:27.800]  我叔叔是老师。
[00:00:28.000 --> 00:00:31.800]  他在同名中学工作。
[00:00:31.800 --> 00:00:37.800]  你常常跟你爸爸家的亲戚见面吗?
[00:00:37.800 --> 00:00:40.800]  我常常跟他们见面。
[00:00:40.800 --> 00:00:42.800]  見面
[00:00:42.800 --> 00:00:44.800]  我只想知道你到底是谁啊?


whisper_print_timings:     load time =  4031.71 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =    31.19 ms
whisper_print_timings:   sample time =   355.73 ms /   744 runs (    0.48 ms per run)
whisper_print_timings:   encode time =  1148.79 ms /     5 runs (  229.76 ms per run)
whisper_print_timings:   decode time =   227.00 ms /     5 runs (   45.40 ms per run)
whisper_print_timings:   batchd time =  3931.32 ms /   727 runs (    5.41 ms per run)
whisper_print_timings:   prompt time =   190.18 ms /    88 runs (    2.16 ms per run)
whisper_print_timings:    total time =  9930.92 ms

Filename: chinese.wav Language: zh + chcp 65001 + --print-colors

(screenshot)

Filename: chinese.wav Language: zh + chcp 65001 + --print-colors + --max-len 1

(screenshot)

After:

Filename: 01-03(轻松学中文+第二版+课本2).wav Language: zh

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav -l auto
whisper_init_from_file_with_params_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.49 MB (3 buffers)
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   33.91 MB
whisper_init_state: compute buffer (encode) =  233.50 MB
whisper_init_state: compute buffer (cross)  =   10.15 MB
whisper_init_state: compute buffer (decode) =  108.99 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav' (713721 samples, 44.6 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: zh (p = 0.994848)

[00:00:01.000 --> 00:00:05.000]  你爷爷奶奶住在哪儿?
[00:00:05.000 --> 00:00:08.000]  他们住在南京。
[00:00:08.000 --> 00:00:13.000]  他们跟我叔叔和婶婶一起住。
[00:00:13.000 --> 00:00:17.000]  你叔叔是哪一年结婚的?
[00:00:17.000 --> 00:00:20.000]  他是去年结婚的。
[00:00:20.000 --> 00:00:24.000]  你叔叔做什么工作?
[00:00:24.000 --> 00:00:26.000]  在哪儿工作?
[00:00:26.000 --> 00:00:28.000]  我叔叔是老师。
[00:00:28.000 --> 00:00:32.000]  他在同名中学工作。
[00:00:32.000 --> 00:00:38.000]  你常常跟你爸爸家的亲戚见面吗?
[00:00:38.000 --> 00:00:41.000]  我常常跟他们见面。


whisper_print_timings:     load time =  4086.30 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    38.71 ms
whisper_print_timings:   sample time =   284.85 ms /   608 runs (    0.47 ms per run)
whisper_print_timings:   encode time =   692.73 ms /     3 runs (  230.91 ms per run)
whisper_print_timings:   decode time =   167.72 ms /     1 runs (  167.72 ms per run)
whisper_print_timings:   batchd time =  2644.88 ms /   603 runs (    4.39 ms per run)
whisper_print_timings:   prompt time =   195.07 ms /    89 runs (    2.19 ms per run)
whisper_print_timings:    total time =  8294.52 ms

Filename: 01-03(轻松学中文+第二版+课本2).wav Language: zh + chcp 65001 + --print-colors

(screenshot)

Filename: 01-03(轻松学中文+第二版+课本2).wav Language: zh + chcp 65001 + --print-colors + --max-len 1

(screenshot)

bobqianic avatar Feb 05 '24 18:02 bobqianic

Does this change have any positive effect for whisper-v3, or is it still repeating stuff?

ggerganov avatar Feb 06 '24 13:02 ggerganov

Does this change have any positive effect for whisper-v3, or is it still repeating stuff?

Overall, there is a positive effect on all models. However, given the limitations of Whisper v3, any improvements made to it still fall short of expectations. See the analysis provided by Deepgram: Whisper v3 Results.

bobqianic avatar Feb 06 '24 14:02 bobqianic

I think the best approach to completely avoid hallucinations is something similar to using DTW to calculate token timestamps: by comparing these with the cross-attention weights, we can reliably identify anomalies whenever hallucinations occur.

bobqianic avatar Feb 06 '24 15:02 bobqianic

Did you run some tests?

ggerganov avatar Feb 08 '24 15:02 ggerganov

Did you run some tests?

I've done some initial testing, and the results are promising. However, I need a bit more time to conduct a comprehensive analysis. You can already notice the difference by testing a few audio files. Currently, I'm downloading the Common Voice Corpus 15.0, which is over 100GB, so completing the testing will take a little while.

Someone sent me a test file via Discord. Running large-v2 on master generates a lot of duplicated content, but with this PR the output is much better. The file is copyrighted, so I cannot make it public, but you can ask him for it privately.

https://github.com/ggerganov/whisper.cpp/issues/1724#issuecomment-1898989928

bobqianic avatar Feb 08 '24 16:02 bobqianic

@bobqianic I'm very appreciative of this work and very excited to see this branch implemented, but I'm getting some bad results with weird non-speech tokens at the beginning of many files; this problem does not happen on the master branch.

Example 1:

wav file: https://www.dropbox.com/scl/fi/bdz7lx4khunq3kiauyus8/shermer.wav?rlkey=hzy02rkewjb4pwoamp9whch4b&dl=0

Command:

./main -m ggml-largev2.bin -f shermer.wav

Output of master branch @ 434b8f3b (current):

[00:00:00.000 --> 00:00:09.000]  [music]
[00:00:09.000 --> 00:00:12.000]  [applause]
[00:00:12.000 --> 00:00:14.000]  Hey, I am Michael Shermer, the director of the Skeptic Society, ...

Output of this PR @ c0277e3:

[00:00:00.000 --> 00:00:07.000]  Transcriber's Name Reviewer's Name
[00:00:12.340 --> 00:00:14.300]  I am Michael Shermer, the director of the Skeptic Society, ...

Example 2 with translate fr to en:

wav file: https://www.dropbox.com/scl/fi/1go0yxkr10vwhfyxs76vz/french.wav?rlkey=312gc5qmw3r31ovh003410hyb&dl=0

Command:

./main -m ggml-largev2.bin -f french.wav -l fr -tr

Output of master branch @ 434b8f3b (current):

[00:00:00.000 --> 00:00:04.000]  (Music)
[00:00:04.000 --> 00:00:07.000]  (Applause)
[00:00:07.000 --> 00:00:20.000]  I am a champion of France. ...

Output of this PR @ c0277e3:

[00:00:00.000 --> 00:00:17.000]  Translation & subtitling by Quentin Dewaghe Traduction & sous-titrage par Quentin Dewaghe q.dewaghe.com
[00:00:17.000 --> 00:00:20.000]  I'm a champion of France. ...

Any idea why these non-speech tokens like "Transcriber's Name Reviewer's Name" are being output as speech at the beginning? Thanks again.

jettoblack avatar Feb 08 '24 17:02 jettoblack

Any idea why these non-speech tokens like "Transcriber's Name Reviewer's Name" are being output as speech at the beginning? Thanks again.

Thank you for letting me know. It seems the primary issue stems from my having suppressed non-speech tokens, which has resulted in symbols like ( and [ having a zero probability of appearing. While this approach enhances the overall quality, it clearly didn't account for situations like yours, which I hadn't anticipated. As mentioned, I'll conduct further tests and explore ways to address this issue.
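
For context, this kind of suppression is typically implemented by masking logits before sampling; a minimal sketch (illustrative names, not this PR's exact code):

```cpp
// Setting a token's logit to -infinity before the softmax gives it exactly
// zero probability of being sampled.
#include <cmath>
#include <vector>

static void suppress_tokens(std::vector<float> & logits,
                            const std::vector<int> & tokens_to_suppress) {
    for (const int id : tokens_to_suppress) {
        logits[id] = -INFINITY; // exp(-inf) = 0 after softmax
    }
}
```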

bobqianic avatar Feb 08 '24 17:02 bobqianic

@jettoblack I've added a heuristic for detecting repetitive hallucinations, which you can disable via parameters if you prefer. Additionally, I've removed the tokens ( ) [ and ] from the list of tokens to be suppressed, so they will remain unaffected even when suppression mode is enabled.

Output of this PR @ https://github.com/ggerganov/whisper.cpp/pull/1768/commits/476dff454488af4710d4ad5f230fb8a7ac810d6f:

[00:00:00.000 --> 00:00:17.000]  [Music]
[00:00:17.000 --> 00:00:20.000]  I am a champion of France. ...

bobqianic avatar Feb 09 '24 18:02 bobqianic

@bobqianic The repetition heuristic seems to be working well so far. I'm seeing fewer hallucinations on silent intervals. I looked at the code and this is unrelated to the non-speech token changes, right?

I'm not so sure about the non-speech token changes. With your latest commit I see fewer cases of the problem I mentioned above, but it's still happening a lot. One example I got just now in the sg1.wav file I sent you previously on Discord:

[00:57:11.700 --> 00:57:14.700] (c) 2014 University of Georgia College of Agricultural and Environmental Sciences UGA Extension Office of Communications and Creative Services

A hallucination like ♪♪ or repeated text is far less objectionable than someone else's copyright notice or translator's notes, which is what I'm getting a lot of.

This change also removes many useful tokens from the output, like quotation marks and music notes. Using the -nsnst option restores these tokens, but that makes this issue much worse; I've caught many more cases of it in many files, including in the middle of files, not just at the beginning. If these were the only two options I'd leave suppression enabled, but the master branch includes these useful tokens without this hallucination problem.

It might be helpful to compare the output of a branch with the other fixes of this PR excluding the non-speech token changes, or at least have a way to turn those completely off and go back to master branch behavior.

jettoblack avatar Feb 10 '24 03:02 jettoblack

I looked at the code and this is unrelated to the non-speech token changes, right?

Yes. In situations where the model hallucinates with high confidence (avg_log_probs), this non-speech-token approach will not be effective. The heuristic repetition check that I've implemented serves as a workaround for the compression-ratio check, since implementing compression in C++ is challenging without third-party libraries. In the official OpenAI implementation, both the compression-ratio and non-speech-token anti-hallucination mechanisms are used.
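
For reference, OpenAI's compression-ratio check boils down to compressing the decoded text and comparing sizes; repetitive hallucinations compress unusually well. A minimal C++ sketch using zlib (the third-party dependency discussed above; function names are illustrative, not from this PR):

```cpp
#include <string>
#include <vector>
#include <zlib.h>

// Ratio of original size to deflate-compressed size; a high ratio means the
// text is highly repetitive and therefore suspicious.
static double compression_ratio(const std::string & text) {
    if (text.empty()) {
        return 0.0;
    }
    uLongf dst_len = compressBound(text.size());
    std::vector<Bytef> dst(dst_len);
    compress2(dst.data(), &dst_len,
              reinterpret_cast<const Bytef *>(text.data()), text.size(),
              Z_DEFAULT_COMPRESSION);
    return (double) text.size() / (double) dst_len;
}

// OpenAI's reference implementation flags a decoded segment as a likely
// repetition when compression_ratio(text) > 2.4.
```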

[00:57:11.700 --> 00:57:14.700] (c) 2014 University of Georgia College of Agricultural and Environmental Sciences UGA Extension Office of Communications and Creative Services

Which branch are you using? I can't reproduce the hallucinations you mentioned.

large-v2

(screenshot)

bobqianic avatar Feb 10 '24 13:02 bobqianic

Which branch are you using? I can't reproduce the hallucinations you mentioned.

I was using this PR @ 476dff4, unless I did something wrong, but this was on a Mac using the Metal GPU backend, so that could make a difference. I'll retest on CPU and CUDA shortly and let you know.

jettoblack avatar Feb 12 '24 19:02 jettoblack

Hi!

@bobqianic the new version is very robust!

On my test files, the main branch emits 10 hallucinations across 26 WAV files (model ggml-large-v2, Russian language). With this PR it gives only 2 hallucinations. A very fine result!!!

But examples/server doesn't work at all, in both the CPU and CUDA versions. It returns empty text without any errors. I tried patching it to add the new parameters (heuristic and others), but it did not help. With --print-progress it prints progress, but no result text.

It also gives an error on one specific file: 500 Internal Server Error map::at

What can we do to fix it, what do you think?

Run server command:

/usr/src/whisper.cpp-bobqianic/server -m ../../models/ggml-large-v2.bin -l ru --print-progress --print-realtime -nt -nf

whisper_init_from_file_with_params_no_state: loading model from '../../models/ggml-large-v2.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.49 MB (3 buffers)
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   33.91 MB
whisper_init_state: compute buffer (encode) =  233.50 MB
whisper_init_state: compute buffer (cross)  =   10.15 MB
whisper_init_state: compute buffer (decode) =  108.99 MB

whisper server listening at http://127.0.0.1:8080

Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav' (168960 samples, 10.6 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav
Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav' (235200 samples, 14.7 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav

whisper_print_progress_callback: progress = 204%
Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav' (512000 samples, 32.0 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav

whisper_print_progress_callback: progress =  93%

whisper_print_progress_callback: progress = 187%
Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav' (115520 samples, 7.2 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav

whisper_print_progress_callback: progress = 416%
...

Send file command:

curl localhost:8080/inference -H "Content-Type: multipart/form-data" -F file="@${filename}"

git diff result:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index cf0157d..5030e87 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -64,6 +64,7 @@ struct whisper_params {
     float word_thold      =  0.01f;
     float entropy_thold   =  2.40f;
     float logprob_thold   = -1.00f;
+    float no_speech_thold =  0.60f;
     float temperature     =  0.00f;
     float temperature_inc =  0.20f;
 
@@ -78,6 +79,8 @@ struct whisper_params {
     bool print_realtime  = false;
     bool print_progress  = false;
     bool no_timestamps   = false;
+    bool suppress_nst    = true;  // suppress non speech tokens
+    bool heuristic       = true;
     bool use_gpu         = true;
 
     std::string language        = "en";
@@ -183,7 +186,10 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params, serve
         else if (arg == "-wt"   || arg == "--word-thold")      { params.word_thold      = std::stof(argv[++i]); }
         else if (arg == "-et"   || arg == "--entropy-thold")   { params.entropy_thold   = std::stof(argv[++i]); }
         else if (arg == "-lpt"  || arg == "--logprob-thold")   { params.logprob_thold   = std::stof(argv[++i]); }
+        else if (arg == "-nst"  || arg == "--nospeech-thold")  { params.no_speech_thold = std::stof(argv[++i]); }
         // else if (arg == "-su"   || arg == "--speed-up")        { params.speed_up        = true; }
+        else if (arg == "-nsnst"|| arg == "--no-suppress-nst") { params.suppress_nst    = false; }
+        else if (arg == "-nh"   || arg == "--no-heuristic")    { params.heuristic       = false; }
         else if (arg == "-tr"   || arg == "--translate")       { params.translate       = true; }
         else if (arg == "-di"   || arg == "--diarize")         { params.diarize         = true; }
         else if (arg == "-tdrz" || arg == "--tinydiarize")     { params.tinydiarize     = true; }
@@ -726,6 +732,7 @@ int main(int argc, char ** argv) {
             wparams.max_len          = params.max_len == 0 ? 60 : params.max_len;
 
             wparams.speed_up         = params.speed_up;
+            wparams.heuristic        = params.heuristic;
 
             wparams.tdrz_enable      = params.tinydiarize; // [TDRZ]
 
@@ -738,8 +745,11 @@ int main(int argc, char ** argv) {
             wparams.temperature_inc  = params.temperature_inc;
             wparams.entropy_thold    = params.entropy_thold;
             wparams.logprob_thold    = params.logprob_thold;
+            wparams.no_speech_thold  = params.no_speech_thold;
 
             wparams.no_timestamps    = params.no_timestamps;
+            wparams.suppress_non_speech_tokens = params.suppress_nst;
+
             wparams.token_timestamps = !params.no_timestamps && params.response_format == vjson_format;
 
             whisper_print_user_data user_data = { &params, &pcmf32s, 0 };

Thank you!

ukolovda avatar Feb 16 '24 20:02 ukolovda

Hello @ukolovda, I took a look at this yesterday evening. What's missing in server.cpp is what you mentioned:

  • heuristics
  • suppress_nst
  • no_speech_thold

I got an output in the terminal by circumventing the print_realtime flag (instead of using a segment callback). So the model does in fact generate the output string, but for some unknown reason whisper_full_n_segments(ctx) returns 0. I'll check this a bit more tomorrow.

felrock avatar Feb 17 '24 16:02 felrock

I got an output in the terminal by circumventing the print_realtime flag (instead of using a segment callback). So the model does in fact generate the output string, but for some unknown reason whisper_full_n_segments(ctx) returns 0.

Hello, @felrock !

Thank you!

ukolovda avatar Feb 19 '24 12:02 ukolovda

Adding an issue with a zero-filled WAV: https://github.com/ggerganov/whisper.cpp/issues/1881

ukolovda avatar Feb 20 '24 10:02 ukolovda

The file from #1881 (a zero-filled WAV) gives a hallucination in this version too. (The "Продолжение следует..." segment in the output below means "To be continued...")

$ ../whisper.cpp-bobqianic/main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094,86 MB (3 buffers)
whisper_model_load: model size    = 3094,36 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220,20 MB
whisper_init_state: kv cross size =  245,76 MB
whisper_init_state: compute buffer (conv)   =   35,50 MB
whisper_init_state: compute buffer (encode) =  233,50 MB
whisper_init_state: compute buffer (cross)  =   10,15 MB
whisper_init_state: compute buffer (decode) =  108,99 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

run: processing 'samples/zeroes.wav' (19200 samples, 1,2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Продолжение следует...


whisper_print_timings:     load time =   781,61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     4,81 ms
whisper_print_timings:   sample time =    28,10 ms /    79 runs (    0,36 ms per run)
whisper_print_timings:   encode time =   162,31 ms /     1 runs (  162,31 ms per run)
whisper_print_timings:   decode time =     0,00 ms /     1 runs (    0,00 ms per run)
whisper_print_timings:   batchd time =   482,89 ms /    77 runs (    6,27 ms per run)
whisper_print_timings:   prompt time =     0,00 ms /     1 runs (    0,00 ms per run)
whisper_print_timings:    total time =  1502,74 ms

ukolovda avatar Feb 20 '24 13:02 ukolovda

--output-json-full has problems with the output format.

  • Language: Chinese (screenshot)

linmi avatar Feb 21 '24 08:02 linmi

What's the status of this PR? Is it safe to use? I'm experiencing decoding issues: https://github.com/thewh1teagle/vibe/issues/34

thewh1teagle avatar Mar 31 '24 22:03 thewh1teagle

I'm thinking about including this pull request in the R wrapper at audio.whisper. There, the current approach to handling some of the hallucinations is to use the R packages audio.vadwebrtc or audio.vadsilero to detect silences or general non-voiced signals and either

  • instead of looping over different files in the main loop, loop over the detected non-silence sections in the audio.
  • or create a new audio file with only the voiced audio and recompute the timestamps later on by adding what was left out

I haven't looked into this pull request in extreme detail (I've only skimmed the logic that changed in main.cpp and whisper.cpp), but would it already make sense to incorporate it into audio.whisper, or are a lot of changes still to be expected? Or is this pull request going to be split into a BPE change (https://github.com/ggerganov/whisper.cpp/pull/1854) and a change regarding how to handle non-speech?

jwijffels avatar Apr 05 '24 07:04 jwijffels