
[WIP] Implement configurable duplicate-text removal methods

Copilot opened this issue 3 months ago • 0 comments

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.

Original prompt

Goal

Implement two configurable duplicate-text removal methods for the streaming example (examples/stream/stream.cpp):

  • token-level dedupe (preferred): use whisper segment token ids to detect overlap between previously printed output and new segments, and avoid printing duplicate tokens
  • character-level dedupe (fallback / simple): normalized longest suffix/prefix match against the already printed text

Expose a command-line option to choose the dedupe mode and tuning parameters.

Requirements

  1. Add configuration options to whisper_params (examples/stream/stream.cpp):

    • std::string dedupe = "token"; // options: "none", "char", "token"
    • int32_t min_token_overlap = 3; // minimum matched tokens to consider trimming
    • int32_t min_char_overlap = 3; // minimum matched characters to consider trimming
    • int32_t dedupe_history_chars = 4096; // history size for character history
  2. Add new CLI flags parsing and help text: --dedupe {none,char,token} --min-token-overlap N --min-char-overlap N --dedupe-history-chars N
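The new fields and flag parsing from steps 1 and 2 could be sketched as below. The `dedupe_params` struct and `parse_dedupe_flags` helper are illustrative names, not the example's real `whisper_params` or its parsing loop; in stream.cpp the fields would live inside `whisper_params` and the branches inside the existing argument loop:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical subset of whisper_params holding only the new dedupe options.
struct dedupe_params {
    std::string dedupe               = "token"; // "none", "char", "token"
    int32_t     min_token_overlap    = 3;       // min matched tokens before trimming
    int32_t     min_char_overlap     = 3;       // min matched characters before trimming
    int32_t     dedupe_history_chars = 4096;    // cap on the character history
};

// Illustrative flag parsing in the style of the example's existing loop.
// Returns false on an unknown flag or a missing/invalid value.
bool parse_dedupe_flags(const std::vector<std::string> & args, dedupe_params & p) {
    for (size_t i = 0; i < args.size(); ++i) {
        const std::string & a = args[i];
        auto next = [&](int32_t & out) {
            if (i + 1 >= args.size()) return false;
            out = std::stoi(args[++i]);
            return true;
        };
        if (a == "--dedupe") {
            if (i + 1 >= args.size()) return false;
            p.dedupe = args[++i];
            if (p.dedupe != "none" && p.dedupe != "char" && p.dedupe != "token") return false;
        } else if (a == "--min-token-overlap")  { if (!next(p.min_token_overlap))    return false; }
        else if (a == "--min-char-overlap")     { if (!next(p.min_char_overlap))     return false; }
        else if (a == "--dedupe-history-chars") { if (!next(p.dedupe_history_chars)) return false; }
        else return false;
    }
    return true;
}
```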

  3. Implement both dedupe methods inside examples/stream/stream.cpp printing logic:

    • Keep two histories: a) last_printed_text (string, normalized/capped) used for char-mode and as fallback b) last_printed_tokens (vector) used for token-mode

    • Character-level dedupe (existing approach): normalize text (lowercase, collapse leading whitespace), find longest suffix of history that equals prefix of new segment, require min_char_overlap and prefer word/boundary checks. Trim prefix when match found and append remainder to history.

    • Token-level dedupe:

      • For each new segment, gather token ids using whisper_full_get_token_id(ctx, seg_index, token_index) and whisper_full_n_tokens(ctx, seg_index) (these APIs already used elsewhere in the example).
      • Compute the longest suffix of last_printed_tokens that matches a prefix of new segment tokens.
      • If matched token_count >= min_token_overlap:
        • If token_count >= new_segment_token_count: skip printing the entire segment (fully duplicated)
        • Else: attempt to remove the first token_count tokens from the segment output.
      • To remove token_count tokens from the human-readable segment text, use one of two approaches (prefer A, fall back to B):
        • A) If whisper_full_get_token_text(ctx, seg_idx, token_idx) is available in the repository, use it to compute the character length of the first token_count tokens and remove that prefix from the segment text.
        • B) Otherwise, fall back to character-level trimming: apply the character-level longest-overlap logic between last_printed_text and the segment text and trim by characters. This keeps behavior safe when token text extraction is not available.
    • After printing the (possibly trimmed) segment output, append the printed token ids (only those that were actually printed) to last_printed_tokens and cap the token history to a reasonable value (e.g. 2048 tokens). Also cap last_printed_text to dedupe_history_chars.

  4. Default behavior: dedupe mode default should be "token" (if token-mode is unavailable at runtime it should fallback to "char"). Provide an option to disable dedupe via --dedupe none.

  5. Keep behavior for timestamps and output file writing consistent with trimming.

  6. Keep the code self-contained to examples/stream/stream.cpp and avoid changing other files. Add helper functions inside stream.cpp.

  7. Add clear comments and logging for what dedupe mode is used and when fallback happens.

Testing and validation

  • The PR should compile with the repository's current build (examples target). The implementation should be conservative (never cut a word in the middle) and avoid false positives by enforcing the minimum overlap thresholds.
  • Behavior: with default parameters (dedupe=token, min_token_overlap=3, min_char_overlap=3) streaming output should no longer repeatedly print the same words on overlap boundaries.

Deliverables

  • A PR modifying examples/stream/stream.cpp implementing the described CLI flags and both dedupe methods, defaults, and robust fallbacks.
  • Update the usage help printed by whisper_print_usage to include the new flags and defaults.

Notes / Assumptions

  • Use existing whisper_full APIs available in examples/stream (whisper_full_get_token_id, whisper_full_n_tokens, whisper_full_get_segment_text, whisper_full_n_segments, etc.). If a helper to extract a token's text exists (e.g. whisper_full_get_token_text or whisper_token_to_str), prefer that for precise trimming; otherwise use safe character-level fallback for partial trims.
  • Do not change other example behavior unrelated to dedupe.
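For the precise trim preferred in the note above, the byte length of the first matched tokens could be summed from their texts. In the real code those texts would come from whisper_full_get_token_text; here they are passed in as plain strings, so this is only a sketch of the arithmetic:

```cpp
#include <string>
#include <vector>

// Approach A from the requirements: given the texts of a segment's tokens
// (as whisper_full_get_token_text would return them), compute how many
// bytes the first `n_tokens` tokens occupy, then drop that prefix from
// the segment text. Working in whole tokens means no multi-byte character
// is split, as long as token boundaries align with the segment text.
static std::string trim_by_token_texts(const std::string & segment_text,
                                       const std::vector<std::string> & token_texts,
                                       int n_tokens) {
    size_t prefix_len = 0;
    for (int i = 0; i < n_tokens && i < (int) token_texts.size(); ++i) {
        prefix_len += token_texts[i].size();
    }
    if (prefix_len >= segment_text.size()) {
        return ""; // fully duplicated segment: nothing left to print
    }
    return segment_text.substr(prefix_len);
}
```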

Please create a PR that implements this change. Make the patch detailed and include comments in the code explaining the dedupe strategies and the fallback behavior when token-level text extraction isn't available.



Copilot • Nov 18 '25 10:11