
[WIP] Implement configurable duplicate-text removal methods

Copilot opened this issue 3 months ago • 0 comments

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.

Original prompt

Goal

Implement two configurable duplicate-text removal methods for the streaming example (examples/stream/stream.cpp):

  • token-level dedupe (preferred): use whisper segment token ids to detect overlap between previously printed output and new segments, and avoid printing duplicate tokens
  • character-level dedupe (fallback / simple): normalized longest suffix/prefix match against the already printed text

Expose a command-line option to choose the dedupe mode and tuning parameters.

Requirements

  1. Add configuration options to whisper_params (examples/stream/stream.cpp):

    • std::string dedupe = "token"; // options: "none", "char", "token"
    • int32_t min_token_overlap = 3; // minimum matched tokens to consider trimming
    • int32_t min_char_overlap = 3; // minimum matched characters to consider trimming
    • int32_t dedupe_history_chars = 4096; // history size for character history
  2. Add new CLI flags parsing and help text: --dedupe {none,char,token} --min-token-overlap N --min-char-overlap N --dedupe-history-chars N
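The new fields and flag parsing from steps 1 and 2 could be sketched as below. The `dedupe_params` struct and `parse_dedupe_flags` helper are illustrative names, not the example's real `whisper_params` or its parsing loop; in stream.cpp the fields would live inside `whisper_params` and the branches inside the existing argument loop:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical subset of whisper_params holding only the new dedupe options.
struct dedupe_params {
    std::string dedupe               = "token"; // "none", "char", "token"
    int32_t     min_token_overlap    = 3;       // min matched tokens before trimming
    int32_t     min_char_overlap     = 3;       // min matched characters before trimming
    int32_t     dedupe_history_chars = 4096;    // cap on the character history
};

// Illustrative flag parsing in the style of the example's existing loop.
// Returns false on an unknown flag or a missing/invalid value.
bool parse_dedupe_flags(const std::vector<std::string> & args, dedupe_params & p) {
    for (size_t i = 0; i < args.size(); ++i) {
        const std::string & a = args[i];
        auto next = [&](int32_t & out) {
            if (i + 1 >= args.size()) return false;
            out = std::stoi(args[++i]);
            return true;
        };
        if (a == "--dedupe") {
            if (i + 1 >= args.size()) return false;
            p.dedupe = args[++i];
            if (p.dedupe != "none" && p.dedupe != "char" && p.dedupe != "token") return false;
        } else if (a == "--min-token-overlap")  { if (!next(p.min_token_overlap))    return false; }
        else if (a == "--min-char-overlap")     { if (!next(p.min_char_overlap))     return false; }
        else if (a == "--dedupe-history-chars") { if (!next(p.dedupe_history_chars)) return false; }
        else return false;
    }
    return true;
}
```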

  3. Implement both dedupe methods inside examples/stream/stream.cpp printing logic:

    • Keep two histories: a) last_printed_text (string, normalized/capped) used for char-mode and as fallback b) last_printed_tokens (vector) used for token-mode

    • Character-level dedupe (existing approach): normalize text (lowercase, collapse leading whitespace), find longest suffix of history that equals prefix of new segment, require min_char_overlap and prefer word/boundary checks. Trim prefix when match found and append remainder to history.

    • Token-level dedupe:

      • For each new segment, gather token ids using whisper_full_get_token_id(ctx, seg_index, token_index) and whisper_full_n_tokens(ctx, seg_index) (these APIs already used elsewhere in the example).
      • Compute the longest suffix of last_printed_tokens that matches a prefix of new segment tokens.
      • If matched token_count >= min_token_overlap:
        • If token_count >= new_segment_token_count: skip printing the entire segment (fully duplicated)
        • Else: attempt to remove the first token_count tokens from the segment output.
      • To remove token_count tokens from the human-readable segment text, use one of two approaches (prefer A, fall back to B):
        • A) If whisper_full_get_token_text(ctx, seg_idx, token_idx) is available in the repository, use it to compute the character length of the first token_count tokens and remove that prefix from the segment text.
        • B) Otherwise, fall back to character-level trimming: apply the character-level longest-overlap logic between last_printed_text and the segment text and trim by characters. This keeps behavior safe when token text extraction is not available.
    • After printing the (possibly trimmed) segment output, append the printed token ids (only those that were actually printed) to last_printed_tokens and cap the token history to a reasonable value (e.g. 2048 tokens). Also cap last_printed_text to dedupe_history_chars.

  4. Default behavior: dedupe mode default should be "token" (if token-mode is unavailable at runtime it should fallback to "char"). Provide an option to disable dedupe via --dedupe none.

  5. Keep behavior for timestamps and output file writing consistent with trimming.

  6. Keep the code self-contained to examples/stream/stream.cpp and avoid changing other files. Add helper functions inside stream.cpp.

  7. Add clear comments and logging for what dedupe mode is used and when fallback happens.

Testing and validation

  • The PR should compile with the repository's current build (examples target). The implementation should be conservative (never cut a word in the middle) and avoid false positives by enforcing the minimum overlap thresholds.
  • Behavior: with default parameters (dedupe=token, min_token_overlap=3, min_char_overlap=3) streaming output should no longer repeatedly print the same words on overlap boundaries.

Deliverables

  • A PR modifying examples/stream/stream.cpp implementing the described CLI flags and both dedupe methods, defaults, and robust fallbacks.
  • Update the usage help printed by whisper_print_usage to include the new flags and defaults.

Notes / Assumptions

  • Use existing whisper_full APIs available in examples/stream (whisper_full_get_token_id, whisper_full_n_tokens, whisper_full_get_segment_text, whisper_full_n_segments, etc.). If a helper to extract a token's text exists (e.g. whisper_full_get_token_text or whisper_token_to_str), prefer that for precise trimming; otherwise use safe character-level fallback for partial trims.
  • Do not change other example behavior unrelated to dedupe.
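For the precise trim preferred in the note above, the byte length of the first matched tokens could be summed from their texts. In the real code those texts would come from whisper_full_get_token_text; here they are passed in as plain strings, so this is only a sketch of the arithmetic:

```cpp
#include <string>
#include <vector>

// Approach A from the requirements: given the texts of a segment's tokens
// (as whisper_full_get_token_text would return them), compute how many
// bytes the first `n_tokens` tokens occupy, then drop that prefix from
// the segment text. Working in whole tokens means no multi-byte character
// is split, as long as token boundaries align with the segment text.
static std::string trim_by_token_texts(const std::string & segment_text,
                                       const std::vector<std::string> & token_texts,
                                       int n_tokens) {
    size_t prefix_len = 0;
    for (int i = 0; i < n_tokens && i < (int) token_texts.size(); ++i) {
        prefix_len += token_texts[i].size();
    }
    if (prefix_len >= segment_text.size()) {
        return ""; // fully duplicated segment: nothing left to print
    }
    return segment_text.substr(prefix_len);
}
```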

Please create a PR that implements this change. Make the patch detailed and include comments in the code explaining the dedupe strategies and the fallback behavior when token-level text extraction isn't available.



Copilot • Nov 18 '25 10:11