Questions about the “New API (Under Development)”

Open YeonjunNotFR opened this issue 2 months ago • 2 comments

Hi @QuentinFuxa — first off, thank you so much for maintaining this great open-source project. I’m an individual developer using your library, and I really appreciate the time and care you put into it. 🙏

I’m integrating the new API and want to confirm a few details so I can implement the client/server logic correctly. Could you please clarify the points below?


1) Update & Merge Rules

Q1. segment.text update shape

  • Does each update send only newly confirmed additional text to be appended to what I already have?

  • Or can an update sometimes resend the entire current text for that segment.id (i.e., full replace)?

    • If full replace can occur, should clients treat the latest value as authoritative?

Q2. Multiple entries with the same segment.id within a single update

  • Can the same segment.id appear more than once in the same segments array?

    • If yes:

      • Is the array order the intended processing order?
      • Should clients apply all items for that id in order (accumulate), or prefer only the last occurrence?
      • Can duplicate content be repeated for the same id (and should clients de-duplicate)?

2) Silence Segments

Q3. Delivery shape for silence (speaker == -2)

  • Is silence always delivered as a new segment with a new id appended to the list?
  • Or can silence arrive as an update to an existing id?

Q4. Strength of “end of utterance” signal

  • May I treat silence as a strong end-of-utterance signal, even if buffer.transcription is not empty yet?
  • If recommended, is there a short gate (e.g., ~300 ms) you suggest before committing, to allow late confirmations?

3) Completion / Finalization Signals

Q5. Explicit per-segment completion signal

  • Is there any explicit field/event that means “this segment.id will not receive any more confirmed text”?

    • If yes, which field/value?

Q6. Recommended criteria to decide “final” (can be combined)

  • (a) buffer.transcription == ""
  • (b) Silence segment received (speaker == -2)
  • (c) Idle timeout: What value/range do you recommend (e.g., 600–1200 ms)?
  • (d) Any other explicit field?

4) Duplicates / Retransmissions

Q7. Duplicate updates

  • Can identical content be resent for the same segment.id?

    • If yes, do you recommend id-level de-duplication on the client?

My Intended Handling (please confirm)

  1. I keep a map keyed by segment.id on both client and server.

    • For text: on each update I append newly confirmed text to the previously confirmed text.

      • If an update resends the full current text, I treat that latest value as authoritative.
    • For buffer.*: I treat it as temporary display data, overwriting it on each update and expecting it to change in the next update.

    • I re-render/broadcast only segments that actually changed.

  2. Each update may contain multiple segments.

    • I merge all segments in the batch into state (and, if the same id appears multiple times, I process them in the given order unless advised otherwise).
    • For real-time subtitle UX, I broadcast only the last non-silence segment in that batch to the UI.
    • If the batch contains any silence item, I treat it as a commit trigger (combined with my buffer/idle policy).
  3. I mark a segment final (and persist to DB) when any of these holds:

    • buffer.transcription becomes empty, or
    • Silence arrives and a short gate (~300 ms) passes with no further change, or
    • Idle timeout elapses (~800 ms) with no further change.

    When finalizing, I store only the finalized text for that segment.id and clear its in-memory state.
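A minimal sketch of the merge step from point 1, to make the intended handling concrete. The payload shape is an assumption, not confirmed API: each item is taken to be a dict with `id`, `speaker`, `text` (newly confirmed text to append) and an optional per-segment `buffer` dict holding `transcription`. `SegmentState` and `merge_update` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SegmentState:
    """Client-side view of one segment, keyed by its id."""
    speaker: int
    text: str = ""         # confirmed text; only ever appended to
    buffer_text: str = ""  # temporary text; overwritten on every update


def merge_update(store, segments):
    """Merge one update into `store` and return the ids that changed.

    Assumption: `text` carries only newly confirmed text, so it is appended.
    Items are applied in array order, including repeated ids.
    """
    changed = []
    for item in segments:
        seg_id = item["id"]
        is_new = seg_id not in store
        state = store.setdefault(seg_id, SegmentState(speaker=item["speaker"]))
        before = (state.text, state.buffer_text)
        state.text += item.get("text", "")
        state.buffer_text = item.get("buffer", {}).get("transcription", "")
        touched = is_new or (state.text, state.buffer_text) != before
        if touched and seg_id not in changed:
            changed.append(seg_id)
    return changed
```

Only the ids returned by `merge_update` would need to be re-rendered or broadcast; an update that adds no text and leaves the buffer unchanged returns an empty list.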

If any of the above differs from the intended API semantics, please let me know what to adjust. Thanks again for your work on the project!

YeonjunNotFR, Oct 27 '25 01:10

Hi, good questions. I should have a first release of the new API this week. Your points are correct; some details:

If an update resends the full current text, I treat that latest value as authoritative.

I do not see a case where an update would resend the full text. Validated text is validated; there is no reason for it to change.

For real-time subtitle UX, I broadcast only the last non-silence segment in that batch to the UI.

If two people speak one right after the other, it may be wiser to broadcast more than one segment. Example:

  • ... and that's it
  • Oh wow
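One way to read this suggestion, sketched under assumptions: instead of keeping only the last non-silence segment of a batch, keep the last few. The dict shape (`id`, `speaker`, `text`), the function name `segments_to_broadcast`, and the default of two lines are all hypothetical; only the silence marker `speaker == -2` comes from this thread.

```python
SILENCE_SPEAKER = -2  # speaker value used for silence segments in this thread


def segments_to_broadcast(segments, max_lines=2):
    """Return the trailing non-silence segments of a batch, oldest first."""
    if max_lines <= 0:
        return []
    spoken = [s for s in segments if s["speaker"] != SILENCE_SPEAKER]
    return spoken[-max_lines:]


batch = [
    {"id": 1, "speaker": 0, "text": "... and that's it"},
    {"id": 2, "speaker": SILENCE_SPEAKER, "text": ""},
    {"id": 3, "speaker": 1, "text": "Oh wow"},
]
lines = [s["text"] for s in segments_to_broadcast(batch)]
```

With the example batch, both utterances are kept and the silence item between them is skipped.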

For your third point, good question. I may introduce an is_finalized key to indicate that the segment is over and all associated processing is done. The API may evolve while I work on it if I realise some cases cannot be properly handled.
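Until such a key exists, a client could prefer it when present and fall back to the heuristics from the original post otherwise. The sketch below is only an illustration: `is_finalized` is not a released field, the per-segment `buffer` dict is an assumed shape, `is_final` is a hypothetical helper, and the 0.3 s and 0.8 s thresholds are the values proposed in the question, not recommendations from the maintainer.

```python
def is_final(segment, last_change_at, silence_at, now,
             silence_gate=0.3, idle_timeout=0.8):
    """Decide whether a segment can be committed.

    Times are in seconds on one shared clock (e.g. time.monotonic()).
    `silence_at` is None if no silence segment has been seen yet.
    """
    # Prefer the explicit flag if the server ever sends it (hypothetical key).
    explicit = segment.get("is_finalized")
    if explicit is not None:
        return bool(explicit)
    # Fallback (a): nothing pending in the buffer.
    if segment.get("buffer", {}).get("transcription", "") == "":
        return True
    # Fallback (b): silence seen, and the gate has passed since the later of
    # the silence and the last change, so late confirmations restart the gate.
    if silence_at is not None:
        if now - max(silence_at, last_change_at) >= silence_gate:
            return True
    # Fallback (c): idle timeout with no further change.
    return now - last_change_at >= idle_timeout
```

Passing `now` explicitly keeps the function deterministic, which makes the thresholds easy to tune in tests.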

QuentinFuxa, Oct 27 '25 08:10

Could you check whether I have understood your ideas about switching to the new API correctly?

MIGRATION_PLAN_ENG.md

sh1man999, Nov 12 '25 13:11