Questions about the “New API (Under Development)”
Hi @QuentinFuxa — first off, thank you so much for maintaining this great open-source project. I’m an individual developer using your library, and I really appreciate the time and care you put into it. 🙏
I’m integrating the new API and want to confirm a few details so I can implement the client/server logic correctly. Could you please clarify the points below?
1) Update & Merge Rules
Q1. segment.text update shape
- Does each update send only newly confirmed additional text to be appended to what I already have?
- Or can an update sometimes resend the entire current text for that segment.id (i.e., full replace)?
  - If full replace can occur, should clients treat the latest value as authoritative?
Q2. Multiple entries with the same segment.id within a single update
- Can the same segment.id appear more than once in the same segments array?
- If yes:
  - Is the array order the intended processing order?
  - Should clients apply all items for that id in order (accumulate), or prefer only the last occurrence?
  - Can duplicate content be repeated for the same id (and should clients de-duplicate)?
2) Silence Segments
Q3. Delivery shape for silence (speaker == -2)
- Is silence always delivered as a new segment with a new id appended to the list?
- Or can silence arrive as an update to an existing id?
Q4. Strength of “end of utterance” signal
- May I treat silence as a strong end-of-utterance signal, even if buffer.transcription is not empty yet?
- If recommended, is there a short gate (e.g., ~300 ms) you suggest before committing, to allow late confirmations?
3) Completion / Finalization Signals
Q5. Explicit per-segment completion signal
- Is there any explicit field/event that means “this segment.id will not receive any more confirmed text”?
  - If yes, which field/value?
Q6. Recommended criteria to decide “final” (can be combined)
- (a) buffer.transcription == ""
- (b) Silence segment received (speaker == -2)
- (c) Idle timeout: What value/range do you recommend (e.g., 600–1200 ms)?
- (d) Any other explicit field?
4) Duplicates / Retransmissions
Q7. Duplicate updates
- Can identical content be resent for the same segment.id?
  - If yes, do you recommend id-level de-duplication on the client?
My Intended Handling (please confirm)
- I keep a map keyed by segment.id on both client and server.
- For text: on each update I append newly confirmed text to the previously confirmed text.
  - If an update resends the full current text, I treat that latest value as authoritative.
- For buffer.*: I treat it as temporary display data, overwriting it on each update and expecting it to change in the next update.
- I re-render/broadcast only segments that actually changed.
- Each update may contain multiple segments.
  - I merge all segments in the batch into state (and, if the same id appears multiple times, I process them in the given order unless advised otherwise).
  - For real-time subtitle UX, I broadcast only the last non-silence segment in that batch to the UI.
  - If the batch contains any silence item, I treat it as a commit trigger (combined with my buffer/idle policy).
- I mark a segment final (and persist to DB) when any of these holds:
  - buffer.transcription becomes empty, or
  - Silence arrives and a short gate (~300 ms) passes with no further change, or
  - Idle timeout elapses (~800 ms) with no further change.
  When finalizing, I store only the finalized text for that segment.id and clear its in-memory state.
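To make the intended handling concrete, here is a minimal Python sketch of it. It is not the library's code, and several things in it are assumptions on my side: the update is a dict with a `segments` list, each segment carries `id`, `speaker`, `text`, and a per-segment `buffer.transcription`, and the names `SegmentStore`, `apply`, and `collect_final` are hypothetical. The 300 ms gate and 800 ms idle timeout are just the values I proposed above. I also only treat an empty buffer as "final" once some text has been confirmed, so a brand-new empty segment is not committed immediately.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

SILENCE_SPEAKER = -2  # speaker value used in this thread for silence segments


@dataclass
class SegmentState:
    text: str = ""                      # confirmed text, append-only
    buffer: str = ""                    # unconfirmed text, overwritten on every update
    last_change: float = 0.0            # time of the last observed change (seconds)
    silence_at: Optional[float] = None  # when silence arrived after the last change


class SegmentStore:
    """Hypothetical client-side state keyed by segment id."""

    def __init__(self, silence_gate: float = 0.3, idle_timeout: float = 0.8):
        self.silence_gate = silence_gate
        self.idle_timeout = idle_timeout
        self.live: Dict[int, SegmentState] = {}

    def apply(self, update: dict, now: float) -> Set[int]:
        """Merge one update in array order and return the ids that changed."""
        changed: Set[int] = set()
        for seg in update.get("segments", []):
            if seg.get("speaker") == SILENCE_SPEAKER:
                # Silence is a commit trigger for every segment still in memory.
                for state in self.live.values():
                    if state.silence_at is None:
                        state.silence_at = now
                continue
            state = self.live.setdefault(seg["id"], SegmentState(last_change=now))
            new_text = seg.get("text", "")
            new_buffer = seg.get("buffer", {}).get("transcription", "")
            if new_text or new_buffer != state.buffer:
                state.text += new_text     # assumption: text holds only the new part
                state.buffer = new_buffer  # buffer is temporary display data
                state.last_change = now
                state.silence_at = None    # any change restarts the silence gate
                changed.add(seg["id"])
        return changed

    def collect_final(self, now: float) -> Dict[int, str]:
        """Remove and return segments that meet any of the finalization criteria."""
        done: Dict[int, str] = {}
        for seg_id, state in list(self.live.items()):
            buffer_empty = state.text != "" and state.buffer == ""
            gated_silence = (
                state.silence_at is not None
                and now - state.silence_at >= self.silence_gate
            )
            idle = now - state.last_change >= self.idle_timeout
            if buffer_empty or gated_silence or idle:
                done[seg_id] = state.text  # persist only the confirmed text
                del self.live[seg_id]      # clear in-memory state
        return done
```

In use, I would call `apply(update, now)` for every incoming message, re-render the returned ids, and call `collect_final(now)` on a timer to decide what gets written to the DB.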
If any of the above differs from the intended API semantics, please let me know what to adjust. Thanks again for your work on the project!
Hi, good questions. I should have a first release of the new API this week. Your points are correct; some details:
If an update resends the full current text, I treat that latest value as authoritative.
I do not see a case where an update would resend the full text. Validated text is validated; there is no reason for it to change.
For real-time subtitle UX, I broadcast only the last non-silence segment in that batch to the UI.
If two people talk one right after another, it may be wiser to broadcast more than one. Example:
- ... and that's it
- Oh wow
For your third point, good question: I may introduce an is_finalized key to indicate that the segment is over and all associated processing is done. The API may evolve while I work on it if I realise some cases cannot be properly handled.
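As a rough illustration of the two suggestions above, the following Python sketch picks every changed non-silence segment in a batch (rather than only the last one) and checks for the possible future is_finalized key. The update shape (a `segments` list with `id` and `speaker`) and the helper names `ids_to_broadcast` and `is_finalized` are assumptions for illustration; is_finalized is not part of the API yet, so its absence is treated as "not final".

```python
SILENCE_SPEAKER = -2  # speaker value used in this thread for silence segments


def ids_to_broadcast(update, changed_ids):
    """Return ids of changed non-silence segments, in order of first appearance."""
    ordered = []
    seen = set()
    for seg in update.get("segments", []):
        if seg.get("speaker") == SILENCE_SPEAKER:
            continue  # silence is a commit trigger, not subtitle content
        seg_id = seg["id"]
        if seg_id in changed_ids and seg_id not in seen:
            seen.add(seg_id)
            ordered.append(seg_id)
    return ordered


def is_finalized(seg):
    """Read the possible future `is_finalized` key; a missing key means not final."""
    return bool(seg.get("is_finalized", False))
```

With the two-speaker example above, a batch holding "... and that's it" (id 1) followed by "Oh wow" (id 3) would yield both ids, so both lines reach the UI.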
You can check whether I have understood your ideas about switching to the new API correctly.