`SegmentData.Start` and `SegmentData.End` wrap around after 30 seconds
ChatGPT discussion
- [x] I asked Whisper.net Helper and it provided a workaround, but I believe the behavior should be changed, or at least explicitly documented. Here is the discussion link: https://chatgpt.com/g/g-GQU8iEnAa-whisper-net-helper/c/67e45f5d-1a50-8002-b8ab-24bdf0ba0345
Describe the bug
SegmentData.Start and SegmentData.End wrap around after 30 seconds
To Reproduce
Steps to reproduce the behavior:
- Install 'Whisper.net' with any runtime
- Create the WhisperProcessor using the default builder + `.WithLanguageDetection()`
- Use any wave file with > 30 seconds of speech, or generate one
- Decode the file and repeatedly call `ProcessAsync(ReadOnlyMemory<float>)` on chunks of audio
- Iterate through the resulting segments (see the sketch below)
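For reference, a minimal sketch of the chunked processing described above. The model path, the 16 kHz sample rate, and the `DecodeWaveTo16kMonoFloats` helper are assumptions for illustration, not part of the original report:

```csharp
using Whisper.net;

// `DecodeWaveTo16kMonoFloats` is a hypothetical helper that decodes the
// wave file into 16 kHz mono PCM floats (the format Whisper expects).
float[] samples = DecodeWaveTo16kMonoFloats("speech.wav");

using var factory = WhisperFactory.FromPath("ggml-large-v3-turbo.bin");
using var processor = factory.CreateBuilder()
    .WithLanguageDetection()
    .Build();

const int chunkSize = 16000 * 30; // 30 seconds of audio at 16 kHz

for (var offset = 0; offset < samples.Length; offset += chunkSize)
{
    var length = Math.Min(chunkSize, samples.Length - offset);
    ReadOnlyMemory<float> chunk = samples.AsMemory(offset, length);

    await foreach (var segment in processor.ProcessAsync(chunk))
    {
        // Start/End are relative to `chunk`, not to the whole file,
        // so they appear to wrap around every 30 seconds.
        Console.WriteLine($"{segment.Start} -> {segment.End}: {segment.Text}");
    }
}
```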
Expected behavior
`SegmentData.Start` for a segment at the 31st second should return 00:00:31. It actually returns 00:00:01.
The same potentially applies to the timing data in `SegmentData.Tokens`.
Desktop / Servers (please complete the following information):
- OS Version: Windows
- Whisper Version: Large v3 Turbo
- Runtime [e.g. Whisper.net.Runtime.Cuda]
- GPU Driver version
Additional context
N/A
Hello @lostmsu ,
The problem seems to be with the way you're processing the file (calling ProcessAsync() multiple times with chunks of audio, presumably each with 30 seconds of data).
Each individual call is treated as an independent audio file; this is why `SegmentData.Start` will again be 00:00:01: each call is a new "file" whose first segment starts at its own time zero.
More than that, chunking this way can split a word that falls exactly at the 30-second boundary, and that will create errors in the transcript for both calls (the first one will understand part of the word, while the second gets the other part, or just gibberish).
The processor state will also be lost between calls.
I recommend passing the entire file in one call if it is already decoded (assuming it's not too big and fits into memory).
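A minimal sketch of this single-call approach, reusing the assumed `processor` and `samples` from the sketch above:

```csharp
// One call over the whole file: Start/End are then relative to the
// beginning of the file, so no wrap-around occurs.
await foreach (var segment in processor.ProcessAsync(samples.AsMemory()))
{
    Console.WriteLine($"{segment.Start} -> {segment.End}: {segment.Text}");
}
```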
If the file is too big or you have a continuous stream of data (e.g. realtime data) => I recommend you check my other library https://github.com/sandrohanea/echosharp which can incrementally decode audio and combine it with voice activity detection so that words are not cut by chunk boundaries that are unaware of the speech content.
Here is a good example of processing streams of data => https://github.com/sandrohanea/echosharp/blob/main/examples/EchoSharp.Example.MicrosophoneSpeechTranscript/Program.cs
But the same approach can be applied to really long static files as well. Here is the code that does this part: https://github.com/sandrohanea/echosharp/blob/main/src/EchoSharp/SpeechTranscription/EchoSharpRealtimeTranscriptor.cs
Here is exactly the part that adjusts the StartTime based on the already-processed duration (as each "segment" is processed individually, it needs an offset added to its start time) => https://github.com/sandrohanea/echosharp/blob/4fc139a77b623e850975f3de090aa2fd26d23876/src/EchoSharp/SpeechTranscription/EchoSharpRealtimeTranscriptor.cs#L214C13-L214C52
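For readers who want to keep the chunked approach with Whisper.net alone, here is a minimal sketch of the same offset adjustment applied in caller code (an illustration, not echosharp's actual implementation; the 16 kHz mono chunks and variable names are assumptions):

```csharp
// Track how much audio has already been handed to the processor and
// shift each segment's timestamps by that running offset.
var processedDuration = TimeSpan.Zero;

foreach (var chunk in chunks) // each chunk: ReadOnlyMemory<float>, 16 kHz mono
{
    await foreach (var segment in processor.ProcessAsync(chunk))
    {
        var absoluteStart = processedDuration + segment.Start;
        var absoluteEnd = processedDuration + segment.End;
        Console.WriteLine($"{absoluteStart} -> {absoluteEnd}: {segment.Text}");
    }

    // Advance the offset by the duration of the chunk just processed.
    processedDuration += TimeSpan.FromSeconds(chunk.Length / 16000.0);
}
```

Note that this only corrects the timestamps; it does not fix words split at chunk boundaries, which is why the VAD-based chunking above is preferable.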
It would indeed be a nice addition to echosharp to provide some examples and utilities for processing long audio files (not only continuous data); I will consider it.
This is by design in Whisper.net: the start time is relative to the audio that was passed to that ProcessAsync() call.
However, if longer audio is passed in a single call to ProcessAsync, it will produce correspondingly larger StartTime values.