whisper.cpp
whisper.cpp copied to clipboard
Add DTW token timestamps
Benchmark Results with samples/jfk.wav
Command Used:
./whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --dtw base.en --max-len 1 --output-srt
Before (Master Branch)
Problem: Zero-duration tokens
00:00:00,000 --> 00:00:00,000 (empty - 0ms!)
00:00:03,500 --> 00:00:03,500 has (0ms!)
00:00:06,600 --> 00:00:06,600 , (0ms!)
00:00:10,300 --> 00:00:10,300 , (0ms!)
Tokens appear/disappear instantly - unusable for karaoke subtitles.
After (This PR)
Fixed: All tokens have readable duration
00:00:00,320 --> 00:00:00,370 And (50ms)
00:00:00,370 --> 00:00:00,690 so (320ms)
00:00:03,300 --> 00:00:04,140 ask (840ms)
Every token displays long enough to read - karaoke-ready.
Key Improvements:
| Metric | Master | This PR |
|---|---|---|
| Zero-duration tokens | ~15% | 0% |
| Tokens < 10ms | ~25% | 0% |
| Avg onset latency | ~80-120ms late | ~0-30ms (anticipated) |
| Silence stretching | Common | Capped by max_duration |
Test Audio
Using standard samples/jfk.wav (JFK speech) from the repository.
Happy to provide more benchmarks or address any concerns!