README addition proposal: summary overview of different models to help newbies choose
Firstly, congrats on a great piece of software, thanks all who work on it.
I feel like somewhere on the web I saw someone advise which model to choose, but I've since lost it. I think it would be a real benefit to new users to have a brief explainer of the pros and cons of each model (and sub-model), ideally ordered by which model most people should choose. You've done this a bit in the model-choice settings with the speed and quality metrics, but there's no obvious way to choose between models that share the same metrics.
Nor does one get a sense of the value of the trade-off between size and performance: on a fresh system install, I potentially don't care about 30 GB vs 45 MB, IF the 30 GB install is perfect and I never think about it again, and IF the 45 MB model is rubbish and makes me abandon the program entirely.
I suspect you'd have to update these descriptions and the leaderboard order as models are updated, but I assume you're keeping abreast of developments in this space, so that may not be the biggest deal?
As well as a paragraph-style write-up with an 'abstract' first paragraph that gives folks the minimal info on what most people should choose, perhaps you could have a second table, this time with models as rows and features/ratings as columns.
Cheers in advance!
edit: potentially could link to this for info/background, though it looks to have different models than offered here?
Hi. Thank you for sharing your idea and all your suggestions.
In short, it's difficult. I deliberately decided not to evaluate models and their capabilities because it requires too much effort. Moreover, objective evaluation is almost impossible. Some models work great, but only with a specific GPU type, while others are more versatile in terms of hardware requirements but less accurate or generate less natural speech. To do this well, Speech Note would have to detect the hardware and decide what is recommended based on that. I'm not saying it can't be done - it can, but it's difficult.
I realize there is a problem. The user is confused because there are too many choices. If they make the wrong choice, the result will be "rubbish". This problem needs to be solved somehow.
I am thinking of a simpler solution. For each language, there will be a set of "recommended" STT+TTS models. These will be models that work on almost any hardware and are "good enough". This will allow the user to simply click "install recommended" and start using the app without the risk of making the "wrong choice".
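If it helps to picture the idea, a per-language "recommended" set could be as simple as a lookup table with a safe fallback. This is just a sketch: the model names are the ones mentioned later in this thread, and the whole structure is hypothetical, not Speech Note's actual data model.

```python
# Hypothetical per-language "recommended" STT+TTS combos that should run on
# almost any hardware. The identifiers below are illustrative only.
RECOMMENDED = {
    "en": {"stt": "WhisperCpp Base", "tts": "Piper Medium (en)"},
    "de": {"stt": "WhisperCpp Base", "tts": "Piper Medium (de)"},
}

# Fallback for languages without a curated entry.
DEFAULT = {"stt": "WhisperCpp Base", "tts": "Piper Medium"}

def recommended_models(lang: str) -> dict:
    """Return a 'good enough' STT+TTS combo for the given language code."""
    return RECOMMENDED.get(lang, DEFAULT)
```

An "install recommended" button would then just install whatever this lookup returns for the user's language, with no risk of a "wrong choice".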
Recommended is a great idea - expert curation is always a good solution IMO.
If you already have/know a combo of what works with what hardware, potentially you could link to a (non-editable, or only trusted-parties-editable) Google spreadsheet, so users can filter and/or sort for their own setup, essentially finding the curated recommendations pertinent to them?
I think I will be able to select models that can potentially work on any hardware. It's not very difficult - you could say that WhisperCpp Base + any Piper Medium should be a "good enough" configuration. Compiling a compatibility matrix for more advanced models is a much more difficult task...
Hi bud, thanks for the intel. Just circling back to this as I'm not sure how else to find this out:
On my Android, the Google Keyboard microphone... 'feature' (?) lets me speak and get seemingly accurate text back almost immediately, with zero setup. So on processing speed vs quality vs setup scores: 10 (faster is better), 9? (I haven't tested it extensively), 10 (already installed).
Whereas I've not found any speed difference between WhisperCpp Base and FasterWhisper Large-v3 Turbo: from the point I stop speaking, it takes a little over 5 seconds to get text back. Which would potentially be fine if I were sending a huge long stream of audio, but because the detection threshold for 'speech ended' is so strict, I can only send small snippets anyway. So I say a single line, then have to wait around 6 or 7 seconds before I'm ready to go again. So scores-wise: 2, 10, 1.
I feel like I'm probably missing something, though, or have it set up poorly.
- Do you know if there's a recommended best 'fast' model, i.e. as good quality as possible but don't sacrifice speed?
- I'll try 'press & hold' for the listening mode
- With press once, and press & hold, all the audio is sent after listening has completed, correct? But with Always On... when does it process? Whenever it hears any speech and then silence? So the option to Pause Listening While Processing would suggest you:
- start it listening
- begin talking
- stop talking; break detected; (listening paused if setting activated) processing starts; processing completes
- STT returned?
Thanks!
ChatGPT: If it’s slow in SpeechNote: You may be running the large model on CPU only → very slow. Try base or small models for real-time dictation. Build with GPU (CUDA/ROCm/OpenCL) support for much faster inference.
Optimized forks:
faster-whisper (C++/Python bindings with quantized models).
whisper.cpp with --threads and --gpu options tuned for your CPU/GPU.
On your PC, Whisper likely runs in full-sentence mode (waits for pauses, processes large chunks).
On your phone, Google’s pipeline does per-word incremental recognition, then adjusts words on the fly.
Whisper can be configured to do streaming transcription — but SpeechNote may not be using those flags.
Before abandoning Whisper, I’d check: are you running --translate --task transcribe --model base.en --threads
@SimonDedman Sorry, I was on vacation and missed your comment.
Which would potentially be fine if I was sending a huge long stream of audio, but because the detection threshold for 'speech ended' is so strict, I can only send small snippets anyway. So I say a single line, then have to wait around 6 or 7 seconds before I'm ready to go again
Yep, unfortunately it works like this at the moment. This 'speech ended' timeout is hard coded. I will make it configurable in the next version.
Do you know if there's a recommended best 'fast' model, i.e. as good quality as possible but don't sacrifice speed?
Not really. If accuracy is not a top priority, you can try Vosk Small or Vosk Large. Vosk supports "intermediate results", which means you get results in a live stream. Whisper models do not support this.
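To illustrate the difference described above for anyone following along: a model with "intermediate results" (like Vosk) shows you a growing partial transcript while you speak, whereas a batch model (like Whisper) returns one final transcript per audio chunk. A toy simulation of the two behaviours - nothing here uses the real Vosk or Whisper APIs:

```python
def batch_result(words):
    """Whisper-style: one final transcript after the whole chunk is processed."""
    return " ".join(words)

def intermediate_results(words):
    """Vosk-style: yield a growing partial transcript after each recognized word."""
    partial = []
    for w in words:
        partial.append(w)
        yield " ".join(partial)
```

With intermediate results, the UI can update live ("hello", then "hello world"); with batch processing, nothing appears until the whole chunk is done.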
With press once, and press & hold, all the audio is sent after the listening has completed correct? But with Always On... when does it process?
Exactly, in "Press & hold" mode, recording ends when you release the button and the recorded audio data is processed. In "Always On" mode, STT processing starts as soon as silence is detected. Since Whisper processes audio data in batches, Speech Note cuts the audio data into chunks that are divided at points of silence.
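A rough sketch of the "cut at points of silence" behaviour described above, using a naive amplitude threshold on raw sample values. The real implementation in Speech Note presumably uses proper voice-activity detection and different parameters; this is just to show the chunking idea:

```python
def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Cut a stream of audio samples into chunks at runs of silence.

    `threshold` is the amplitude below which a sample counts as silent;
    `min_silence` is how many consecutive silent samples end a chunk.
    Both values are arbitrary placeholders for this sketch.
    """
    chunks, current, silent_run = [], [], 0
    for s in samples:
        silent_run = silent_run + 1 if abs(s) < threshold else 0
        current.append(s)
        if silent_run >= min_silence and any(abs(x) >= threshold for x in current):
            # Close the chunk at the silence boundary and start a new one.
            chunks.append(current[:-silent_run])
            current, silent_run = [], 0
    if any(abs(x) >= threshold for x in current):
        chunks.append(current)
    return chunks
```

Each chunk returned here would then be handed to the STT engine as one batch, which matches the "I say a single line, then wait" experience described earlier in the thread.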
Whisper can be configured to do streaming transcription — but SpeechNote may not be using those flags.
This is a new feature in the current version of the whisper-cpp library. Speech Note uses an older version, so this feature is not yet available. I will investigate how "streaming in Whisper" can be implemented in a future Speech Note version.
Brilliant stuff, thanks for letting me know - I'll subscribe to package updates. Hope you had a great holiday!
Hello mate, just circling back to this - is there any way you know that I can stay updated on when (/if) this feature drops?
Hi. Unfortunately, not soon, sorry. I haven't started work on the new version yet. Right now, I'm focusing on another project. I plan to move my full attention back to Speech Note at the beginning of next year.
this feature drops
This feature means "streaming transcription", right?
Sorry, yeah - streaming transcription.
Thanks!