Speaker-blind speech recognition
Depends on #143
Adding a streaming ASR pipeline required a major refactoring, which began with #143. This PR continues that effort by allowing a new type of pipeline that transcribes speech instead of segmenting it. A default ASR model based on Whisper is provided, but the Whisper dependency remains optional.
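As a rough illustration, creating the new pipeline could look like the sketch below. The names `Transcription`, `TranscriptionConfig` and `SpeechRecognitionModel.from_whisper` are placeholders for this example, not necessarily the final API:

```python
# Hypothetical sketch; the names below are illustrative, not necessarily the final API.
from diart.blocks import Transcription, TranscriptionConfig
from diart.models import SpeechRecognitionModel

# Whisper is only required when a Whisper checkpoint is actually loaded,
# which is what keeps the dependency optional.
asr_model = SpeechRecognitionModel.from_whisper("small")
pipeline = Transcription(TranscriptionConfig(asr=asr_model))
```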
Additional modifications were needed to make Whisper compatible with batched inference. Note that we do not condition Whisper on previous transcriptions here. I expected this to degrade transcription quality, but it proved rather robust in my experiments with the microphone and spontaneous speech in several languages (English, Spanish and French).
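To make the batched, non-conditioned decoding concrete, here is a minimal standalone sketch using openai-whisper directly (chunk duration and model size are arbitrary choices for the example, and this is not the pipeline's internal code):

```python
import torch
import whisper

model = whisper.load_model("small")

# `chunks`: 16 kHz mono waveforms (e.g. 5s each) coming from the stream.
# Silent dummy chunks here just to keep the sketch self-contained.
chunks = [torch.zeros(16000 * 5) for _ in range(4)]

# Pad each chunk to Whisper's fixed 30s window and stack into a single batch.
mels = torch.stack(
    [whisper.log_mel_spectrogram(whisper.pad_or_trim(c)) for c in chunks]
).to(model.device)

# No `prompt` in the options: each chunk is decoded independently,
# i.e. Whisper is never conditioned on previous transcriptions.
options = whisper.DecodingOptions(fp16=False, without_timestamps=True)
results = whisper.decode(model, mels, options)
print([r.text for r in results])
```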
The new `Transcription` pipeline can also use a segmentation model as a local VAD to skip non-voiced chunks. In my experiments, this worked better and faster than using Whisper's `no_speech_prob`.
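The gating idea boils down to something like the following (the thresholds and the exact decision rule are illustrative, not the pipeline's defaults):

```python
import numpy as np

def is_voiced(segmentation: np.ndarray, gamma: float = 0.5, min_ratio: float = 0.05) -> bool:
    """Illustrative chunk-level VAD on top of a local segmentation model.

    `segmentation` holds per-frame, per-speaker activation scores in [0, 1]
    with shape (num_frames, num_speakers). A chunk counts as voiced when at
    least `min_ratio` of its frames have some speaker active above `gamma`.
    Both thresholds are placeholders, not the pipeline's actual defaults.
    """
    frame_is_voiced = segmentation.max(axis=1) > gamma
    return frame_is_voiced.mean() >= min_ratio

# Example: a silent chunk is skipped, so Whisper never runs on it
silence = np.zeros((100, 3))
assert not is_voiced(silence)
```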
`Transcription` is also compatible with `diart.stream`, `diart.benchmark`, `diart.tune` and `diart.serve` (hence `diart.client` too).
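For instance, streaming the pipeline from Python with the same machinery that backs `diart.stream` could look like this. `RealTimeInference` and `MicrophoneAudioSource` are part of diart's existing API, but the constructor arguments and the wiring with the new pipeline are assumptions for this sketch:

```python
from diart.inference import RealTimeInference
from diart.sources import MicrophoneAudioSource

# `pipeline` is the Transcription pipeline from the earlier sketch.
source = MicrophoneAudioSource(16000)  # 16 kHz microphone stream (argument assumed)
inference = RealTimeInference(pipeline, source)
inference()  # run until the source closes, emitting a transcription per chunk
```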
Still missing
- README examples and possible restructuring
Changelog
TBD