
Speaker-blind speech recognition


Depends on #143

Adding a streaming ASR pipeline required a major refactoring, which began with #143. This PR continues that effort by allowing a new type of pipeline that transcribes speech instead of segmenting it. A default ASR model based on Whisper is provided, but the dependency is not mandatory.

Further modifications were needed to make Whisper compatible with batched inference. Note that we do not condition Whisper on previous transcriptions here. I expected this to degrade transcription quality, but I found it rather robust in my experiments with microphone input and spontaneous speech in several languages (English, Spanish and French).
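
For reference, here is a rough standalone sketch (not the PR's actual code) of batched, unconditioned decoding with the openai-whisper API; the file names are placeholders:

```python
import torch
import whisper

model = whisper.load_model("small")

# transcribe() only handles one recording at a time, but decode() accepts a
# batched mel spectrogram, which makes batched inference possible.
audios = [whisper.pad_or_trim(whisper.load_audio(f)) for f in ["a.wav", "b.wav"]]
mels = torch.stack([whisper.log_mel_spectrogram(a) for a in audios]).to(model.device)

# No `prompt` is passed, so each chunk is decoded independently of any
# previous transcription.
options = whisper.DecodingOptions(fp16=torch.cuda.is_available())
for result in whisper.decode(model, mels, options):
    print(result.text)
```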

The new Transcription pipeline can also use a segmentation model as a local VAD to skip non-voiced chunks. In my experiments, this worked better and faster than using Whisper's no_speech_prob.
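
As a rough sketch of the gating idea (not the pipeline's exact internals; the threshold and the decision rule are assumptions for illustration):

```python
import torch
from diart.models import SegmentationModel

# Loading pyannote models requires the optional pyannote.audio dependency
segmentation = SegmentationModel.from_pyannote("pyannote/segmentation")

def is_voiced(chunk: torch.Tensor, threshold: float = 0.5) -> bool:
    # chunk: (batch, channels, samples) -> activity: (batch, frames, speakers)
    with torch.no_grad():
        activity = segmentation(chunk)
    # Treat the chunk as voiced if any speaker exceeds the threshold in at
    # least one frame; non-voiced chunks can skip Whisper entirely.
    return bool((activity > threshold).any())
```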

The new pipeline is also compatible with diart.stream, diart.benchmark, diart.tune and diart.serve (and hence with diart.client too).
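
For example (a hypothetical invocation; the exact flag for selecting the pipeline may differ):

```shell
diart.stream microphone --pipeline Transcription
```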

Still missing

  • README examples and possible restructuring

Changelog

TBD

juanmc2005 · Apr 24 '23