Add speaker-aware transcription

Open juanmc2005 opened this issue 2 years ago • 6 comments

Depends on #144

This PR adds a new SpeakerAwareTranscription pipeline that combines streaming diarization and streaming transcription to determine "who says what" in a live conversation. By default, this is shown as colored words in the terminal.

The feature works as expected with diart.stream and diart.serve/diart.client. The main thing preventing full compatibility with diart.benchmark and diart.tune is the evaluation metric: since the pipeline outputs annotated text in the format [speaker0]Hello [speaker1]Hi, diart.metrics.WordErrorRate counts the speaker labels as insertion errors.

Next steps: implement a SpeakerWordErrorRate that computes the (weighted?) average WER across speakers.
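For illustration, here is a rough sketch of what such a metric could look like (not part of this PR; the helper names and the use of jiwer are assumptions for this sketch). It also sidesteps the speaker-permutation problem, i.e. matching predicted speaker labels to reference labels, which a real implementation would need to handle:

```python
import re
from collections import defaultdict

import jiwer  # third-party WER implementation, used here only for illustration


def split_by_speaker(annotated: str) -> dict:
    """Turn '[speaker0]Hello [speaker1]Hi there' into {'speaker0': 'Hello', 'speaker1': 'Hi there'}."""
    per_speaker = defaultdict(list)
    # Each chunk is a [label] followed by the words up to the next label.
    for label, words in re.findall(r"\[(\w+)\]([^\[]*)", annotated):
        per_speaker[label].append(words.strip())
    return {label: " ".join(chunks) for label, chunks in per_speaker.items()}


def speaker_wer(reference: str, hypothesis: str, weighted: bool = True) -> float:
    """Average WER across speakers, optionally weighted by each speaker's reference length."""
    ref_by_spk = split_by_speaker(reference)
    hyp_by_spk = split_by_speaker(hypothesis)
    errors, weights = [], []
    for label, ref_text in ref_by_spk.items():
        if not ref_text:
            continue
        hyp_text = hyp_by_spk.get(label, "")
        # A speaker missing from the hypothesis counts as 100% error (all words deleted).
        errors.append(1.0 if not hyp_text else jiwer.wer(ref_text, hyp_text))
        weights.append(len(ref_text.split()) if weighted else 1)
    return sum(e * w for e, w in zip(errors, weights)) / sum(weights)
```

Whether to weight by each speaker's reference length or average uniformly is exactly the open question above.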

Changelog

TBD

juanmc2005 avatar Apr 26 '23 12:04 juanmc2005

Hey, I am unable to use this:

(diart) :~/live-transcript$ diart.stream output.wav --pipeline SpeakerAwareTranscription
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/rx/core/operators/map.py", line 37, in on_next
    result = _mapper(value)
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/diart/pipelines/speaker_transcription.py", line 325, in __call__
    asr_outputs = self.asr(batch[has_voice])
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/diart/blocks/asr.py", line 65, in __call__
    output = self.model(wave.to(self.device))
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/diart/models.py", line 80, in __call__
    return super().__call__(*args, **kwargs)
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/diart/models.py", line 485, in forward
    batch = whisper.log_mel_spectrogram(batch)
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/whisper/audio.py", line 148, in log_mel_spectrogram
    stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
  File "/home/user/anaconda3/envs/diart/lib/python3.8/site-packages/torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

C0RE1312 avatar Apr 06 '24 07:04 C0RE1312

@C0RE1312 sounds like a problem with PyTorch not being able to compute the FFT. Have you tried updating the dependencies of both torch and whisper? It's a pretty old PR.
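For anyone hitting the same error: a quick way to check whether the problem is in the torch/CUDA install rather than in diart is to call torch.stft on the GPU directly, mirroring what whisper.log_mel_spectrogram does in the traceback above. This snippet is just a suggested standalone check, not something from the PR:

```python
import torch

# Fake 1 second of 16 kHz audio on the GPU.
audio = torch.randn(16000, device="cuda")
# Whisper's log_mel_spectrogram uses N_FFT=400 and HOP_LENGTH=160.
window = torch.hann_window(400, device="cuda")

spec = torch.stft(audio, n_fft=400, hop_length=160, window=window, return_complex=True)
print(spec.shape)  # torch.Size([201, 101]) if cuFFT is working
```

If this minimal call raises the same CUFFT_INTERNAL_ERROR, the fix likely lies in reinstalling torch with a CUDA build that matches the local driver and GPU, or falling back to CPU.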

juanmc2005 avatar Apr 19 '24 12:04 juanmc2005

Will this work with faster-whisper or any other faster version of whisper?

ywangwxd avatar Dec 02 '24 02:12 ywangwxd

BTW, I noticed that the last commit was in April 2023, so this feature hasn't had any new commits for more than a year. Does this mean the implementation is finished but was never merged into the main branch? I also noticed a note in the project's README stating that this feature was coming soon but not yet ready.

ywangwxd avatar Dec 02 '24 03:12 ywangwxd

@ywangwxd unfortunately I haven't had the time to work on this as much as I'd like. I prioritized other things like documentation and testing for #98. This should work okay if we update the dependencies and resolve conflicts with the main branch, but I think it's better to do a full rework because it's not very efficient. To begin with, it uses OpenAI's whisper, while there are faster implementations out there. Please feel free to improve and build on top of this branch and open your own transcription PR; I can take a look regularly and guide you if you want to contribute.
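As a pointer for anyone picking this up: faster-whisper is one such faster implementation, and its transcription API is small enough to sketch here. This is not part of the PR, just an illustration of the kind of backend a rework could swap in; integrating it into diart's streaming ASR block is the real work:

```python
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model; runs on GPU with float16 if available.
model = WhisperModel("small", device="cuda", compute_type="float16")

# Transcribe an audio file (or a NumPy waveform); segments is a lazy generator.
segments, info = model.transcribe("output.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```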

juanmc2005 avatar Dec 05 '24 10:12 juanmc2005

Nice job!!! Can't wait to see the next update.

obitoquilt avatar Mar 11 '25 05:03 obitoquilt