RealtimeSTT

New feature: speaker recognition

Open Hotohori opened this issue 7 months ago • 7 comments

Would that be possible?

I'm new to this topic and am still reading up on it. I haven't run any tests myself yet, but it should be possible, right after Silero detects a voice, to check whether it is a specific (trained) voice, so that the STT only works for one or more trained voices. If the voice is unknown, it gets ignored and is never sent to Whisper.

Hotohori avatar May 07 '25 11:05 Hotohori

Somewhat possible but hard to realize reliably.

One thing that's really hard to solve is diarization latency. We need at least 1.5 sec of audio material from a speaker to compare it to a reference audio; with less than that it gets really unreliable. On top of that we sometimes have to deal with unclean audio (other speakers cutting in within that 1.5 sec, background noises, etc.). It's doable but far from trivial.

KoljaB avatar May 07 '25 12:05 KoljaB

Hm, so there is no solution yet that can do the check with less than 1.5 seconds of audio. You can say a lot of short things in under 1.5 seconds...

Maybe you can clean up the audio with a filter; there are a lot of possibilities.

Well, the main reason I thought about it in the first place was that when RealtimeTTS is playing voice output, RealtimeSTT sometimes picks up and transcribes that voice, interrupting itself and triggering a new LLM generation. I had built in a very rudimentary feature to be able to interject while the AI is speaking, and then this happened again and again. A typical feedback loop.

So I came up with this idea because it would be the smartest solution and would help with other problems as well. But I guess that thought was a bit too simple.

I read that you can also check the mic input against your speaker output, but I have no clue yet whether that would be a better/easier solution.

Hotohori avatar May 07 '25 16:05 Hotohori

I'm fully aware of this problem. The most straightforward solution is echo cancellation. Most browsers have this built in, so if your end device is a browser you have it mostly solved. It gets complicated if you want a native Python solution; there are no really well-performing echo-cancellation libraries out there.
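If you only need a rough workaround rather than real echo cancellation, the mic-vs-speaker-output comparison mentioned above can be sketched in plain numpy. This is only an illustration, not something from RealtimeSTT: it assumes the mic chunk and the playback chunk are already time-aligned and share the same sample rate, which real echo paths will violate.

    import numpy as np

    def looks_like_playback_echo(mic_chunk: np.ndarray,
                                 playback_chunk: np.ndarray,
                                 threshold: float = 0.6) -> bool:
        """Rough check whether the mic chunk is mostly our own TTS output.

        Both chunks are float32 mono at the same sample rate and roughly
        time-aligned (a strong assumption in practice).
        """
        n = min(len(mic_chunk), len(playback_chunk))
        if n == 0:
            return False
        a = mic_chunk[:n] - np.mean(mic_chunk[:n])
        b = playback_chunk[:n] - np.mean(playback_chunk[:n])
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom < 1e-8:
            return False
        # Normalized correlation at zero lag; real echo has delay, so a
        # production version would search over lags instead.
        corr = float(np.dot(a, b) / denom)
        return corr > threshold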

The speaker diarization solution is not perfect either. You'd want 100% certainty, but even with 1.5 sec of clean audio material it's really hard to tell absolutely reliably "this is speaker xy".

You can see here how long different embedding providers need to tell when a speaker changes:

[Image: latency comparison of different embedding providers for detecting speaker changes]

KoljaB avatar May 07 '25 16:05 KoljaB

I see, that indeed doesn't look very good. Echo cancellation sounds better here. I'm not sure yet if I want to use the browser solution, since it's more limited and bound to a web app, but I guess it's fine for the beginning.

Thanks for that overview.

Hotohori avatar May 07 '25 18:05 Hotohori

Any advice on how it might be implemented if I were okay with a delay? I'm thinking the STT can be realtime, and the diarization is also realtime but lags behind by 10 seconds or so. My particular use case would be fine with that.

ObjectiveTruth avatar May 09 '25 03:05 ObjectiveTruth

Use the callbacks on_realtime_transcription_update or on_realtime_transcription_stabilized to retrieve the realtime text. In the callback method, access the current audio bytes.

I suggest processing that in another thread so it doesn't block the main RealtimeSTT processing. I use signaling with QtCore.pyqtSignal(str, np.ndarray) here, but putting everything into a queue and processing it from another thread is totally fine (a queue-based sketch follows the snippet below).

        # Defined at class level on a QObject subclass (PyQt signals have
        # to be class attributes, not instance attributes):
        realtime_audio_array_signal = QtCore.pyqtSignal(str, np.ndarray)

        def realtime_transcription_update(text):
            # Collect all audio frames recorded so far into one int16 array
            audio_array = np.frombuffer(
                b''.join(self.recorder.frames),
                dtype=np.int16
            )
            self.realtime_audio_array_signal.emit(text, audio_array)
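
As an alternative to the Qt signal above, here is a minimal queue-based sketch of the same idea. The names diarization_queue and process_speaker_audio are just placeholders for illustration, and recorder stands for your AudioToTextRecorder instance:

    import queue
    import threading

    import numpy as np

    diarization_queue = queue.Queue()

    def realtime_transcription_update(text):
        # Same as above: grab the audio recorded so far and hand it off
        audio_array = np.frombuffer(
            b''.join(recorder.frames),
            dtype=np.int16
        )
        diarization_queue.put((text, audio_array))

    def diarization_worker():
        # Runs in its own thread so embedding extraction never blocks
        # the main RealtimeSTT loop
        while True:
            text, audio_array = diarization_queue.get()
            process_speaker_audio(text, audio_array)  # your own handler

    threading.Thread(target=diarization_worker, daemon=True).start()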

Now you have the audio bytes from the speaker, updated for every partial transcription. Use these bytes and compare them with a voice embedding provider.

Some code:

    # Requires at module level: import numpy as np, torch, torchaudio
    def _get_current_chunk_embedding(self, audio_buffer, last_seconds: float = None):
        # Convert buffer to float32 numpy array and normalize to [-1, 1]
        chunk_np = np.frombuffer(bytes(audio_buffer), dtype=np.int16).astype(np.float32)
        chunk_np /= 32768.0

        # If last_seconds is set and valid, trim the buffer to the most recent samples
        if last_seconds and last_seconds > 0:
            num_samples = int(self.sample_rate * last_seconds)
            if chunk_np.size > num_samples:
                chunk_np = chunk_np[-num_samples:]

        # Check for empty buffer after trimming
        if chunk_np.size == 0:
            return {"resemblyzer": None, "ecapa": None, "pyannote": None}

        # ---- RMS NORMALIZATION FOR QUIET AUDIO ----
        # This boosts the overall level if it's really low
        rms = np.sqrt(np.mean(chunk_np ** 2))
        target_rms = 0.1  # Adjust to taste
        if rms > 1e-8:    # Avoid division by zero
            chunk_np *= (target_rms / rms)
        # -------------------------------------------

        # Save the trimmed audio to a temporary WAV file
        tmp_wav = "temp_mic_chunk.wav"
        waveform_torch = torch.from_numpy(chunk_np).unsqueeze(0)
        torchaudio.save(tmp_wav, waveform_torch, self.sample_rate)

        result = {}
        # Extract embeddings using the provided extractors
        for name, extractor in [
            ("resemblyzer", self.resemblyzer_extractor),
            ("ecapa", self.ecapa_extractor),
            ("pyannote", self.pyannote_extractor),
        ]:
            try:
                emb = extractor.get_speaker_embedding(tmp_wav)
            except Exception:
                emb = None
            result[name] = emb

        return result
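
To actually decide whether the current chunk matches a trained speaker, one option is to compare embeddings with cosine similarity. Here is a minimal sketch using Resemblyzer; the reference file name, the helper name is_known_speaker, and the threshold are assumptions you'd need to adapt and tune, not part of RealtimeSTT:

    import numpy as np
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()

    # Enroll the known speaker once from a clean reference recording
    reference_embedding = encoder.embed_utterance(preprocess_wav("reference_speaker.wav"))

    def is_known_speaker(chunk_wav_path: str, threshold: float = 0.75) -> bool:
        # Embed the current mic chunk and compare it to the reference
        chunk_embedding = encoder.embed_utterance(preprocess_wav(chunk_wav_path))
        similarity = float(np.dot(reference_embedding, chunk_embedding) /
                           (np.linalg.norm(reference_embedding) *
                            np.linalg.norm(chunk_embedding)))
        # The threshold is a rough guess; tune it on your own recordings
        return similarity > threshold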

KoljaB avatar May 09 '25 09:05 KoljaB

Thanks so much @KoljaB !

neo-picasso-2112 avatar May 25 '25 08:05 neo-picasso-2112