Interrupted by itself when speaker on
Using the simple-chatbot example: when I turn on the speaker, the bot hears itself and interrupts its own response. Is there a way to avoid this?
That can be tough, because the bot talking sounds like a person. :) Were you using the Daily transport? Were you using Silero VAD, or the built-in WebRTC VAD?
One option could be speaker diarization: first detect voice activity (whether it's a human or the bot's own audio coming out of the speaker), then compare the detected voice against the voice the bot is currently producing, and only then decide whether an interruption is warranted.
Would it be possible to disable VAD only until the LLM has finished talking? Disabling VAD through the Daily transport currently prevents getting any response from the LLM at all.
@chadbailey59 I've been using the Daily transport and Silero VAD
Any news on this? I feel like having the microphone and the speaker in the same room is quite a common use case.
What's interesting is that this issue doesn't happen with Daily. So I assume Daily already filters out the output audio?
> So I assume daily already filters out output audio?
Yes, because Daily is using the browser's WebRTC support under the hood, and all the big browsers have built-in noise reduction and echo cancellation to help with this problem. (That's also why you can use Google Meet and other things with your built-in mic and speaker and you don't hear crazy echo all the time.)
> Would it be possible to disable VAD only until the LLM is finished talking?
This is tricky, but I'm happy to discuss it. Right now pipecat does have the option to disable interruptions:
```python
task = PipelineTask(
    pipeline,
    PipelineParams(
        allow_interruptions=False,
        enable_metrics=True,
        enable_usage_metrics=True,
        report_only_initial_ttfb=True,
    ),
)
```
But unfortunately, that's still processing VAD and transcribing user text while the bot is speaking. So if you say "tell me a joke", and then while the bot is speaking you say "tell me a fact about penguins", the bot won't get interrupted—it will finish its joke—but then it will immediately tell you a fact about penguins.
One option would be, as you said, to disable VAD while the bot is talking. But I'm worried about edge cases, where a user starts talking right as the bot finishes, and we end up with a messy overlap where VAD starts a bit too late. We had a lot of trouble with that back in the pre-Pipecat days of working on this problem.
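If you want to experiment with that approach anyway, the core of it is a gate that swallows VAD/user-speech events while the bot is speaking. Here's a toy, framework-free sketch of that logic; the `Event` type and the event names are made up for illustration and only loosely mirror Pipecat's actual frame classes:

```python
from dataclasses import dataclass

# Hypothetical event type, loosely mirroring Pipecat's frames.
@dataclass
class Event:
    kind: str  # "bot_started", "bot_stopped", "user_started", "transcript", ...

class VADGate:
    """Drops user-speech events while the bot is speaking."""

    def __init__(self):
        self.bot_speaking = False
        self.passed = []  # events allowed through to the rest of the pipeline

    def process(self, event: Event):
        if event.kind == "bot_started":
            self.bot_speaking = True
        elif event.kind == "bot_stopped":
            self.bot_speaking = False
        elif event.kind in ("user_started", "transcript", "user_stopped"):
            if self.bot_speaking:
                # Swallow it: most likely the bot hearing its own output.
                return
        self.passed.append(event)

gate = VADGate()
for e in [Event("bot_started"), Event("user_started"), Event("transcript"),
          Event("bot_stopped"), Event("user_started")]:
    gate.process(e)

print([e.kind for e in gate.passed])
# → ['bot_started', 'bot_stopped', 'user_started']
```

Note how the hard cutoff also illustrates the edge case above: a user who starts talking one event before `bot_stopped` gets their `user_started` dropped entirely, so the real fix would probably need some grace window or audio-level heuristic.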
Speaker diarization is another interesting idea, although it does seem like a fairly complex solution to the problem. You could probably do that in a parallel pipeline, but it would be really complex, probably prone to bugs, and it would almost certainly increase the latency of a legit response from the bot, because it would need to wait a certain amount of time to determine the validity of each interruption.
Honestly, I'd recommend grabbing a cheap pair of headphones and moving on to the next problem to solve. :) But if you're determined to work on this, the place to look is actually the response aggregator. The naming is a bit weird, because the one that actually matters for the user response is here for OpenAI-style LLMs, and here for other LLMs. This 'accumulator' watches for VAD activation (UserStartedSpeakingFrame), accumulates transcripts that follow, and stops when VAD stops (UserStoppedSpeakingFrame). You could play around with this approach if you want.
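To make the accumulator's behavior concrete, here's a stripped-down, framework-free version of that accumulate-between-VAD-events pattern. The class and method names are illustrative only, not Pipecat's actual API:

```python
class ResponseAggregator:
    """Accumulates transcript fragments between 'user started speaking'
    and 'user stopped speaking' VAD events, then emits one utterance."""

    def __init__(self):
        self.accumulating = False
        self.fragments = []
        self.utterances = []  # completed utterances, ready for the LLM

    def user_started_speaking(self):
        # Corresponds to UserStartedSpeakingFrame: open the accumulator.
        self.accumulating = True
        self.fragments = []

    def transcript(self, text: str):
        # Transcripts that arrive outside a speaking window are ignored.
        if self.accumulating:
            self.fragments.append(text)

    def user_stopped_speaking(self):
        # Corresponds to UserStoppedSpeakingFrame: flush what we collected.
        if self.accumulating and self.fragments:
            self.utterances.append(" ".join(self.fragments))
        self.accumulating = False

agg = ResponseAggregator()
agg.user_started_speaking()
agg.transcript("tell me")
agg.transcript("a joke")
agg.user_stopped_speaking()
print(agg.utterances)  # → ['tell me a joke']
```

This is where you'd hook in any "ignore this utterance because the bot was talking" logic, since it's the single point where fragments become a committed user turn.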
Closing this issue as it is pretty old. To sum up: this behavior is controlled by the transport. You may experience this issue with LocalAudioTransport, but you will not experience it with a WebRTC-based transport (such as DailyTransport), because WebRTC handles echo cancellation of the looped-back output audio.