allow pushing frames to VAD when agent speech is uninterruptible

Open chenghao-mou opened this issue 3 weeks ago • 4 comments

This should close #4413

What happened:

  • VAD received audio frames, changing the user state to speaking;
  • Uninterruptible speech was created, and audio frames were then discarded for both STT and VAD, so the user state stayed stuck in speaking.

This PR should allow VAD to keep receiving frames independently of STT. Tested with:

    @function_tool
    async def get_weather(self, location: str) -> str:
        """
        Called when the user asks about the weather.

        Args:
            location: The location to get the weather for
        """
        await asyncio.sleep(5) # <- interrupt here!
        self.session.say("And tomorrow is going to be sunny too.", allow_interruptions=False)
        return f"The weather in {location} is sunny today."
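
The failure mode and the fix described above can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToyVAD`, `ToyRouter`, the `vad_always_on` flag, and the energy threshold are all hypothetical names for illustration and are not the LiveKit implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVAD:
    """Hypothetical stand-in for a VAD: derives user state from frame energy."""
    threshold: float = 0.5
    user_state: str = "listening"

    def push_frame(self, energy: float) -> None:
        self.user_state = "speaking" if energy > self.threshold else "listening"


@dataclass
class ToyRouter:
    """Hypothetical audio router (illustration only, not LiveKit code)."""
    vad: ToyVAD
    vad_always_on: bool  # False = behavior before this PR, True = after
    stt_frames: list = field(default_factory=list)

    def push_frame(self, energy: float, uninterruptible_speech: bool) -> None:
        if uninterruptible_speech:
            # STT never receives frames during uninterruptible speech.
            # With the fix, the VAD still does, so user state keeps updating.
            if self.vad_always_on:
                self.vad.push_frame(energy)
            return
        self.vad.push_frame(energy)
        self.stt_frames.append(energy)


def run(vad_always_on: bool) -> str:
    """Replay the scenario from the description and return the final user state."""
    router = ToyRouter(vad=ToyVAD(), vad_always_on=vad_always_on)
    router.push_frame(0.9, uninterruptible_speech=False)  # user starts talking
    router.push_frame(0.0, uninterruptible_speech=True)   # user goes quiet during agent speech
    return router.vad.user_state
```

With `vad_always_on=False` the silent frame never reaches the VAD, so the state stays at speaking; with `vad_always_on=True` the VAD sees the silence and the state returns to listening.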

chenghao-mou avatar Dec 30 '25 21:12 chenghao-mou

LGTM! Do you think we should ignore the user silence event if the speech is uninterruptible?

Do you mean skipping the wait for user silence when the uninterruptible agent speech hasn't started? I think we should keep it, because to the user the agent is not yet speaking and they are not done talking, especially now that VAD is enabled at all times.

chenghao-mou avatar Dec 31 '25 08:12 chenghao-mou

I think we should keep it, because to the user the agent is not yet speaking and they are not done talking, especially now that VAD is enabled at all times.

But no matter what the user says, the speech won't be cancelled. I think if a speech is uninterruptible, we should let it interrupt the user ASAP. Otherwise, the user may speak a lot and only then find that the agent doesn't actually respond to it.

longcw avatar Dec 31 '25 14:12 longcw

But no matter what the user says, the speech won't be cancelled. I think if a speech is uninterruptible, we should let it interrupt the user ASAP. Otherwise, the user may speak a lot and only then find that the agent doesn't actually respond to it.

That makes sense. But part of me feels we need to differentiate uninterruptible (once started) from uncancellable (once created). Skipping the silence wait might be a good temporary solution.

chenghao-mou avatar Dec 31 '25 14:12 chenghao-mou

I think the context of this matters:

  1. If the user is speaking within a tool call that is not interruptible, the agent should be able to get out of the function call, so I don't think it's necessarily true that if a speech is uninterruptible, it will always be said.
  2. If the user is speaking within a tool call and the tool call is interruptible, then I think both @longcw's and @chenghao-mou's points are valid; the question is whether you want the agent to interrupt immediately or wait for the user to finish talking. I think the connotation of allow_interruptions in a say or generate_reply call is just that the speech shouldn't be interrupted once it's in progress, not that it will be said no matter what.
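
The two options debated in this thread can be written down as a small policy function. This is an illustrative sketch only: `should_wait_for_user_silence` and the `skip_wait_when_uninterruptible` flag are hypothetical names, not part of the LiveKit API, and the real scheduling logic may take more state into account.

```python
def should_wait_for_user_silence(
    user_speaking: bool,
    allow_interruptions: bool,
    *,
    skip_wait_when_uninterruptible: bool,
) -> bool:
    """Decide whether queued agent speech should wait for the user to stop talking.

    Hypothetical policy sketch:
    - skip_wait_when_uninterruptible=True: uninterruptible speech starts right
      away, even over the user, since user input cannot cancel it anyway.
    - skip_wait_when_uninterruptible=False: all speech waits for user silence,
      and allow_interruptions only protects speech that is already playing.
    """
    if not user_speaking:
        # Nothing to wait for.
        return False
    if not allow_interruptions and skip_wait_when_uninterruptible:
        # Uninterruptible speech ignores the pending user silence event.
        return False
    return True
```

Under the first policy an uninterruptible greeting is never delayed by user speech; under the second, the user is never talked over before the agent has started speaking.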

aumeshm avatar Dec 31 '25 17:12 aumeshm

Going to merge this now to unblock the original issue. Opened a new one for better discussion and tracking.

chenghao-mou avatar Jan 02 '26 16:01 chenghao-mou

@chenghao-mou let's ignore the event for uninterruptible speech; otherwise I think it will create another bug where the uninterruptible greeting message is delayed by the user's speech.

longcw avatar Jan 03 '26 01:01 longcw