Inconsistent behavior between RealtimeSession version and AudioRecognition version of generate_reply

Open MSameerAbbas opened this issue 3 weeks ago • 1 comments

Bug Description

Currently, generate_reply handles its instructions argument differently in the RealtimeSession and AudioRecognition implementations.

The AudioRecognition version appends the given instructions to the agent's instructions, whereas the Realtime API version replaces the agent's instructions with them.

Expected Behavior

For consistency, I think both versions should either replace or append. Honestly, the best option would be a keyword argument such as append: bool that lets the caller choose.

Reproduction Steps

from livekit.agents import Agent, AgentSession, JobContext, cli, WorkerOptions, RoomInputOptions, JobProcess
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.plugins import openai, silero, noise_cancellation
from dotenv import load_dotenv

load_dotenv()

def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: JobContext):
    # The agent-level instructions hold the fact the model is expected to know.
    agent = Agent(
        instructions="The secret password is 'Banana'",
        llm=openai.realtime.RealtimeModel.with_azure(azure_deployment="gpt-realtime-dev"),
    )
    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
    )
    await session.start(
        agent=agent,
        room=ctx.room,
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

    await ctx.connect()

    # If these instructions replace the agent's, the model cannot answer.
    await session.generate_reply(
        instructions="What is the secret password?"
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

The generate_reply output will probably say that it doesn't know or can't provide the information. But if you ask it again yourself (so this time you're not relying on generate_reply), you'll see it does have the information. The only way I can make sense of this is if the instructions passed to generate_reply are replacing the agent's instructions for that response.
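To make the suspected difference concrete, here is a minimal sketch in plain Python of what each path would hand to the model in the reproduction above. It does not call LiveKit at all. The two function names are hypothetical, and the replace behavior in realtime_instructions is an assumption based on the symptom described here, not something confirmed from the plugin or the API.

```python
from typing import Optional

AGENT_INSTRUCTIONS = "The secret password is 'Banana'"
REPLY_INSTRUCTIONS = "What is the secret password?"


def pipeline_instructions(agent_instructions: str, extra: Optional[str]) -> str:
    """Hypothetical helper: mirrors the llm.LLM branch quoted from
    agent_activity.py, where the extra instructions are appended."""
    if extra:
        return "\n".join([agent_instructions, extra])
    return agent_instructions


def realtime_instructions(agent_instructions: str, extra: Optional[str]) -> str:
    """Hypothetical helper. ASSUMPTION: instructions sent with the response
    replace the agent's instructions for that one response."""
    return extra or agent_instructions


if __name__ == "__main__":
    # Append path: the password is still part of what the model sees.
    print(repr(pipeline_instructions(AGENT_INSTRUCTIONS, REPLY_INSTRUCTIONS)))
    # Replace path: only the question is left, so the model cannot answer.
    print(repr(realtime_instructions(AGENT_INSTRUCTIONS, REPLY_INSTRUCTIONS)))
```

If the assumption holds, the second string no longer contains the password, which would match the reply I'm getting.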

Btw, there's no issue if I pass user_input instead of instructions.

Operating System

Windows 11

Models Used

gpt-realtime through azure

Package Versions

"livekit-agents[azure,deepgram,openai,silero,turn-detector]~=1.2",
"livekit-plugins-noise-cancellation~=0.2",

Session/Room/Call IDs

I don't think this matters because I was testing in the console.

Proposed Solution

Looking at lines 831-836 of agent_activity.py:

        elif isinstance(self.llm, llm.LLM):
            # instructions used inside generate_reply are "extra" instructions.
            # this matches the behavior of the Realtime API:
            # https://platform.openai.com/docs/api-reference/realtime-client-events/response/create
            if instructions:
                instructions = "\n".join([self._agent.instructions, instructions])

Here the instructions are appended for the AudioRecognition implementation, with a comment saying that the Realtime API behaves the same way.

The realtime version's generate_reply(), however, sends the instructions as-is in the response.create event:

    def generate_reply(
        self, *, instructions: NotGivenOr[str] = NOT_GIVEN
    ) -> asyncio.Future[llm.GenerationCreatedEvent]:
        event_id = utils.shortuuid("response_create_")
        fut = asyncio.Future[llm.GenerationCreatedEvent]()
        self._response_created_futures[event_id] = fut
        self.send_event(
            ResponseCreateEvent(
                type="response.create",
                event_id=event_id,
                response=RealtimeResponseCreateParams(
                    instructions=instructions or None,
                    metadata={"client_event_id": event_id},
                ),
            )
        )

I think we could append the instructions to self._instructions before sending, so that both versions behave the same. As a bonus, we could add a parameter such as append: bool, defaulting to True, which appends in both the normal and realtime cases and replaces when set to False.
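A rough sketch of the proposed switch, written as a standalone function so it can be tried without LiveKit. The name resolve_reply_instructions and the append parameter are hypothetical and do not exist in the library today. Returning None when no extra instructions are given is my assumption, chosen to match the `instructions or None` in the realtime snippet above.

```python
from typing import Optional


def resolve_reply_instructions(
    base: str,
    extra: Optional[str] = None,
    *,
    append: bool = True,
) -> Optional[str]:
    """Hypothetical helper: decide what to send as per-reply instructions.

    base   -- the agent's own instructions
    extra  -- the instructions passed to generate_reply, if any
    append -- True: send base followed by extra; False: send extra only
    """
    if not extra:
        # Nothing to add or replace: leave the default instructions in effect.
        return None
    if append:
        # Same join as the llm.LLM branch in agent_activity.py.
        return "\n".join([base, extra])
    # Explicit replace: the caller opted out of the agent's instructions.
    return extra


if __name__ == "__main__":
    base = "The secret password is 'Banana'"
    ask = "What is the secret password?"
    print(repr(resolve_reply_instructions(base, ask)))
    print(repr(resolve_reply_instructions(base, ask, append=False)))
    print(repr(resolve_reply_instructions(base)))
```

Both the realtime and the normal path could call something like this before building their request, so the default would be the same in both places and replacing would become an explicit choice.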

Additional Context

No response

Screenshots and Recordings

No response

Nov 08 '25 21:11 MSameerAbbas
