Inconsistent behavior between RealtimeSession version and AudioRecognition version of generate_reply
Bug Description
Currently, generate_reply handles instructions differently in the realtime and audio recognition implementations.
The audio recognition version appends the instructions to the agent's instructions, whereas the Realtime API version replaces them.
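To make the difference concrete, here is a minimal pure-Python model of the two paths as described in this issue. Both helper functions are hypothetical and only imitate the reported behavior; they are not livekit-agents APIs. The realtime model assumes that instructions sent in a response.create event take the place of the session instructions for that one response.

```python
# Simplified model of the two code paths described in this issue.
# These helpers are hypothetical; they are not part of livekit-agents.


def pipeline_effective_instructions(agent_instructions: str, extra: str) -> str:
    """LLM pipeline path: extra instructions are appended to the agent's."""
    if extra:
        return "\n".join([agent_instructions, extra])
    return agent_instructions


def realtime_effective_instructions(agent_instructions: str, extra: str) -> str:
    """Realtime path as observed: extra instructions are used on their own."""
    return extra or agent_instructions


agent_instructions = "The secret password is 'Banana'"
extra = "What is the secret password?"

pipeline_result = pipeline_effective_instructions(agent_instructions, extra)
realtime_result = realtime_effective_instructions(agent_instructions, extra)

# The pipeline path still carries the password; the realtime path does not.
print("Banana" in pipeline_result)
print("Banana" in realtime_result)
```

Under this model the realtime response never sees the password, which would explain the answer in the reproduction.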
Expected Behavior
For consistency, I think both should either replace or both should append. Honestly, the best option would be a keyword argument such as append: bool that lets us choose.
Reproduction Steps
from livekit.agents import Agent, AgentSession, JobContext, cli, WorkerOptions, RoomInputOptions, JobProcess
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit.plugins import openai, silero, noise_cancellation
from dotenv import load_dotenv

load_dotenv()


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    agent = Agent(
        instructions="The secret password is 'Banana'",
        llm=openai.realtime.RealtimeModel.with_azure(azure_deployment="gpt-realtime-dev"),
    )
    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
    )
    await session.start(
        agent=agent,
        room=ctx.room,
        room_input_options=RoomInputOptions(
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )
    await ctx.connect()
    await session.generate_reply(
        instructions="""What is the secret password?"""
    )


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
The reply from generate_reply will probably say it doesn't know or can't provide the information, but if you ask again yourself (so you're no longer relying on generate_reply), you'll see it does have the information. The only way I can make sense of this is that the instructions passed to generate_reply are replacing the system instructions for that response.
Btw, there's no issue if I send user_input instead of instructions.
Operating System
Windows 11
Models Used
gpt-realtime through azure
Package Versions
"livekit-agents[azure,deepgram,openai,silero,turn-detector]~=1.2",
"livekit-plugins-noise-cancellation~=0.2",
Session/Room/Call IDs
Don't think this matters, since I was testing in the console.
Proposed Solution
If we look at lines 831-836 of agent_activity.py:
elif isinstance(self.llm, llm.LLM):
    # instructions used inside generate_reply are "extra" instructions.
    # this matches the behavior of the Realtime API:
    # https://platform.openai.com/docs/api-reference/realtime-client-events/response/create
    if instructions:
        instructions = "\n".join([self._agent.instructions, instructions])
Here the instructions are appended for the AudioRecognition implementation, with a comment saying that the Realtime API behaves the same way.
However, the realtime version's generate_reply() sends the instructions as-is in the response.create event:
def generate_reply(
    self, *, instructions: NotGivenOr[str] = NOT_GIVEN
) -> asyncio.Future[llm.GenerationCreatedEvent]:
    event_id = utils.shortuuid("response_create_")
    fut = asyncio.Future[llm.GenerationCreatedEvent]()
    self._response_created_futures[event_id] = fut
    self.send_event(
        ResponseCreateEvent(
            type="response.create",
            event_id=event_id,
            response=RealtimeResponseCreateParams(
                instructions=instructions or None,
                metadata={"client_event_id": event_id},
            ),
        )
    )
I think we could append the instructions to self._instructions before sending, which would make the behavior the same. As a bonus, we could add a parameter like append: bool, defaulting to True, that appends in both the normal and realtime cases and replaces when set to False.
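A rough sketch of what the shared logic could look like. The function name build_reply_instructions and the append parameter are hypothetical and exist only to illustrate the proposal; nothing here is existing livekit-agents API. It returns None when no extra instructions are given, mirroring the `instructions or None` in the realtime code above, so the caller decides what the default is.

```python
from typing import Optional


def build_reply_instructions(
    agent_instructions: str,
    extra: Optional[str],
    *,
    append: bool = True,
) -> Optional[str]:
    """Hypothetical helper: combine agent instructions with per-reply ones.

    append=True  -> agent instructions followed by the extra instructions
    append=False -> only the extra instructions (replace)
    """
    if not extra:
        # Nothing to override; the caller falls back to its defaults.
        return None
    if append:
        return "\n".join([agent_instructions, extra])
    return extra


base = "The secret password is 'Banana'"
question = "What is the secret password?"

appended = build_reply_instructions(base, question)
replaced = build_reply_instructions(base, question, append=False)
```

Both the realtime and the pipeline generate_reply paths could then call the same helper, so the only difference between them would be how the resulting string is sent.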
Additional Context
No response
Screenshots and Recordings
No response