DRAFT for Feedback - Support for token streaming for more dynamic UX
Why are these changes needed?
ChatCompletionClient nicely supports token-level streaming via `create_stream`, but this method is currently not accessible from `AssistantAgent`. This proposed change adds an option to pass a `token_callback` when instantiating `AssistantAgent`. If provided:
- `create_stream` will be leveraged instead of `create` when calling `on_messages_stream`, and the provided callback will be called with each returned token as its argument.
This gives the calling application access to the returned tokens in real time. Nothing else is changed; the normal return values of `on_messages_stream` are not affected.
Example:
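A minimal sketch of the proposed usage (the `token_callback` parameter is the proposal in this PR, not part of the released API; the model client and import paths are illustrative):

```python
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

def on_token(token: str) -> None:
    # Called once per token as the model client's create_stream yields it;
    # here we simply echo tokens to stdout for a typewriter-style UX.
    print(token, end="", flush=True)

agent = AssistantAgent(
    name="assistant",
    model_client=OpenAIChatCompletionClient(model="gpt-4o"),
    token_callback=on_token,  # proposed in this PR; not an existing parameter
)
```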
If folks feel this is a good idea, I will make the appropriate updates to documentation and tests.
Related issue number
Checks
- [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
- [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
- [ ] I've made sure all auto checks have passed.
@jspv thanks for the PR! We have an issue to track streaming output from agents: #3862 and #3983. The general idea is to stream partial messages through the async iterator from on_messages_stream. Do you think that approach can meet your need? We haven't started working on that yet.
Yes, that would work. I considered that approach, but as it would be a potentially breaking change, I avoided it for my testing. Happy to take a stab at it. Callers pulling from the iterator would need to be able to differentiate between tokens being returned and other types of messages coming back (tool calls, etc.); it could be as simple as type `str` for tokens and `Response` for non-tokens, or are you thinking of a different response type for tokens?
Since callers of `on_messages_stream` may want the full results (tool calls, etc.) streamed back but not the tokens (which is how it currently works with the underlying client call, `model_client.create()` vs. `model_client.create_stream()`), there would need to be a way to signal to `on_messages_stream` that token streaming is desired, e.g. passing `stream_tokens=True` to `on_messages_stream`.
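For illustration, consuming such an iterator might look roughly like this (the `stream_tokens` flag and the plain-`str` convention for tokens are hypotheticals from the discussion above, not an existing API):

```python
from autogen_agentchat.base import Response
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken

async def consume(agent) -> None:
    stream = agent.on_messages_stream(
        [TextMessage(content="Hello", source="user")],
        cancellation_token=CancellationToken(),
        stream_tokens=True,  # hypothetical flag; not an existing parameter
    )
    async for item in stream:
        if isinstance(item, str):
            # A single streamed token, for real-time display only.
            print(item, end="", flush=True)
        elif isinstance(item, Response):
            # The final, complete response, as on_messages_stream returns today.
            print("\n--- final:", item.chat_message.content)
```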
Thinking more on this: an advantage of the callback model vs. the async iterator is that it works perfectly when invoking group chats, e.g. `RoundRobinGroupChat` and `await self.agent_team.run(task=message)`. The desire for streaming is indicated as part of the agent's instantiation, so I can choose which agents I wish to receive streamed tokens from and which ones I do not. When I call `agent_team.run(task=message)` I get only the tokens I'm interested in via the callbacks. This already works with my minimal code; using the `on_messages_stream` iterator would require a lot of rework for team/group chats.
What I think would make sense is to accept a list of callbacks on agent instantiation (Langchain does this), or alternatively to create methods for registering and removing callbacks from the agent. If there are callbacks registered, the agent will use `create_stream` in `on_messages_stream` instead of `create`, and will call the callbacks on returned tokens with a structure that carries the token and the calling agent.
Effectively this cleanly separates streamed tokens into their own path to the UI for those who want them (I really think this is only a UI need), and leaves all the 'normal' paths for group chats and inter-agent communication untouched.
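A rough sketch of the registration idea (all names here are hypothetical, with Langchain-style callback lists as the inspiration, not an existing autogen API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StreamedToken:
    # The structure passed to callbacks: the token plus the agent that produced it.
    token: str
    agent_name: str

TokenCallback = Callable[[StreamedToken], None]

class TokenStreamingMixin:
    """Hypothetical mixin: an agent with registered callbacks would use
    create_stream instead of create and fan tokens out to the callbacks."""

    def __init__(self) -> None:
        self._token_callbacks: List[TokenCallback] = []

    def register_token_callback(self, cb: TokenCallback) -> None:
        self._token_callbacks.append(cb)

    def remove_token_callback(self, cb: TokenCallback) -> None:
        self._token_callbacks.remove(cb)

    def _emit_token(self, token: str, agent_name: str) -> None:
        for cb in self._token_callbacks:
            cb(StreamedToken(token=token, agent_name=agent_name))
```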
Thoughts?
@jspv since this feature is targeting 0.4.1, do you want to join our discord channel so we can discuss? https://aka.ms/autogen-discord
@jspv would you like to join our community office hours to discuss the changes you proposed here? See #4059
@jspv Agent output should go via the runtime (message publishing). The reason this is important is so that cross-process communication works as expected. The callback approach will only work in a single process.
While agentchat is currently single-process only, we are expanding it to work with the same distributed expectations as core in an upcoming release. So, we will likely get to tackling partial message streaming in 0.4.1. Because of this, we don't want to add callbacks to the `AssistantAgent` included in agentchat by default.
However, in saying all this, if callbacks work well for you and the constraints I mentioned above don't apply to you, then I would encourage you to use them! Given the modular architecture of 0.4 and support for custom agents, it should be really easy for you to do this. Essentially you'd just copy/paste `AssistantAgent`, make your changes, and use it with all of the agentchat classes without modification.
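For anyone taking that route, the core of the change is small. A hedged sketch of the inner loop such a custom agent would need, assuming the 0.4 `autogen_core` model-client interface (`create_stream` yields `str` chunks and finishes with a `CreateResult`):

```python
from typing import Optional
from autogen_core.models import ChatCompletionClient, CreateResult, UserMessage

async def complete_with_streaming(
    client: ChatCompletionClient, prompt: str, on_token
) -> Optional[CreateResult]:
    # Drive create_stream instead of create, forwarding each text chunk
    # to on_token; the final CreateResult is returned as create() would.
    result: Optional[CreateResult] = None
    async for chunk in client.create_stream(
        [UserMessage(content=prompt, source="user")]
    ):
        if isinstance(chunk, str):
            on_token(chunk)  # a streamed token / text chunk
        elif isinstance(chunk, CreateResult):
            result = chunk  # the complete response
    return result
```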
Understood. Thanks for the feedback; happy to assist where I can. My thinking on high-level requirements so far is:
- Token streaming should be enabled/disabled as an option on the agent, not the team, as some agents' TextMessages may not be suitable for streaming (e.g. large blocks of text, structured non-conversational output, etc.)
- Agent token streaming should be a toggleable property and not permanently set when agents are instantiated
- Token streaming is primarily a UI feature; the streamed tokens are not relevant to chat history, model context, saved/loaded state, inter-agent messages, etc. All the information relevant to those is captured in existing messages. E.g. after any stream of tokens, when the completion is finished, the standard TextMessage with the entire response should still be published to all the agents; other agents don't need to receive the token-by-token messages, really just the UI.
- This implies that a somewhat different message mechanism be created for streamed tokens, one that doesn't necessarily publish to all agents, but would still be awaitable by the calling application (e.g. exposed through the awaitable `team.run_stream` or `agent.on_messages_stream`, perhaps with an identifiable `StreamedTokenMessage` type or similar if that is the method of choice; see the sketch after this list).
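Under that design, the calling application might consume tokens roughly like this (`StreamedTokenMessage` is the placeholder name from the list above; the whole snippet is a sketch, not an existing API):

```python
from dataclasses import dataclass

@dataclass
class StreamedTokenMessage:
    # Hypothetical event type carrying a single streamed token.
    source: str   # which agent produced the token
    content: str  # the token text

async def run_ui(team, task: str) -> None:
    async for event in team.run_stream(task=task):
        if isinstance(event, StreamedTokenMessage):
            # UI-only: render the partial text; not added to chat history or state.
            print(event.content, end="", flush=True)
        else:
            # Normal messages (TextMessage, tool events, final result) unchanged.
            print(f"\n[{type(event).__name__}] {event}")
```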
Does this seem reasonable? Is there a natural approach to modifying the message structure to support this? I'm happy to prototype the change.
@jspv we have a PR now for streaming tokens. Could you take a look? #5208
It is set as a permanent property of the agent; however, we can easily add a toggle to flip it.
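For reference, usage per that PR looks roughly like the following, assuming the `model_client_stream` flag and `ModelClientStreamingChunkEvent` type it introduces:

```python
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import ModelClientStreamingChunkEvent
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    agent = AssistantAgent(
        name="assistant",
        model_client=OpenAIChatCompletionClient(model="gpt-4o"),
        model_client_stream=True,  # stream tokens via the model client
    )
    async for event in agent.run_stream(task="Write a haiku about streams."):
        if isinstance(event, ModelClientStreamingChunkEvent):
            print(event.content, end="", flush=True)  # one chunk at a time

asyncio.run(main())
```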
Just tested, worked very well. Agree, a toggle would be helpful. Will close this PR. Thanks!