[Roadmap] Streaming support
Stream messages from the agent's reply.
### Tasks
- [ ] https://github.com/microsoft/autogen/issues/633
- [ ] Move print out of client.py for streaming
- [ ] Add support to streaming function call
- [ ] https://github.com/microsoft/autogen/pull/1551
No one is working on this issue as far as I know. Any volunteer to take the lead?
A bit of criticism... Is Streaming needed?
AutoGen doesn't seem to be a user-facing tool where UI/UX is a top priority, while streaming mainly addresses user-experience concerns (interactivity and visible progress). After all, the performance (total time to complete the request) is the same whether streaming is enabled or not.
On the other hand, implementing streaming looks like a large piece of work, and many parts will be affected and become harder to maintain:
- Working with chunks in an async manner might require creating two flavours of every API that uses completions
- Cost accounting will be broken right away: streaming APIs (at least OpenAI's) don't return token stats for streamed responses. OpenAI suggests using tiktoken to do your own accounting, though in my experience tiktoken always shows a ~1% discrepancy from what OpenAI reports, so costing will become less accurate (a minimal sketch of such counting follows below).
IMO, it is a Large piece of work for a Small value (speaking in terms of S-M-L sizing and prioritizing).
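For context, the client-side accounting OpenAI suggests looks roughly like this; a minimal sketch using `tiktoken`, where the chunk list is assumed to be the streamed completion text:

```python
# Sketch of tiktoken-based token counting for a streamed response.
# Counts typically differ slightly (~1%) from OpenAI's reported usage.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_streamed_tokens(chunks: list[str]) -> int:
    """Approximate the completion token count from streamed text chunks."""
    return len(enc.encode("".join(chunks)))
```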
@sonichi I'm actually working on this already. Happy to pick this up.
Thanks. One thing to pay attention to is #203 . If your work uses the streaming feature from openai, better target the newer version. I'm currently working on #203 without the streaming part.
@sonichi I've also done some work to enable messages to be sent over sockets rather than printing in a terminal. Do you think that is something that I should open a PR for? Would that be useful?
That sounds useful because I've heard many different people trying to do that. @victordibia @AaronWard may be interested and feel free to pull others who are also interested.
ok sounds good. I'll start a new issue and push the changes I have so far. @victordibia @AaronWard please refer to #394
I replied on #394 . One thing to note here is that the work by @ragyabraham is more focused on streaming completed responses from each agent within an active conversation, not directly streaming the tokens from each agent as they are generated by an llm. The latter is more involved and has unclear use cases/benefits (as mentioned by @maxim-saplin above). @ragyabraham kindly confirm that this is your focus here?
Hi @victordibia, I intend to utilise streaming to chunk all responses from the LLM. The approach we are thinking of is:
- we chunk the response and utilise some sort of messaging framework to emit messages to a server (e.g. socketio that sends messages to the FE)
- chunks are aggregated in memory (e.g. `string += chunk`); once all chunks have been consumed, the complete message can be sent to the intended team member/recipient (a minimal sketch of this approach follows below)
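A minimal sketch of that approach, assuming the third-party `python-socketio` client and a relay server at `http://localhost:5000` that forwards events to the FE; the event names (`llm_chunk`, `llm_message`) and the `consume_stream` helper are illustrative, not an AutoGen API:

```python
# Sketch only: emit each streamed chunk over Socket.IO, aggregate in memory,
# and hand the full message to the recipient once the stream is exhausted.
import socketio

sio = socketio.Client()
sio.connect("http://localhost:5000")  # relay server that pushes events to the FE

def consume_stream(stream):
    """Consume an iterable of text chunks from the LLM call."""
    full_message = ""
    for chunk in stream:
        full_message += chunk                      # aggregate in memory
        sio.emit("llm_chunk", {"content": chunk})  # live update for the FE
    sio.emit("llm_message", {"content": full_message})
    return full_message  # complete reply for the intended recipient agent
```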
Please let me know what you think
Hi!
I've just created PR #465 to introduce streaming support in a straightforward and non-intrusive manner.
Usage:
```python
import autogen

llm_config = {
    "config_list": config_list,
    # Enable/disable streaming (defaults to False)
    "stream": True,
}
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)
```
Please, feel free to review the code and make suggestions.
Hi @Alvaromah , thank you for your contribution, which has enabled autogen to stream in the terminal. However, I would like to ask if there's a way to support streaming simultaneously to an external output? I'm asking this because if autogen is integrated with other UI frameworks, it would be desirable to see a streaming effect. I've tried modifying some parts of the source code to use 'yield', but it doesn't seem to have made any difference. Thank you for your help.
You'll need to use websockets
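A minimal sketch of that idea, assuming the third-party `websockets` package; `stream_reply` is a hypothetical placeholder for whatever produces the streamed chunks on the agent side:

```python
# Sketch only: push streamed chunks to a UI client over a websocket.
import asyncio
import websockets

async def stream_reply(task: str):
    """Hypothetical stand-in for the agent's streamed LLM reply."""
    for word in f"Echoing: {task}".split():
        await asyncio.sleep(0.1)
        yield word + " "

async def handler(websocket, path=None):
    """For each client request, forward the reply chunk by chunk."""
    task = await websocket.recv()
    async for chunk in stream_reply(task):
        await websocket.send(chunk)

async def main() -> None:
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # serve until cancelled

asyncio.run(main())
```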
Instead of just stream: True, maybe also allow stream: callable, where True would point to sys.stdout.write by default. It is a small change that would make a big difference.
@Joaoprcf Good suggestion! Please feel free to make a PR and add @Alvaromah @ragyabraham as a reviewer.
I think it's best to add a new parameter. Let's not introduce a weird boolean that isn't a boolean. What we should probably have is:
```python
llm_config = {
    "config_list": config_list,
    "stream": True,
    "response_callback": my_cb_func,
}
```
which would fire for both stream: True (chunks) and stream: False (full message). I don't think there's a need to separate them since multiple chunks give you the full message anyway.
Note: must ensure that the finished message (stop_reason or whatever the return model calls it) is always passed to the callback.
Also, see #1118 about function streams.
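A minimal sketch of how such a hypothetical `response_callback` could behave under the proposal above (the parameter does not exist in AutoGen; the two-argument signature is an assumption):

```python
# Hypothetical: fires with each chunk when stream=True, or once with the full
# message when stream=False; `finished` marks the final call (stop reason seen).
def my_cb_func(content: str, finished: bool) -> None:
    print(content, end="", flush=True)
    if finished:
        print()  # response complete

llm_config = {
    "config_list": config_list,  # the usual config list, defined elsewhere
    "stream": True,
    "response_callback": my_cb_func,
}
```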
Any idea when these PRs can land?
@matbee-eth if you'd like to help accelerate it, please participate in #1551
It would be great to have streaming enabled so that the UX is better for end-user production applications.
It's a big value for end users of the Agents
@vinodvarma24 streaming via websockets is implemented in #1551, please take a look at it and let us know what you think
I had a basic question - I see this can stream the actual interactions between the agents. Does it also stream the actual response from the LLMs? And is the way to do it the same for both?
@vistaarjuneja,
Are you using the updated API?
For example, see the documentation here on how to stream both the agent response AND the LLM response: https://microsoft.github.io/autogen/dev/user-guide/agentchat-user-guide/tutorial/agents.html#streaming-messages
thanks @victordibia - is this only for 0.4? Is there some way to do this in 0.2 as well?
The 0.2 architecture is not well suited to this. Consider using 0.4. It's a single line of code to do this: https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/tutorial/agents.html#streaming-tokens
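For reference, the single line mentioned above looks roughly like this in the 0.4 AgentChat API (a sketch based on the linked tutorial; check package and parameter names such as `model_client_stream` against the current docs):

```python
# Token streaming with the 0.4 AgentChat API, per the linked tutorial.
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    agent = AssistantAgent(
        name="assistant",
        model_client=model_client,
        model_client_stream=True,  # stream tokens as the LLM generates them
    )
    # Console renders the streamed messages/tokens in the terminal.
    await Console(agent.run_stream(task="Summarize the benefits of streaming."))

asyncio.run(main())
```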