[Roadmap] Streaming support
Stream messages from the agent's reply.
### Tasks
- [ ] https://github.com/microsoft/autogen/issues/633
- [ ] Move print out of client.py for streaming
- [ ] Add support to streaming function call
- [ ] https://github.com/microsoft/autogen/pull/1551
No one is working on this issue as far as I know. Any volunteer to take the lead?
A bit of criticism... Is Streaming needed?
AutoGen doesn't seem to be a user-facing tool where UI/UX is a top priority, while streaming mainly addresses user-experience concerns (interactivity and visible progress). After all, the performance (total time to complete the request) is the same whether streaming is enabled or not.
On the other hand, implementing streaming looks like a large piece of work, and many parts will be affected and become harder to maintain:
- Working with chunks in an async manner might require creating two flavours of every API that uses completions
- Cost accounting will be broken right away: streaming APIs (at least OpenAI's) don't return token stats for streamed responses. OpenAI suggests using tiktoken to do your own accounting, though in my experience tiktoken always shows a ~1% discrepancy from what OpenAI reports, so costing will become less accurate (a minimal sketch of such counting follows below).
IMO, it is a Large piece of work for a Small value (speaking in terms of S-M-L sizing and prioritizing).
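For context, the client-side accounting OpenAI suggests looks roughly like this; a minimal sketch using `tiktoken`, where the chunk list is assumed to be the streamed completion text:

```python
# Sketch of tiktoken-based token counting for a streamed response.
# Counts typically differ slightly (~1%) from OpenAI's reported usage.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_streamed_tokens(chunks: list[str]) -> int:
    """Approximate the completion token count from streamed text chunks."""
    return len(enc.encode("".join(chunks)))
```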
@sonichi I'm actually working on this already. Happy to pick this up.
Thanks. One thing to pay attention to is #203 . If your work uses the streaming feature from openai, better target the newer version. I'm currently working on #203 without the streaming part.
@sonichi I've also done some work to enable messages to be sent over sockets rather than printing in a terminal. Do you think that is something that I should open a PR for? Would that be useful?
That sounds useful because I've heard many different people trying to do that. @victordibia @AaronWard may be interested and feel free to pull others who are also interested.
ok sounds good. I'll start a new issue and push the changes I have so far. @victordibia @AaronWard please refer to #394
I replied on #394 . One thing to note here is that the work by @ragyabraham is more focused on streaming completed responses from each agent within an active conversation, not directly streaming the tokens from each agent as they are generated by an llm. The latter is more involved and has unclear use cases/benefits (as mentioned by @maxim-saplin above). @ragyabraham kindly confirm that this is your focus here?
Hi @victordibia, I intend to utilise streaming to chunk all responses from the LLM. The approach we are thinking of is:
- we chunk the response and utilise some sort of messaging framework to emit messages to a server (e.g. socketio that sends messages to the FE)
- chunks are aggregated in memory (e.g. `string += chunk`); once all chunks have been consumed, the complete message can be sent to the intended team member/recipient (a minimal sketch of this approach follows below)
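A minimal sketch of that approach, assuming the third-party `python-socketio` client and a relay server at `http://localhost:5000` that forwards events to the FE; the event names (`llm_chunk`, `llm_message`) and the `consume_stream` helper are illustrative, not an AutoGen API:

```python
# Sketch only: emit each streamed chunk over Socket.IO, aggregate in memory,
# and hand the full message to the recipient once the stream is exhausted.
import socketio

sio = socketio.Client()
sio.connect("http://localhost:5000")  # relay server that pushes events to the FE

def consume_stream(stream):
    """Consume an iterable of text chunks from the LLM call."""
    full_message = ""
    for chunk in stream:
        full_message += chunk                      # aggregate in memory
        sio.emit("llm_chunk", {"content": chunk})  # live update for the FE
    sio.emit("llm_message", {"content": full_message})
    return full_message  # complete reply for the intended recipient agent
```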
Please let me know what you think
Hi!
I've just created PR #465 to introduce streaming support in a straightforward and non-intrusive manner.
Usage:
```python
import autogen

llm_config = {
    "config_list": config_list,
    # Enable/disable streaming (defaults to False)
    "stream": True,
}
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)
```
Please, feel free to review the code and make suggestions.
Hi @Alvaromah , thank you for your contribution, which has enabled autogen to stream in the terminal. However, I would like to ask if there's a way to support streaming simultaneously to an external output? I'm asking this because if autogen is integrated with other UI frameworks, it would be desirable to see a streaming effect. I've tried modifying some parts of the source code to use 'yield', but it doesn't seem to have made any difference. Thank you for your help.
You'll need to use websockets
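A minimal sketch of that idea, assuming the third-party `websockets` package; `stream_reply` is a hypothetical placeholder for whatever produces the streamed chunks on the agent side:

```python
# Sketch only: push streamed chunks to a UI client over a websocket.
import asyncio
import websockets

async def stream_reply(task: str):
    """Hypothetical stand-in for the agent's streamed LLM reply."""
    for word in f"Echoing: {task}".split():
        await asyncio.sleep(0.1)
        yield word + " "

async def handler(websocket, path=None):
    """For each client request, forward the reply chunk by chunk."""
    task = await websocket.recv()
    async for chunk in stream_reply(task):
        await websocket.send(chunk)

async def main() -> None:
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # serve until cancelled

asyncio.run(main())
```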
Instead of just stream: True, maybe also allow stream: callable, where True would point to sys.stdout.write by default. It is a small change that would make a big difference.
@Joaoprcf Good suggestion! Please feel free to make a PR and add @Alvaromah @ragyabraham as a reviewer.
I think it's best to add a new parameter. Let's not introduce a weird boolean that isn't a boolean. What we should probably have is:
```python
llm_config = {
    "config_list": config_list,
    "stream": True,
    "response_callback": my_cb_func,
}
```
which would fire for both stream: True (chunks) and stream: False (full message). I don't think there's a need to separate them since multiple chunks give you the full message anyway.
Note: must ensure that the finished message (stop_reason or whatever the return model calls it) is always passed to the callback.
Also, see #1118 about function streams.
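A minimal sketch of how such a hypothetical `response_callback` could behave under the proposal above (the parameter does not exist in AutoGen; the two-argument signature is an assumption):

```python
# Hypothetical: fires with each chunk when stream=True, or once with the full
# message when stream=False; `finished` marks the final call (stop reason seen).
def my_cb_func(content: str, finished: bool) -> None:
    print(content, end="", flush=True)
    if finished:
        print()  # response complete

llm_config = {
    "config_list": config_list,  # the usual config list, defined elsewhere
    "stream": True,
    "response_callback": my_cb_func,
}
```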
Any idea when these PRs can land?
@matbee-eth if you'd like to help accelerate it, please participate in #1551
It would be great to have streaming enabled so that the UX is better for end-user production applications.
It's a big value for end users of the Agents
@vinodvarma24 streaming via websockets is implemented in #1551, please take a look at it and let us know what you think
I had a basic question - I see this can stream the actual interactions between the agents. Does it also stream the actual response from the LLMs? And is the way to do it the same for both?
@vistaarjuneja,
Are you using the updated API?
For example, see the documentation here on how to stream both the agent response AND the LLM response: https://microsoft.github.io/autogen/dev/user-guide/agentchat-user-guide/tutorial/agents.html#streaming-messages
thanks @victordibia - is this only for 0.4? Is there some way to do this in 0.2 as well?
The 0.2 architecture is not well suited to this. Consider using 0.4. It's a single line of code to do this: https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/tutorial/agents.html#streaming-tokens
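For reference, the single line mentioned above looks roughly like this in the 0.4 AgentChat API (a sketch based on the linked tutorial; check package and parameter names such as `model_client_stream` against the current docs):

```python
# Token streaming with the 0.4 AgentChat API, per the linked tutorial.
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    agent = AssistantAgent(
        name="assistant",
        model_client=model_client,
        model_client_stream=True,  # stream tokens as the LLM generates them
    )
    # Console renders the streamed messages/tokens in the terminal.
    await Console(agent.run_stream(task="Summarize the benefits of streaming."))

asyncio.run(main())
```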