
Catch token count issue while streaming with customized models

Open BeibinLi opened this issue 1 year ago • 8 comments

If llama, llava, phi, or other customized models are used with streaming (stream=True), the current design crashes on the token-count step after the response has already been fetched.

A warning is enough in this case, just as in the non-streaming code path.
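
For context, the intended behavior is roughly the following (a minimal sketch, not the actual patch in autogen/oai/client.py; the helper name is made up and it assumes the count comes from tiktoken):

import logging

import tiktoken

logger = logging.getLogger(__name__)

def count_streamed_tokens(model: str, text: str) -> int:
    # Customized models such as llama, llava, or phi are unknown to tiktoken,
    # so the encoding lookup raises KeyError; warn and fall back instead of
    # crashing after the streamed response has already been fetched.
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        logger.warning(
            "Model %s not found in tiktoken; reporting 0 tokens for the streamed response.", model
        )
        return 0
    return len(encoding.encode(text))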

Why are these changes needed?

Related issue number

Checks

  • [ ] I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
  • [ ] I've added tests (if relevant) corresponding to the changes introduced in this PR.
  • [ ] I've made sure all auto checks have passed.

BeibinLi avatar Jul 28 '24 23:07 BeibinLi

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 21.29%. Comparing base (6aaa238) to head (7d1a110). Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| autogen/oai/client.py | 0.00% | 5 Missing :warning: |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3241       +/-   ##
===========================================
- Coverage   33.24%   21.29%   -11.95%     
===========================================
  Files          99       99               
  Lines       11016    11020        +4     
  Branches     2365     2537      +172     
===========================================
- Hits         3662     2347     -1315     
- Misses       7026     8507     +1481     
+ Partials      328      166      -162     
| Flag | Coverage Δ |
|---|---|
| unittests | 21.26% <0.00%> (-11.99%) :arrow_down: |

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jul 28 '24 23:07 codecov-commenter

Hey @BeibinLi, can you share an llm_config that is crashing? Is it using the standard OpenAI client class?

marklysze avatar Jul 29 '24 19:07 marklysze

@marklysze Yes, let's say I am using Ollama with "stream=True". Here is a small snippet to reproduce the error:

from autogen import AssistantAgent, UserProxyAgent

config_list = [
    {
        "model": "llama3.1:70b",
        "api_key": "ollama",
        "base_url": "http://127.0.0.1:13579/v1"
    }
]
llm_config = {"config_list": config_list, "stream": True}
assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(name="user", human_input_mode="NEVER", max_consecutive_auto_reply=1)
chat_res = user_proxy.initiate_chat(assistant, message="How are you")

BeibinLi avatar Jul 30 '24 17:07 BeibinLi

> @marklysze Yes, let's say I am using Ollama with "stream=True". Here is a small snippet to reproduce the error:

Thanks @BeibinLi, I changed the config a bit (api_key to api_type, and since I can't run 70b I'm running 8b) to use the Ollama client from PR #3056:

from autogen import AssistantAgent, UserProxyAgent

config_list = [
    {
        "model": "llama3.1:8b-instruct-q8_0",
        "api_type": "ollama",
        "client_host": "http://192.168.0.115:11434",
    }
]
llm_config = {"config_list": config_list, "stream": True}
assistant = AssistantAgent(name="assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(name="user", human_input_mode="NEVER", max_consecutive_auto_reply=1)
chat_res = user_proxy.initiate_chat(assistant, message="How are you")

And it runs through okay for me.

For your original config, is that trying to use Ollama with the default client?

marklysze avatar Jul 30 '24 20:07 marklysze

@marklysze Yes, I was using the original client, and your "api_type" hack works. Would it also work for LM Studio or other local hosts?

BeibinLi avatar Jul 30 '24 22:07 BeibinLi

> @marklysze Yes, I was using the original client, and your "api_type" hack works. Would it also work for LM Studio or other local hosts?

@BeibinLi, I don't think the Ollama REST API is fully compatible with the OpenAI API. The Ollama PR #3056 uses the Ollama python library instead.

So, I'm not surprised the AutoGen default client will fail when trying to use Ollama's REST API... do you think we should try to cater for this and catch the error? I'm thinking we can steer people to use the Ollama client class (e.g. pip install pyautogen[ollama]) when it's ready.
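
Side note for anyone following along: part of why a dedicated client sidesteps the token-count problem is that the ollama Python library reports token counts itself (prompt_eval_count / eval_count) with the final streamed chunk, so nothing has to be inferred from the model name. A rough sketch, assuming pip install ollama and a local server, not the code from PR #3056:

import ollama

# Stream a chat completion directly with the ollama library; the final chunk
# (done == True) carries the token counts alongside the generated text.
stream = ollama.chat(
    model="llama3.1:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "How are you"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
    if chunk.get("done"):
        print("\nprompt tokens:", chunk.get("prompt_eval_count"), "completion tokens:", chunk.get("eval_count"))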

marklysze avatar Jul 30 '24 22:07 marklysze

We don't provide clients for 01/Yi/LM Studio/TogetherAI and many other customized models, and they all go through the classic OAI client by default. Unless we want to reroute all of that traffic to the Ollama client, developers have to handle the streaming issue themselves. Alternatively, it is also fine to leave the exception to developers so they can create their own clients.
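
For reference, the "create their own clients" route goes through the custom model client protocol; a minimal sketch (the class name, its body, and the dummy response are illustrative, not a working LM Studio integration):

from types import SimpleNamespace

from autogen import AssistantAgent

class LMStudioClient:  # illustrative name
    def __init__(self, config, **kwargs):
        self.model = config["model"]

    def create(self, params):
        # Call the local server however you like, handle stream=True and token
        # counting here yourself, and wrap the result in an OpenAI-like shape.
        message = SimpleNamespace(content="Hello from a local model.", function_call=None)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)], model=self.model)

    def message_retrieval(self, response):
        return [choice.message.content for choice in response.choices]

    def cost(self, response):
        return 0

    @staticmethod
    def get_usage(response):
        return {}  # report prompt/completion tokens here if you track them

config_list = [{"model": "local-model", "model_client_cls": "LMStudioClient"}]
assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
assistant.register_model_client(model_client_cls=LMStudioClient)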

@sonichi @qingyun-wu What do you think about this design issue?

BeibinLi avatar Jul 30 '24 23:07 BeibinLi

> We don't provide clients for 01/Yi/LM Studio/TogetherAI and many other customized models, and they all go through the classic OAI client by default. Unless we want to reroute all of that traffic to the Ollama client, developers have to handle the streaming issue themselves. Alternatively, it is also fine to leave the exception to developers so they can create their own clients.
>
> @sonichi @qingyun-wu What do you think about this design issue?

Yes, I wouldn't recommend using the Ollama client for anything other than Ollama (it will have its own idiosyncrasies). Just a note that we do have a Together.AI client class, but you are right, anything we don't have a dedicated client for will go through the default OAI one.

marklysze avatar Aug 01 '24 04:08 marklysze