[Bug]: groq models do not support streaming when in JSON mode
What happened?
It appears that with LiteLLM version 1.35.38 (I have not upgraded to the latest because of other issues with Ollama JSON mode), I am unable to use groq models in JSON mode with streaming enabled. I have a minimal notebook that reproduces this issue on GitHub gist: https://gist.github.com/ericmjl/6f3e2cbbfcf26a8f3334a58af6a76f63
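For reference, a minimal repro sketch along the lines of the gist linked above (the model name and prompt here are illustrative, not taken from the notebook):

# Minimal sketch of the failing call: groq + JSON mode + streaming.
# Model name and prompt are placeholders; adjust to your setup.
from litellm import completion

response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "Reply with a JSON object."}],
    response_format={"type": "json_object"},  # JSON mode
    stream=True,                              # streaming is what triggers the error
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")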
Relevant log output
You can find the notebook here: https://gist.github.com/ericmjl/6f3e2cbbfcf26a8f3334a58af6a76f63
Twitter / LinkedIn details
@ericmjl
On the latest version I get this error @ericmjl - would you expect litellm to fake the streaming response?
GroqException - Error code: 400 - {'error': {'message': 'response_format` does not support streaming', 'type': 'invalid_request_error'}}
@ishaan-jaff thinking about the problem from your perspective as a library maintainer, faking the streaming response might be good for the LiteLLM user experience, but it'd also add a special case for you all to handle. I would love to see the streaming response faked (Groq is fast enough that, for all practical purposes, just waiting for groq to return the full text is almost as good as seeing the streaming response), though I am cognizant of the extra burden it might put on you guys.
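In the meantime, one possible client-side workaround (just a sketch, not LiteLLM behavior; `fake_stream_json` is a hypothetical helper name): make the non-streaming call and wrap the full response in a generator so downstream code that expects an iterator keeps working.

# Sketch of "fake streaming" on the client side: call groq without streaming
# when JSON mode is requested, then yield the full text as a single chunk.
# fake_stream_json is a hypothetical helper, not part of LiteLLM.
from litellm import completion

def fake_stream_json(model, messages):
    response = completion(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
        stream=False,  # avoid the groq 400 by not streaming in JSON mode
    )
    # Yield the complete content once so callers can still iterate over it.
    yield response.choices[0].message.content

for chunk in fake_stream_json(
    "groq/llama3-8b-8192",
    [{"role": "user", "content": "Return a JSON object describing the weather."}],
):
    print(chunk, end="")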
I am not able to get litellm to send groq response_format at all. Have you run into that issue as well (streaming aside) @ericmjl ?
what error do you see when sending response_format @misterfancysocks ?
are you on the latest litellm version ?
Hey @ishaan-jaff
I'm not sure how to figure out which version of litellm (docker) I'm using, but here is the info for the image:
ghcr.io/berriai/litellm main-stable 96ca897120c4 3 weeks ago 1.37GB
Here is my code:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pydantic")

import litellm
from litellm import completion
from dotenv import load_dotenv
import os

load_dotenv(os.path.expanduser('~/code/consumio/consumioish/.env'))

# os.environ['LITELLM_LOG'] = 'DEBUG'
litellm.set_verbose = True
litellm.api_base = "http://localhost:4000"
litellm.api_key = os.getenv("LITELLM_API_KEY")
litellm.success_callback = ["langfuse"]

## set ENV variables
response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "hows it going? "}],
    response_format={"type": "json_object"},
    # stream=False
)

print(response.choices[0].message.content)
What I'm seeing is that when I submit a request with 'response_format', the logs acknowledge that I've requested it, but the actual curl request does not include it.
Request to litellm:
litellm.completion(model='groq/llama3-8b-8192', messages=[{'role': 'user', 'content': 'hows it going? '}], response_format={'type': 'json_object'})

18:49:27 - LiteLLM:WARNING: utils.py:316 - `litellm.set_verbose` is deprecated. Please set `os.environ['LITELLM_LOG'] = 'DEBUG'` for debug logs.
SYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache')['no-cache']: False
Final returned optional params: {'extra_body': {}}

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/chat/completions \
-H 'Content-Type: *****' -H 'Authorization: Bearer gsk_VAWbOVuF********************************************' \
-d '{'model': 'llama3-8b-8192', 'messages': [{'role': 'user', 'content': 'hows it going? '}], 'stream': False}'

RAW RESPONSE:
{"id": "chatcmpl-7e51986f-c022-4eaa-9f8b-42f7afbdc6fb", "object": "chat.completion", "created": 1736038167, "model": "llama3-8b-8192", "choices": [{"index": 0, "message": {"role": "assistant", "content": "I'm just an AI, I don't have feelings or emotions like humans do, but I'm functioning properly and ready to assist you with any questions or tasks you may have! How can I help you today?"}, "logprobs": null, "finish_reason": "stop"}], "usage": {"queue_time": 0.018435816, "prompt_tokens": 16, "prompt_time": 0.002348172, "completion_tokens": 44, "completion_time": 0.036666667, "total_tokens": 60, "total_time": 0.039014839}, "system_fingerprint": "fp_a97cfe35ae", "x_groq": {"id": "req_01jg******************"}}

Returned custom cost for model=groq/llama3-8b-8192 - prompt_tokens_cost_usd_dollar: 8e-07, completion_tokens_cost_usd_dollar: 3.52e-06
reaches langfuse for success logging!
Returned custom cost for model=groq/llama3-8b-8192 - prompt_tokens_cost_usd_dollar: 8e-07, completion_tokens_cost_usd_dollar: 3.52e-06

I'm just an AI, I don't have feelings or emotions like humans do, but I'm functioning properly and ready to assist you with any questions or tasks you may have! How can I help you today?
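One way to sanity-check this (a sketch; assumes your litellm version exposes get_supported_openai_params) is to ask litellm which OpenAI-style params it will map for the groq provider. If response_format is not in the returned list, that would explain why it never shows up in the outgoing curl above.

# Sketch: inspect which OpenAI-style params litellm will map for a groq model.
# Assumes litellm.get_supported_openai_params is available in your version.
import litellm

params = litellm.get_supported_openai_params(model="groq/llama3-8b-8192")
print("response_format supported:", "response_format" in params)
print(params)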
I see the issue: we handle structured output for groq by leveraging their tool calling. Our test missed the JSON mode scenario.
here's the issue - https://github.com/BerriAI/litellm/blob/d74fa394543df9b38eec7ee9b0b6e440e3f2db07/litellm/llms/groq/chat/transformation.py#L153
will push a fix asap
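For anyone curious, here is a rough illustration of that technique (JSON mode emulated via a forced tool call). This is only a sketch of the general idea, not LiteLLM's actual transformation code; the "respond_in_json" tool and its schema are made up for the example.

# Rough sketch: emulate JSON mode on groq by forcing a tool call and reading
# the JSON arguments back. NOT LiteLLM's implementation; tool name/schema are
# hypothetical.
import json
from litellm import completion

tools = [{
    "type": "function",
    "function": {
        "name": "respond_in_json",
        "description": "Return the final answer as a JSON object.",
        "parameters": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}]

response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "hows it going?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "respond_in_json"}},
)

# The JSON lives in the tool call's arguments rather than in message.content.
args = response.choices[0].message.tool_calls[0].function.arguments
print(json.loads(args))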
You are the best, thank you!
@krrishdholakia I just updated to 1.56.10 and it didn't work for me. Looking at the diff, it looks like there was just a test that was added.
It's not on v1.56.10. The fix is on main. Will be on v1.57.0
Got it. Do you have a rough ETA?
Should be out today hopefully. I believe we were just seeing some vertex rate limit errors causing the test to fail
hey @krrishdholakia did this ever get deployed to the stable or latest images?
@p-c-mo Yes this looks like it was merged in a while ago. Are you still seeing this issue?
@krrishdholakia I feel dumb asking this, but I am running this via docker-compose and can't figure out how to see the request headers that the litellm proxy is sending.
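Not a definitive answer, but a sketch of one way to get more visibility: set the LITELLM_LOG env var (the same one the deprecation warning in the log above points to) to DEBUG on the proxy container and read the container logs with docker compose logs. Service and image names below are placeholders, and how much of the outgoing request is printed may depend on your litellm version.

# Sketch: enable debug logging for the proxy container so outgoing requests
# show up in `docker compose logs`. Service/image names are placeholders.
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      - LITELLM_LOG=DEBUG   # same env var the deprecation warning above mentions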
following up @krrishdholakia