[Bug]: groq models do not support streaming when in JSON mode
What happened?
It appears that with LiteLLM version 1.35.38 (I have not upgraded to the latest because of other issues with Ollama JSON mode), I am unable to use groq models in JSON mode with streaming enabled. I have a minimal notebook that reproduces this issue on GitHub gist: https://gist.github.com/ericmjl/6f3e2cbbfcf26a8f3334a58af6a76f63
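For reference, a minimal repro sketch along the lines of the gist linked above (the model name and prompt here are illustrative, not taken from the notebook):

# Minimal sketch of the failing call: groq + JSON mode + streaming.
# Model name and prompt are placeholders; adjust to your setup.
from litellm import completion

response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "Reply with a JSON object."}],
    response_format={"type": "json_object"},  # JSON mode
    stream=True,                              # streaming is what triggers the error
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")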
Relevant log output
You can find the notebook here: https://gist.github.com/ericmjl/6f3e2cbbfcf26a8f3334a58af6a76f63
Twitter / LinkedIn details
@ericmjl
On the latest version I get this error @ericmjl - would you expect litellm to fake the streaming response?
GroqException - Error code: 400 - {'error': {'message': 'response_format` does not support streaming', 'type': 'invalid_request_error'}}
@ishaan-jaff thinking about the problem from your perspective as a library maintainer, faking the streaming response might be good for the LiteLLM user experience, but it'd also add a special case for you all to handle. I would love to see the streaming response faked (Groq is fast enough that, for all practical purposes, just waiting for groq to return the full text is almost as good as seeing the streaming response), though I am cognizant of the extra burden it might put on you guys.
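In the meantime, one possible client-side workaround (just a sketch, not LiteLLM behavior; `fake_stream_json` is a hypothetical helper name): make the non-streaming call and wrap the full response in a generator so downstream code that expects an iterator keeps working.

# Sketch of "fake streaming" on the client side: call groq without streaming
# when JSON mode is requested, then yield the full text as a single chunk.
# fake_stream_json is a hypothetical helper, not part of LiteLLM.
from litellm import completion

def fake_stream_json(model, messages):
    response = completion(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
        stream=False,  # avoid the groq 400 by not streaming in JSON mode
    )
    # Yield the complete content once so callers can still iterate over it.
    yield response.choices[0].message.content

for chunk in fake_stream_json(
    "groq/llama3-8b-8192",
    [{"role": "user", "content": "Return a JSON object describing the weather."}],
):
    print(chunk, end="")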
I am not able to get litellm to send groq response_format at all. Have you run into that issue as well (streaming aside) @ericmjl ?
what error do you see when sending response_format @misterfancysocks ?
are you on the latest litellm version ?
Hey @ishaan-jaff
I'm not sure how to figure out which version of litellm (docker) I'm using, but here is the info for the image:
ghcr.io/berriai/litellm main-stable 96ca897120c4 3 weeks ago 1.37GB
Here is my code:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pydantic")

import litellm
from litellm import completion
from dotenv import load_dotenv
import os

load_dotenv(os.path.expanduser('~/code/consumio/consumioish/.env'))

# os.environ['LITELLM_LOG'] = 'DEBUG'
litellm.set_verbose = True
litellm.api_base = "http://localhost:4000"
litellm.api_key = os.getenv("LITELLM_API_KEY")
litellm.success_callback = ["langfuse"]

## set ENV variables
response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "hows it going? "}],
    response_format={"type": "json_object"},
    # stream=False
)

print(response.choices[0].message.content)
What I'm seeing is that when I submit a request with 'response_format', the logs acknowledge that I've requested it, but the actual curl request does not include it.
Request to litellm:
litellm.completion(model='groq/llama3-8b-8192', messages=[{'role': 'user', 'content': 'hows it going? '}], response_format={'type': 'json_object'})

18:49:27 - LiteLLM:WARNING: utils.py:316 - `litellm.set_verbose` is deprecated. Please set `os.environ['LITELLM_LOG'] = 'DEBUG'` for debug logs.
SYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache')['no-cache']: False
Final returned optional params: {'extra_body': {}}

POST Request Sent from LiteLLM:
curl -X POST \
https://api.groq.com/openai/v1/chat/completions \
-H 'Content-Type: *****' -H 'Authorization: Bearer gsk_VAWbOVuF********************************************' \
-d '{'model': 'llama3-8b-8192', 'messages': [{'role': 'user', 'content': 'hows it going? '}], 'stream': False}'

RAW RESPONSE:
{"id": "chatcmpl-7e51986f-c022-4eaa-9f8b-42f7afbdc6fb", "object": "chat.completion", "created": 1736038167, "model": "llama3-8b-8192", "choices": [{"index": 0, "message": {"role": "assistant", "content": "I'm just an AI, I don't have feelings or emotions like humans do, but I'm functioning properly and ready to assist you with any questions or tasks you may have! How can I help you today?"}, "logprobs": null, "finish_reason": "stop"}], "usage": {"queue_time": 0.018435816, "prompt_tokens": 16, "prompt_time": 0.002348172, "completion_tokens": 44, "completion_time": 0.036666667, "total_tokens": 60, "total_time": 0.039014839}, "system_fingerprint": "fp_a97cfe35ae", "x_groq": {"id": "req_01jg******************"}}

Returned custom cost for model=groq/llama3-8b-8192 - prompt_tokens_cost_usd_dollar: 8e-07, completion_tokens_cost_usd_dollar: 3.52e-06
reaches langfuse for success logging!
Returned custom cost for model=groq/llama3-8b-8192 - prompt_tokens_cost_usd_dollar: 8e-07, completion_tokens_cost_usd_dollar: 3.52e-06

I'm just an AI, I don't have feelings or emotions like humans do, but I'm functioning properly and ready to assist you with any questions or tasks you may have! How can I help you today?
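One way to sanity-check this (a sketch; assumes your litellm version exposes get_supported_openai_params) is to ask litellm which OpenAI-style params it will map for the groq provider. If response_format is not in the returned list, that would explain why it never shows up in the outgoing curl above.

# Sketch: inspect which OpenAI-style params litellm will map for a groq model.
# Assumes litellm.get_supported_openai_params is available in your version.
import litellm

params = litellm.get_supported_openai_params(model="groq/llama3-8b-8192")
print("response_format supported:", "response_format" in params)
print(params)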
I see the issue: we handle structured output for groq by leveraging their tool calling. Our test missed the JSON mode scenario.
here's the issue - https://github.com/BerriAI/litellm/blob/d74fa394543df9b38eec7ee9b0b6e440e3f2db07/litellm/llms/groq/chat/transformation.py#L153
will push a fix asap
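For anyone curious, here is a rough illustration of that technique (JSON mode emulated via a forced tool call). This is only a sketch of the general idea, not LiteLLM's actual transformation code; the "respond_in_json" tool and its schema are made up for the example.

# Rough sketch: emulate JSON mode on groq by forcing a tool call and reading
# the JSON arguments back. NOT LiteLLM's implementation; tool name/schema are
# hypothetical.
import json
from litellm import completion

tools = [{
    "type": "function",
    "function": {
        "name": "respond_in_json",
        "description": "Return the final answer as a JSON object.",
        "parameters": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}]

response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "hows it going?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "respond_in_json"}},
)

# The JSON lives in the tool call's arguments rather than in message.content.
args = response.choices[0].message.tool_calls[0].function.arguments
print(json.loads(args))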
You are the best, thank you!
@krrishdholakia I just updated to 1.56.10 and it didn't work for me. Looking at the diff, it looks like there was just a test that was added.
It's not on v1.56.10. The fix is on main. Will be on v1.57.0
Got it. Do you have a rough ETA?
Should be out today hopefully. I believe we were just seeing some vertex rate limit errors causing the test to fail
hey @krrishdholakia did this ever get deployed to the stable or latest images?
@p-c-mo Yes this looks like it was merged in a while ago. Are you still seeing this issue?
@krrishdholakia I feel dumb asking this, but I am running this via docker-compose and can't figure out how to see the request headers that the litellm proxy is sending.
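Not a definitive answer, but a sketch of one way to get more visibility: set the LITELLM_LOG env var (the same one the deprecation warning in the log above points to) to DEBUG on the proxy container and read the container logs with docker compose logs. Service and image names below are placeholders, and how much of the outgoing request is printed may depend on your litellm version.

# Sketch: enable debug logging for the proxy container so outgoing requests
# show up in `docker compose logs`. Service/image names are placeholders.
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      - LITELLM_LOG=DEBUG   # same env var the deprecation warning above mentions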
following up @krrishdholakia