gemini-openai-proxy

[Bug] No support for a single string input with vision models or for multiple text parts with non-vision models

Open · ZihaoZhou opened this issue 6 months ago • 1 comment

First, thanks to @zhu327 and @ekatiyar for your great work.

I am using this fork of the repository and noticed that

curl http://localhost:8080/v1/chat/completions \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $GOOGLE_API_KEY" \
 -d '{
     "model": "gemini-1.5-vision-latest",
     "messages": [{"role": "user", "content": "Say this is a test."}],
     "temperature": 0.7
 }'
{"code":400,"message":"message.multiContent: json.Unmarshal: json: cannot unmarshal string into Go value of type []openai.ChatMessagePart","type":""}

does not work, because the proxy requires the content sent to a vision model to be an array of parts and rejects a plain string. In practice, both OpenAI and Gemini vision models accept input without images. I also noticed that

curl http://localhost:8080/v1/chat/completions \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $GOOGLE_API_KEY" \
 -d '{
     "model": "gemini-1.0-pro-latest",
     "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Paraphrase this sentence."},
        {"type": "text", "text": "Say this is a test."}
     ]}],
     "temperature": 0.7
 }'

does not work, because the proxy requires the content sent to a text model to be a single part. In practice, both OpenAI and Gemini models accept multiple text parts.

Since gemini-1.0-pro-vision was already deprecated last month, I don't see why we cannot simply merge toStringGenaiContent() and toVisionGenaiContent() into one function that handles all forms of input. I implemented this in my fork and it fixes both bugs. It also removes the need for the extra environment variable that toggles between Gemini Flash and Pro Vision.

ZihaoZhou · Aug 07 '24 20:08