gemini-openai-proxy
[Bug] Single-string content is rejected for vision models, and multiple text parts are rejected for non-vision models
First, thanks to @zhu327 and @ekatiyar for your great work.

I am using this fork of the repository and noticed that the following request fails:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GOOGLE_API_KEY" \
  -d '{
    "model": "gemini-1.5-vision-latest",
    "messages": [{"role": "user", "content": "Say this is a test."}],
    "temperature": 0.7
  }'
```

```json
{"code":400,"message":"message.multiContent: json.Unmarshal: json: cannot unmarshal string into Go value of type []openai.ChatMessagePart","type":""}
```
It fails because the proxy forces vision input to be multi-part, but in practice both OpenAI and Gemini vision models accept a plain string without any images (see the decoder sketch after the second example). I also noticed that this request fails:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GOOGLE_API_KEY" \
  -d '{
    "model": "gemini-1.0-pro-latest",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Paraphrase this sentence."},
      {"type": "text", "text": "Say this is a test."}
    ]}],
    "temperature": 0.7
  }'
```
It fails because the proxy forces non-vision text input to be a single part, but in practice both OpenAI and Gemini models accept multiple text parts.
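Both failures share a root cause: in the OpenAI chat API, `content` may be either a bare string or an array of typed parts, yet the proxy commits to one decoding path based on the model name. A tolerant decoder can accept both forms. Here is a minimal, self-contained sketch; the type and field names are illustrative, not the proxy's actual ones:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ChatMessagePart struct {
	Type string `json:"type"`
	Text string `json:"text,omitempty"`
}

type ChatMessage struct {
	Role  string
	Parts []ChatMessagePart
}

// UnmarshalJSON accepts both wire forms of "content": a bare string and an
// array of typed parts, normalizing both into a part list.
func (m *ChatMessage) UnmarshalJSON(data []byte) error {
	var raw struct {
		Role    string          `json:"role"`
		Content json.RawMessage `json:"content"`
	}
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	m.Role = raw.Role

	// First try the simple form: "content": "some string".
	var s string
	if err := json.Unmarshal(raw.Content, &s); err == nil {
		m.Parts = []ChatMessagePart{{Type: "text", Text: s}}
		return nil
	}
	// Fall back to the multi-part form: "content": [{...}, ...].
	return json.Unmarshal(raw.Content, &m.Parts)
}

func main() {
	for _, in := range []string{
		`{"role":"user","content":"Say this is a test."}`,
		`{"role":"user","content":[{"type":"text","text":"Paraphrase this sentence."},{"type":"text","text":"Say this is a test."}]}`,
	} {
		var msg ChatMessage
		if err := json.Unmarshal([]byte(in), &msg); err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", msg)
	}
}
```

With `content` normalized to a part list up front, the downstream conversion no longer needs to care which wire form the client sent.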
Since gemini-1.0-pro-vision was deprecated last month, I don't see why we cannot simply merge toStringGenaiContent() and toVisionGenaiContent() into one function that handles all forms of input. I implemented this in my fork and it fixes both bugs. It also removes the extra environment variable used to toggle between Gemini Flash and Pro Vision.
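For reference, a merged converter could look roughly like the sketch below. It assumes the normalized ChatMessage type from the previous sketch and the genai.Text / genai.ImageData helpers from github.com/google/generative-ai-go/genai; image decoding is elided here, and this is not the exact code in my fork:

```go
import (
	"fmt"

	"github.com/google/generative-ai-go/genai"
)

// toGenaiParts converts one normalized message into genai parts, regardless
// of whether the client sent a bare string or a multi-part array. It would
// replace both toStringGenaiContent() and toVisionGenaiContent().
func toGenaiParts(msg ChatMessage) ([]genai.Part, error) {
	parts := make([]genai.Part, 0, len(msg.Parts))
	for _, p := range msg.Parts {
		switch p.Type {
		case "text":
			parts = append(parts, genai.Text(p.Text))
		case "image_url":
			// Fetch or base64-decode the image here, then append it:
			//   parts = append(parts, genai.ImageData("png", data))
			return nil, fmt.Errorf("image decoding elided in this sketch")
		default:
			return nil, fmt.Errorf("unsupported content part type %q", p.Type)
		}
	}
	return parts, nil
}
```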