🗺️ Vision / multi-modal
GPT-4o introduces a new message content type that can contain images, encoded either as a URL or as a base64 string.
Example:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
```
https://platform.openai.com/docs/guides/vision
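The same request can also pass the image inline as a base64 data URL instead of a remote URL, which is the payload shape the base64 concerns below refer to. A minimal sketch (the local file path is illustrative):

```python
import base64

from openai import OpenAI

client = OpenAI()

# Illustrative local file; any JPEG/PNG works.
with open("boardwalk.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    # Inline data URL: mime type plus base64 payload
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
```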
Milestone 1
- Vision support in the Python instrumentations for llama-index, openai, gemini, and langchain
- Eliminate performance degradation from base64-encoded payloads by allowing users to opt out
- Preliminary set of config flags to mask inputs and outputs that could contain sensitive info (see the sketch after this list)
- Create examples
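As a rough illustration of the opt-out and masking items above, a sketch using TraceConfig-style options from the OpenInference configuration spec; the exact flag names, defaults, and the way the config is passed to an instrumentor may differ from what shipped:

```python
from openinference.instrumentation import TraceConfig
from openinference.instrumentation.openai import OpenAIInstrumentor

# Illustrative config: mask potentially sensitive values and keep base64
# image payloads out of traces so span sizes stay small.
config = TraceConfig(
    hide_inputs=False,               # set True to mask all input values
    hide_outputs=False,              # set True to mask all output values
    hide_input_images=True,          # drop inline image payloads from traces
    base64_image_max_length=32_000,  # truncate oversized base64 images
)

OpenAIInstrumentor().instrument(config=config)
```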
Milestone N
- Image synthesis APIs such as DALL-E
Tracing
- [x] #522
- [x] #523
- [x] #539
- [ ] #562
- [x] #582
- [x] #538
- [ ] [vision] [javascript] langchain image message parsing
- [x] #557
- [x] #560
- [ ] [multi-modal] scope out video / audio semantic conventions
- [x] #567
Instrumentation
Testing
- [x] #872
Image tracing
- [x] #707
- [x] #708
- [x] #709
- [x] #710
- [ ] #711
- [x] #631
- [x] #712
- [x] #713
- [x] #714
- [x] #715
- [x] #716
- [x] #717
Context Attributes
- [x] #718
- [x] #719
- [x] #720
- [x] #721
- [x] #722
- [x] #723
- [x] #724
- [x] #725
- [x] #726
- [x] #727
- [x] #728
- [x] #729
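The checklist above covers context attributes. As a rough illustration of the feature, a sketch assuming the `using_attributes` context manager from `openinference-instrumentation` (parameter names are indicative and may differ):

```python
from openinference.instrumentation import using_attributes

# Attach session/user/metadata context to all spans created inside the block.
with using_attributes(
    session_id="session-abc-123",
    user_id="user-42",
    metadata={"experiment": "vision-rollout"},
    tags=["vision", "demo"],
):
    ...  # make instrumented LLM calls here
```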
Config
- [x] #730
- [x] #731
- [x] #733
- [x] #732
- [x] #734
- [x] #633
- [x] #737
- [x] #632
- [x] #736
- [x] #634
- [x] #635
- [x] #735
Suppress Tracing
- [x] #748
- [x] #749
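These issues cover temporarily disabling span creation. A minimal sketch, assuming the `suppress_tracing` context manager exposed by `openinference-instrumentation` (the exact export name may differ) and reusing the OpenAI `client` from the earlier example:

```python
from openinference.instrumentation import suppress_tracing

# No spans are emitted for instrumented calls made inside this block,
# e.g. internal health checks or guardrail evaluations.
with suppress_tracing():
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "ping"}],
    )
```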
UI / Javascript
- [x] #568
- [x] #704
- [x] #821
- [x] #956
- [ ] [vision] instrumentation for langchain-js
Testing
- [ ] #558
Documentation
- [x] #561
- [x] #786
- [x] #787
- [ ] #788
- [ ] #833
Evals
- [ ] #574
Example of a vLLM client that should also support vision:
```python
import base64

import filetype
import httpx

# VLM_MODEL, VLLM_URL, VLLM_HEALTHCHECK, VLLM_READY_TIMEOUT, ALLOWED_IMAGE_TYPES,
# and wait_for_ready are defined elsewhere in the client module.


class VLMClient:
    def __init__(self, vlm_model: str = VLM_MODEL, vllm_url: str = VLLM_URL):
        self._vlm_model = vlm_model
        self._vllm_client = httpx.AsyncClient(base_url=vllm_url)
        if VLLM_HEALTHCHECK:
            wait_for_ready(
                server_url=vllm_url,
                wait_seconds=VLLM_READY_TIMEOUT,
                health_endpoint="health",
            )

    @property
    def vlm_model(self) -> str:
        return self._vlm_model

    async def __call__(
        self,
        prompt: str,
        image_bytes: bytes | None = None,
        image_filetype: filetype.Type | None = None,
        max_tokens: int = 10,
    ) -> str:
        # Assemble the message content
        message_content: list[dict[str, str | dict]] = [
            {
                "type": "text",
                "text": prompt,
            }
        ]
        if image_bytes is not None:
            if image_filetype is None:
                image_filetype = filetype.guess(image_bytes)
            if image_filetype is None:
                raise ValueError("Could not determine image filetype")
            if image_filetype not in ALLOWED_IMAGE_TYPES:
                raise ValueError(
                    f"Image type {image_filetype} is not supported. Allowed types: {ALLOWED_IMAGE_TYPES}"
                )
            image_b64 = base64.b64encode(image_bytes).decode("utf-8")
            message_content.append(
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{image_filetype.mime};base64,{image_b64}",
                    },
                }
            )
        # Put together the request payload
        payload = {
            "model": self.vlm_model,
            "messages": [{"role": "user", "content": message_content}],
            "max_tokens": max_tokens,
            # "logprobs": True,
            # "top_logprobs": 1,
        }
        response = await self._vllm_client.post("/v1/chat/completions", json=payload)
        response = response.json()
        response_text: str = (
            response.get("choices")[0].get("message", {}).get("content", "").strip()
        )
        return response_text
```
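A hypothetical usage sketch of the client above; the image path and event-loop setup are illustrative:

```python
import asyncio


async def main() -> None:
    client = VLMClient()
    with open("boardwalk.jpg", "rb") as f:  # illustrative local image
        image_bytes = f.read()
    answer = await client(
        prompt="What's in this image?",
        image_bytes=image_bytes,
        max_tokens=50,
    )
    print(answer)


asyncio.run(main())
```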
Closing as completed since image support is done. Audio will come as part of the OpenAI Realtime instrumentation.