
🗺️ Vision / multi-modal


GPT-4o introduces a new message content type that contains images, encoded either as a URL or as base64.

example:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])

https://platform.openai.com/docs/guides/vision
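
Since payloads can also be base64-encoded, here is the same request with an inline data URI instead of a remote URL. This is a minimal sketch; the local file path is an assumption for illustration.

import base64

from openai import OpenAI

client = OpenAI()

# Read a local image and base64-encode it (the path is illustrative)
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          # Inline base64 data URI instead of a remote URL
          "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])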

Milestone 1

  • Vision support in the Python instrumentations for llama-index, openai, gemini, and langchain
  • Eliminate performance degradation from base64-encoded payloads by allowing users to opt out
  • Preliminary set of config flags to mask input/output that could contain sensitive info (see the sketch after this list)
  • Create examples
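
A rough sketch of what the opt-out and masking configuration could look like on the Python side. The `TraceConfig` object and the flag names (`hide_input_images`, `base64_image_max_length`) are illustrative assumptions here, not a finalized API.

from openinference.instrumentation import TraceConfig  # assumed package layout
from openinference.instrumentation.openai import OpenAIInstrumentor

# Illustrative config: mask image payloads and cap base64 sizes so spans don't
# balloon; the exact flag names may differ from the final implementation.
config = TraceConfig(
    hide_input_images=True,          # drop image content from input messages
    base64_image_max_length=32_000,  # omit oversized base64 payloads
)

OpenAIInstrumentor().instrument(config=config)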

Milestone N

  • Image synthesis APIs such as DALL-E

Tracing

  • [x] #522
  • [x] #523
  • [x] #539
  • [ ] #562
  • [x] #582
  • [x] #538
  • [ ] [vision] [javascript] langchain image messages parsing
  • [x] #557
  • [x] #560
  • [ ] [multi-modal] scope out video / audio semantic conventions
  • [x] #567

Instrumentation

Testing

  • [x] #872

Image tracing

  • [x] #707
  • [x] #708
  • [x] #709
  • [x] #710
  • [ ] #711
  • [x] #631
  • [x] #712
  • [x] #713
  • [x] #714
  • [x] #715
  • [x] #716
  • [x] #717

Context Attributes

  • [x] #718
  • [x] #719
  • [x] #720
  • [x] #721
  • [x] #722
  • [x] #723
  • [x] #724
  • [x] #725
  • [x] #726
  • [x] #727
  • [x] #728
  • [x] #729

Config

  • [x] #730
  • [x] #731
  • [x] #733
  • [x] #732
  • [x] #734
  • [x] #633
  • [x] #737
  • [x] #632
  • [x] #736
  • [x] #634
  • [x] #635
  • [x] #735

Suppress Tracing

  • [x] #748
  • [x] #749

UI / Javascript

  • [x] #568
  • [x] #704
  • [x] #821
  • [x] #956
  • [ ] [vision] instrumentation for langchain-js

Testing

  • [ ] #558

Documentation

  • [x] #561
  • [x] #786
  • [x] #787
  • [ ] #788
  • [ ] #833

Evals

  • [ ] #574

mikeldking · May 23 '24 08:05

An example vLLM client that should also support vision:

import base64

import filetype
import httpx

# VLM_MODEL, VLLM_URL, VLLM_HEALTHCHECK, VLLM_READY_TIMEOUT, ALLOWED_IMAGE_TYPES,
# and wait_for_ready are assumed to be defined elsewhere in the module.


class VLMClient:
    def __init__(self, vlm_model: str = VLM_MODEL, vllm_url: str = VLLM_URL):
        self._vlm_model = vlm_model
        self._vllm_client = httpx.AsyncClient(base_url=vllm_url)

        if VLLM_HEALTHCHECK:
            wait_for_ready(
                server_url=vllm_url,
                wait_seconds=VLLM_READY_TIMEOUT,
                health_endpoint="health",
            )

    @property
    def vlm_model(self) -> str:
        return self._vlm_model

    async def __call__(
        self,
        prompt: str,
        image_bytes: bytes | None = None,
        image_filetype: filetype.Type | None = None,
        max_tokens: int = 10,
    ) -> str:
        # Assemble the message content, starting with the text prompt
        message_content: list[dict[str, str | dict]] = [
            {
                "type": "text",
                "text": prompt,
            }
        ]

        if image_bytes is not None:
            # Infer the image type from the raw bytes if it wasn't provided
            if image_filetype is None:
                image_filetype = filetype.guess(image_bytes)

            if image_filetype is None:
                raise ValueError("Could not determine image filetype")

            if image_filetype not in ALLOWED_IMAGE_TYPES:
                raise ValueError(
                    f"Image type {image_filetype} is not supported. Allowed types: {ALLOWED_IMAGE_TYPES}"
                )

            # Encode the image as a base64 data URI, matching the OpenAI-style
            # image_url content part
            image_b64 = base64.b64encode(image_bytes).decode("utf-8")
            message_content.append(
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{image_filetype.mime};base64,{image_b64}",
                    },
                }
            )

        # Put together the request payload for the OpenAI-compatible endpoint
        payload = {
            "model": self.vlm_model,
            "messages": [{"role": "user", "content": message_content}],
            "max_tokens": max_tokens,
            # "logprobs": True,
            # "top_logprobs": 1,
        }

        response = await self._vllm_client.post("/v1/chat/completions", json=payload)
        response_json = response.json()
        response_text: str = (
            response_json.get("choices")[0].get("message", {}).get("content", "").strip()
        )

        return response_text
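
A usage sketch for the client above; the model name, server URL, and image path are placeholders, and the module-level constants referenced by the client are assumed to exist.

import asyncio


async def main() -> None:
    # Placeholder model/URL; substitute whatever the vLLM server is serving
    client = VLMClient(
        vlm_model="llava-hf/llava-1.5-7b-hf",
        vllm_url="http://localhost:8000",
    )

    with open("photo.jpg", "rb") as f:
        image_bytes = f.read()

    answer = await client(
        prompt="What's in this image?",
        image_bytes=image_bytes,
        max_tokens=100,
    )
    print(answer)


asyncio.run(main())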

mikeldking · May 23 '24 15:05

Closing as completed since image support is done. Audio will come as part of the OpenAI Realtime instrumentation.

mikeldking · Dec 06 '24 00:12