
Widen asset format supports

Open RobinPicard opened this issue 3 months ago • 4 comments

Currently, an Image can only be instantiated from a PIL.Image, but some models accept images as URLs, and users may want to pass raw binary data. We should allow a wider range of formats.

RobinPicard avatar Aug 29 '25 09:08 RobinPicard

I could work on this after my current PR 1755 gets reviewed and accepted.

yasteven avatar Sep 22 '25 17:09 yasteven

Awesome! I think the most useful part is accepting a URL for images, but such images could then not be used with local models, so there is a bit to change there.

RobinPicard avatar Oct 03 '25 07:10 RobinPicard

@RobinPicard Let me know your thoughts:

Issue #1742: Widen asset format supports

Current state: outlines.inputs.Image only accepts PIL.Image objects. Problem: remote multimodal APIs (OpenAI, Anthropic, Gemini) often accept URLs or binaries directly, while local models (transformers, llama.cpp, etc.) expect preprocessed image tensors.

Proposed solution:

  • Make the Image constructor polymorphic:
    • Image(pil_image) // accept PIL.Image
    • Image("https://") // accept URL
    • Image(b"...") // accept raw bytes
    • Image("/path.png") // accept local file path
  • Also provide explicit helpers (from_pil, from_url, from_bytes, from_path) for clarity when needed.
  • Proposed Behavior:
    • Remote backends: pass URL directly, or upload binary.
    • Local backends: if URL is given, auto-fetch and convert to PIL.Image. If bytes are given, open via PIL.Image.open(io.BytesIO(...)).
  • This keeps the API identical between local and remote models, with transparent handling of each input type.
  • Documentation changes: state that local backends may trigger implicit network calls when given a URL, to avoid surprising users.
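
To make the dispatch rules above concrete, here is a minimal sketch of the polymorphic entry point plus the explicit helpers. `ImageInput`, its `kind` field, and the `coerce` classmethod are hypothetical names used only for illustration; they are not part of the current outlines API, and the real change would live in the `Image` constructor itself. The sketch also assumes any non-str, non-bytes, non-Path input is a PIL.Image, which a real implementation would check explicitly.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any

URL_PREFIXES = ("http://", "https://")


@dataclass(frozen=True)
class ImageInput:
    """Hypothetical stand-in for outlines.inputs.Image (not the real class)."""

    data: Any
    kind: str  # one of "pil", "url", "bytes", "path"

    @classmethod
    def from_pil(cls, image: Any) -> "ImageInput":
        return cls(image, "pil")

    @classmethod
    def from_url(cls, url: str) -> "ImageInput":
        if not url.startswith(URL_PREFIXES):
            raise ValueError(f"Not an http(s) URL: {url!r}")
        return cls(url, "url")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "ImageInput":
        return cls(bytes(raw), "bytes")

    @classmethod
    def from_path(cls, path: Any) -> "ImageInput":
        return cls(Path(path), "path")

    @classmethod
    def coerce(cls, source: Any) -> "ImageInput":
        """Dispatch on the input type, as a polymorphic constructor would."""
        if isinstance(source, (bytes, bytearray)):
            return cls.from_bytes(source)
        if isinstance(source, Path):
            return cls.from_path(source)
        if isinstance(source, str):
            # A string is a URL only if it has an http(s) prefix; otherwise a path.
            if source.startswith(URL_PREFIXES):
                return cls.from_url(source)
            return cls.from_path(source)
        # Assumption: anything else is a PIL.Image; a real version would verify this.
        return cls.from_pil(source)
```

A string is ambiguous between a URL and a path, which is the main reason to keep the explicit helpers: `from_url` rejects a string without an http(s) prefix, while `coerce` silently treats it as a path.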

Example changes to model integration for a local model:


    def _create_message(self, role: str, content: str | list) -> dict:
        """Create a message for Ollama, supporting multimodal inputs."""

        if isinstance(content, str):
            return {
                "role": role,
                "content": content,
            }

        elif isinstance(content, list):
            prompt = content[0]
            images = content[1:]

            if not all(isinstance(image, Image) for image in images):
                raise ValueError("All assets provided must be of type outlines.Image")

            serialized_images = []
            for image in images:
                if isinstance(image.data, str) and image.data.startswith(("http://", "https://")):
                    # For Ollama (offline), resolve the URL into a local PIL.Image
                    pil_image = fetch_and_convert(image.data)  # assume helper defined elsewhere
                    image = Image(pil_image)  # re-wrap into outlines.Image
                # Always append the base64 string for local runtime
                serialized_images.append(image.image_str)

            return {
                "role": role,
                "content": prompt,
                "images": serialized_images,
            }

        else:
            raise ValueError(
                f"Invalid content type: {type(content)}. "
                "The content must be a string or a [prompt, images…] list."
            )

A model that accepts URLs would just take the str in image.data.

Example image fetch


import io
import httpx
from PIL import Image as PILImage

async def fetch_and_convert_image_async(url: str):
    """
    Fetch an image URL asynchronously and return a PIL.Image object.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/117.0 Safari/537.36"
        )
    }
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(url, headers=headers)
        resp.raise_for_status()
        content_type = resp.headers.get("content-type", "").lower()
        if not content_type.startswith("image/"):
            raise ValueError(f"Expected image/* content, got {content_type}")
        return PILImage.open(io.BytesIO(resp.content))

Though I think it should be sync for now.
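
A synchronous variant could look roughly like the sketch below. To keep it dependency-free it swaps httpx for the standard library's `urllib.request` and returns the raw bytes rather than a PIL.Image (the caller can still pass them to `PILImage.open(io.BytesIO(...))`). The function name `fetch_image_bytes` and the `allowed_schemes` parameter are assumptions made for this example.

```python
import urllib.request
from urllib.parse import urlsplit


def fetch_image_bytes(
    url: str,
    timeout: float = 30.0,
    allowed_schemes: tuple = ("http", "https"),
) -> bytes:
    """Fetch an image URL synchronously and return its raw bytes."""
    # Reject schemes such as file:// before opening anything.
    scheme = urlsplit(url).scheme.lower()
    if scheme not in allowed_schemes:
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")

    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # which plays the role of raise_for_status() in the async version.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        content_type = resp.headers.get_content_type()
        if not content_type.startswith("image/"):
            raise ValueError(f"Expected image/* content, got {content_type}")
        return resp.read()
```

The scheme allow-list matters because `urlopen` also handles `file:` and `data:` URLs by default, which is probably not what a caller passing a "URL" expects.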

  • see

yasteven avatar Oct 03 '25 19:10 yasteven

Looks good overall, but I would not fetch images from a URL for local models, as it adds trouble for us. I would also avoid checking "image.data.startswith(("http://", "https://")):" in the type adapter. Ideally we would have methods such as is_url, is_bytes... on the image object for that. Additionally, we would have conversion methods where applicable (bytes to PIL, for instance). I'm thinking about something similar to #1761
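
A rough sketch of what I mean, with the format checks living on the image object and the type adapter only asking questions. `SketchImage`, `to_base64`, `to_pil`, and `serialize_for_local_model` are placeholder names for illustration, not existing outlines code, and the actual design may differ.

```python
import base64
import io
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SketchImage:
    """Placeholder image wrapper holding either a URL string or raw bytes."""

    data: Union[str, bytes]

    @property
    def is_url(self) -> bool:
        return isinstance(self.data, str) and self.data.startswith(
            ("http://", "https://")
        )

    @property
    def is_bytes(self) -> bool:
        return isinstance(self.data, bytes)

    def to_base64(self) -> str:
        """Conversion that only applies to bytes-backed images."""
        if not self.is_bytes:
            raise TypeError("to_base64 only applies to bytes-backed images")
        return base64.b64encode(self.data).decode("ascii")

    def to_pil(self):
        """Bytes-to-PIL conversion; Pillow is imported lazily."""
        if not self.is_bytes:
            raise TypeError("to_pil only applies to bytes-backed images")
        from PIL import Image as PILImage

        return PILImage.open(io.BytesIO(self.data))


def serialize_for_local_model(image: SketchImage) -> str:
    """Type-adapter side: no prefix checks, and no implicit fetching."""
    if image.is_url:
        raise ValueError(
            "URL-backed images are not supported by local models; pass bytes instead"
        )
    return image.to_base64()
```

With this shape, a local adapter rejects URLs with a clear error instead of making a network call, and a remote adapter can branch on `image.is_url` to pass the string through unchanged.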

RobinPicard avatar Oct 07 '25 08:10 RobinPicard