Widen asset format supports
Currently an Image can only be instantiated with a PIL.Image for instance, but some models accept images as urls and users may want to give a binary. We should allow a wider range of formats.
I could work on this after my current PR 1755 gets reviewed and accepted.
Awesome! I think the most useful part is accepting a url for images, but then it means such images could not be used for local models, so there is a bit to change there.
@RobinPicard Let me know your thoughts:
Issue #1742: Widen asset format supports
Current state: outlines.inputs.Image only accepts PIL.Image objects. Problem: remote multimodal APIs (OpenAI, Anthropic, Gemini) often accept URLs or binaries directly, while local models (transformers, llama.cpp, etc.) expect preprocessed image tensors.
Proposed solution:
- Make the Image constructor polymorphic: Image(pil_image) // accept PIL.Image Image("https://") // accept URL Image(b"...") // accept raw bytes Image("/path.png") // accept local file path
- Also provide explicit helpers (from_pil, from_url, from_bytes, from_path) for clarity when needed.
- Proposed Behavior:
- Remote backends: pass URL directly, or upload binary.
- Local backends: if URL is given, auto-fetch and convert to PIL.Image. If bytes are given, open via PIL.Image.open(io.BytesIO(...)).
- This keeps the API identical between local and remote models, with transparent handling of each input type.
- Document/Changes: document that local backends may trigger implicit network calls when given a URL, to avoid surprising users.
Example changes to model integration for a local model:
def _create_message(self, role: str, content: str | list) -> dict:
"""Create a message for Ollama, supporting multimodal inputs."""
if isinstance(content, str):
return {
"role": role,
"content": content,
}
elif isinstance(content, list):
prompt = content[0]
images = content[1:]
if not all(isinstance(image, Image) for image in images):
raise ValueError("All assets provided must be of type outlines.Image")
serialized_images = []
for image in images:
if isinstance(image.data, str) and image.data.startswith(("http://", "https://")):
# For Ollama (offline), resolve the URL into a local PIL.Image
pil_image = fetch_and_convert(image.data) # assume helper defined elsewhere
image = Image(pil_image) # re-wrap into outlines.Image
# Always append the base64 string for local runtime
serialized_images.append(image.image_str)
return {
"role": role,
"content": prompt,
"images": serialized_images,
}
else:
raise ValueError(
f"Invalid content type: {type(content)}. "
"The content must be a string or a [prompt, images…] list."
)
A model that accepts URLs would have to just take the str in image.data
Example image fetch
import io
import httpx
from PIL import Image as PILImage
async def fetch_and_convert_image_async(url: str):
"""
Fetch an image URL asynchronously and return a PIL.Image object.
"""
headers = {
"User-Agent": (
"Mozilla/5.0 (X11; Linux x86_64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/117.0 Safari/537.36"
)
}
async with httpx.AsyncClient(timeout=30) as client:
resp = await client.get(url, headers=headers)
resp.raise_for_status()
content_type = resp.headers.get("content-type", "").lower()
if not content_type.startswith("image/"):
raise ValueError(f"Expected image/* content, got {content_type}")
return PILImage.open(io.BytesIO(resp.content))
Though I think it should be sync for now.
- see
Looks good overall, but I would not fetch images from an url for local models as it's adding troubles for us. Also I would avoid checking "image.data.startswith(("http://", "https://")):" in the type adapter. Ideally we would have methods is_url, is_bytes... in the image object for that. Additionnally we would have conversion methods when applicable (bytes to PIL for instance). I'm thinking about something similar to #1761