Add base64 encoding support for daft.DataType.Image()
Is your feature request related to a problem?
When passing images to the OpenAI client, there are two options: either you pass a URL, or a data URL that is base64-encoded. Since Daft has multimodal datatypes and is expanding its native LLM workload support, I propose that the Image dtype be capable of encoding from Image to a base64 string for inference.
vLLM is the perfect example of this, where image inputs can come in either form:
```python
import base64

import requests
from openai import OpenAI

client = OpenAI()

# `model` is assumed to be the name of a vision-capable model served by vLLM.
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"


def encode_base64_content_from_url(content_url: str) -> str:
    """Encode content retrieved from a remote URL to base64 format."""
    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")
    return result


chat_completion_from_url = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
            ],
        }
    ],
    model=model,
    max_completion_tokens=64,
)
result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from image url:", result)

# Use base64 encoded image in the payload
image_base64 = encode_base64_content_from_url(image_url)
chat_completion_from_base64 = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
                },
            ],
        }
    ],
    model=model,
    max_completion_tokens=64,
)
result = chat_completion_from_base64.choices[0].message.content
print("Chat completion output from base64 encoded image:", result)

# Multi-image input inference
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"

chat_completion_from_url = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the animals in these images?"},
                {
                    "type": "image_url",
                    "image_url": {"url": image_url_duck},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": image_url_lion},
                },
            ],
        }
    ],
    model=model,
    max_completion_tokens=64,
)
result = chat_completion_from_url.choices[0].message.content
print("Chat completion output:", result)
```
Describe the solution you'd like
Given that inference clients accept either an image URL (UTF-8 string) or a base64-encoded image string, I propose that we add a new encoding capability to daft.DataType.Image(), with clear documentation examples covering the usage patterns above.
Ideally, any user should be able to simply pass a column with the Image datatype to our built-in AI functions, whether that be multimodal inference, training, embedding, or otherwise.
Built-in AI functions/expressions should be capable of multiplexing multimodal inputs for both Image and List[Image] columns.
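To illustrate the multiplexing idea on the client side, here is a minimal stdlib-only sketch. The helper name `image_content_part` is hypothetical (not an existing Daft or OpenAI API); it shows how a single entry point could accept either payload form that inference clients expect:

```python
import base64
from typing import Union


def image_content_part(image: Union[str, bytes], mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style image_url content part.

    Accepts either a plain URL string or raw encoded image bytes; bytes are
    base64-encoded into a data URL, covering both payload forms that
    inference clients accept.
    """
    if isinstance(image, bytes):
        b64 = base64.b64encode(image).decode("utf-8")
        url = f"data:{mime};base64,{b64}"
    else:
        url = image
    return {"type": "image_url", "image_url": {"url": url}}


# Works for both forms:
part_from_url = image_content_part("https://example.com/cat.jpg")
part_from_bytes = image_content_part(b"\x89PNG...", mime="image/png")
```

A built-in expression could do the same dispatch per row, so a column of Images (or a List[Image] column) slots directly into a messages payload.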
Describe alternatives you've considered
No response
Additional Context
All in all, the Image DataType is powerful in a notebook setting for previews and as a centralized interface for image encoding/decoding/transformation. We can honor the strengths of the DataType by making it "magically" work with most inference and embedding providers, saving developers hours of debugging provider-specific syntax.
Taking a step back, the solutions we implement here will undoubtedly inform how we pass Audio, File, and Video types to inference clients, and will certainly tie into the new and emerging concept of Resources.
A quick anecdote: when I was preparing a workload for structured outputs on images using a Hugging Face dataset, I found the Image dtype to be super helpful for my human eyes to see the data. When it came to the actual workload, however, since the dataset I was using came with images as byte strings, I never really needed to pass through the Daft Image type, making it feel like a missed opportunity. I think ML/AI datasets have a couple of different ways of providing images, but it would have been much cooler if I had been able to use the Image dtype for my inference calls.
Would you like to implement a fix?
No
@everettVT Can I ask what dataset you were using? I ask because I've seen some HuggingFace datasets that have the JPEG bytes as a column, which requires an .image.decode() to convert to an Image dtype in Daft. But the OpenAI client does accept base64-encoded JPEGs, so decoding is not always necessary, unless you want to do some sort of pre-processing on the image (resize, crop, etc.) beforehand.
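As an aside, the no-decode path is easy to see with the stdlib alone. This is a minimal sketch with hypothetical variable names, where `jpeg_bytes` stands in for the raw JPEG column value from such a dataset:

```python
import base64

# Stand-in for a raw JPEG column value (starts with the JPEG magic bytes).
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"..."

# No image decode needed: base64-encode the bytes as-is and wrap in a data URL.
image_base64 = base64.b64encode(jpeg_bytes).decode("utf-8")
data_url = f"data:image/jpeg;base64,{image_base64}"

# Round trip: decoding the base64 recovers the original bytes unchanged.
assert base64.b64decode(image_base64) == jpeg_bytes
```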
@srilman I used the AI2D subset from HuggingFaceM4/the_cauldron, which stores images as PNG byte strings.
For vLLM this didn't work for me; it had to be base64.
```python
import base64

import daft
from daft import col

df_raw = daft.read_parquet(
    "hf://datasets/HuggingFaceM4/the_cauldron/ai2d/train-00000-of-00001-2ce340398c113b79.parquet"
)

# To get a Daft Image column
df = df_raw.explode(col("images")).with_column(
    "image_png", col("images").struct.get("bytes").image.decode()
)

# To get base64
df = df.with_column(
    "image_base64",
    col("images").struct.get("bytes").apply(
        lambda x: base64.b64encode(x).decode("utf-8"),
        return_dtype=daft.DataType.string(),
    ),
).collect()
df.show()
```
@everettVT I opened a PR for base64 encoding in general, which works if your images are already encoded as JPEG/PNG, or after you encode them to one of those formats. For example:
```python
import daft
from daft import col

df_raw = daft.read_parquet(
    "hf://datasets/HuggingFaceM4/the_cauldron/ai2d/train-00000-of-00001-2ce340398c113b79.parquet"
)

# To get base64
df = df_raw.explode(col("images")).with_column(
    "image_base64", col("images").struct.get("bytes").encode("base64")
)
df.collect()
df.show()
```
Note that this isn't really specialized for images, so if you have a DataType.Image() column, you need to encode it to an image format first and then base64 it, like so:

```python
col("image").image.encode("png").encode("base64")
```

This is to reduce the amount of duplication. However, if you feel it would be easier to have a .image.base64_encode(format="jpeg" | "png") expression, then let us know. But internally, it would probably just compose the two expressions.
I think that makes sense. Images aren't the only inputs that require base64 encoding; audio would need it as well for image/audio inputs to vLLM/SGLang.
This helps accomplish both! Very cool.
Great, closing then