
Images are not visible in the console and are invalid in the ChatML / string representation.

Open · dchichkov opened this issue Nov 15 '24 · 3 comments

The bug

While images are visible in Jupyter, the ChatML/string representation becomes invalid once the runtime stops, and the images are not visible in the console.

<|im_start|>user
What is the capital of <|_image:94590504830032|>?<|im_end|>
<|im_start|>assistant
I'm unable to view or interpret images directly. However, if you provide a description of the image or additional context, I'd be happy to help you with any information you need!<|im_end|>

The current image representation is a custom ChatML tag, <|_image:94590504830032|>, where the number is valid only during runtime because it references a local runtime object. As a result, the produced ChatML log str(lm) is effectively invalid once the runtime stops. Note also that <|_image:xxx|> is not valid Markdown or HTML, so regular Markdown viewers that can render the rest of the ChatML exchange can't display the images.

It'd be good to normalize the representation and use valid HTML <img> tags, e.g. <img src="https://...">, <img src="file://...">, or <img src="data:image/jpeg;base64,...">, instead of the newly introduced <|_image: |> tag.
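For illustration, the self-contained data-URI variant of this proposal can be produced with a few lines of standard-library Python. This is only a sketch; the helper name image_to_img_tag is mine, not part of guidance:

import base64
import mimetypes

def image_to_img_tag(path):
    # Guess the MIME type from the file extension, defaulting to JPEG.
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    # A data-URI <img> tag renders in any Markdown/HTML viewer and remains
    # valid after the runtime stops, unlike <|_image:<object id>|>.
    return f'<img src="data:{mime};base64,{data}">'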

To Reproduce


# Assuming the usual guidance imports, e.g.:
# from guidance import models, user, assistant, gen, image

with user():
    lm = self.gpt + "What is the capital of " + image("france.jpg") + "?"

with assistant():
    lm += gen("capital")

System info (please complete the following information):

  • OS: Ubuntu 22.04, vscode
  • Guidance Version: 0.1.16

dchichkov · Nov 15 '24

In case it helps, I've implemented a request hook that expands the <img> URLs, as a workaround for vLLM / OpenAI-hosted VLMs:

import json
import re

import httpx

def hook(request):
    j = json.loads(request.content)

    def process_content(input_str):
        # Split on <img> tags, capturing both the full tag and the src URL
        # (http(s)://, file://, or data: URIs). re.split with two capture
        # groups yields parts in triples: text, full tag, URL.
        parts = re.split(r'(<img\s+[^>]*src=[\'"]([hfd].*?)[\'"][^>]*>)', input_str)
        content, image_url = [], None

        for i in range(len(parts)):
            if i % 3 == 0:  # Text part (regular text before or after <img> tags)
                text_part = parts[i].strip()
                if text_part:
                    content.append({"type": "text", "text": text_part})
            elif i % 3 == 2:  # Image URL part (captured inside the <img> tag)
                image_url = parts[i].strip()
                content.append({"type": "image_url", "image_url": {"url": image_url}})

        # If no <img> tag was found, leave the message content as a plain string.
        if not image_url:
            return input_str

        return content

    # Split the content of each message into text and image_url parts.
    for message in j['messages']:
        if isinstance(message["content"], str):
            message["content"] = process_content(message["content"])

    # Replace the request body with the modified JSON and fix the length header.
    modified_content = json.dumps(j).encode('utf-8')
    request.stream = httpx.ByteStream(modified_content)
    request.headers['Content-Length'] = str(len(modified_content))
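To illustrate the transformation with a hypothetical prompt, the hook turns a mixed text/<img> message into the list-of-parts content format that OpenAI-style vision endpoints expect:

# Hypothetical input and output of process_content:
# process_content('Is there a dog? <img src="https://example.com/dog.jpg">')
# -> [{"type": "text", "text": "Is there a dog?"},
#     {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}}]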

It can be used with:

...
import httpx, json

llm = models.OpenAI(...
                    http_client=httpx.Client(event_hooks={'request': [hook]}))
                    
llm = llm + f"Is there a dog on this image? " + f"<img src=' ... '>" + "?"
...

dchichkov · Nov 15 '24

Thank you for the feedback. We are actively working on a full-stack rework of multimodal support in Guidance. Part of this rework involves reformatting the way that prompt data is represented internally, which should enable us to have more flexibility in how we present the data to users.

Can you share more info on how you are trying to use the output of str(lm)? I would like to reproduce the error you're getting. It sounds like you expect str(lm) to produce a valid HTML string with the image data encoded in an img tag. Currently, str(lm) is meant to output the internal string prompt representation, which is a Guidance-specific format. With more info about your use case, I can consider how we might provide a better function for achieving the output you want. There is ongoing work to rework the Jupyter UX in addition to adding better multimodal support, so this is a good opportunity to take your feedback into consideration.

nking-1 · Nov 15 '24

I expect str(lm) to produce a valid ChatML / Markdown string, as it did in the text-only case. It seems that ChatML was designed to work well in Markdown (the de-facto standard for LLM output formatting), and I was assuming that Guidance also followed that convention.

I apply a small bugfix to the str(lm) output, re-arranging the spacing around the tags so they render the same in a Markdown viewer as in the ChatML documentation (https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chat-markup-language#working-with-chat-markup-language-chatml); a sketch of such a spacing fix follows the example below. For LaTeX, formatting, images, or video I use Markdown/HTML tags. This allows me to use the rich ecosystem of Markdown parsers, renderers, and editors. For images, I prefer the <img> tag, as it is supported in all Markdown implementations and allows a richer src URI syntax, including local files, base64 data, or URLs.

<|im_start|>system

Provide some context and/or instructions to the model.

<|im_end|>

<|im_start|>user

The user's message goes here

<|im_end|>

<|im_start|>assistant
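A minimal sketch of such a spacing fix, assuming it only needs to normalize the whitespace around the role markers (reflow_chatml is my name for it, not a guidance API):

import re

def reflow_chatml(text):
    # Put each role marker and each message body in its own paragraph so
    # Markdown viewers render the turns the way the ChatML docs show them.
    text = re.sub(r'<\|im_start\|>(\w+)\s*', r'<|im_start|>\1\n\n', text)
    text = re.sub(r'\s*<\|im_end\|>', '\n\n<|im_end|>\n\n', text)
    return text.strip()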

dchichkov · Nov 16 '24