mlx-swift-examples LMInput restricts model input to a single collection of images and video frames

Open davidkoski opened this issue 7 months ago • 1 comments

See #277 and #276

The UserInput struct can represent a series of messages with media attached to each message:

        return UserInput(
            chat: [
                .system(generate.system),
                .user(prompt, images: media.images, videos: media.videos),
            ],
            processing: media.processing
        )

This could include back-and-forth between the user and the assistant, with additional media attached along the way.

The UserInputProcessor converts this to an LMInput:

public struct LMInput {
    public let text: Text
    public let image: ProcessedImage?
    public let video: ProcessedVideo?
    // ... (remaining members omitted)
}

but that only allows for one set of image/video. This should probably have:

    public let images: [ProcessedImage]
    public let videos: [ProcessedVideo]

though the model would have to be updated to take advantage of that.
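As a rough illustration of the array-based shape suggested above, here is a minimal, self-contained sketch. It does not use the real library types: `ProcessedImage` and `ProcessedVideo` are hypothetical stand-ins, `MultiMediaInput` is a made-up name rather than the actual `LMInput`, and the convenience initializer is only one assumed way to keep single-media call sites working.

```swift
// Hypothetical stand-ins for illustration only; the real ProcessedImage and
// ProcessedVideo types in the library carry pixel data, not a name.
struct ProcessedImage: Equatable { let name: String }
struct ProcessedVideo: Equatable { let name: String }

// Sketch of an input that carries one entry per attached image or video,
// rather than a single optional of each. Not the library's LMInput.
struct MultiMediaInput {
    let text: String
    let images: [ProcessedImage]
    let videos: [ProcessedVideo]
}

extension MultiMediaInput {
    // Assumed convenience: accept the current single-optional shape and
    // wrap each value in a zero- or one-element array.
    init(text: String, image: ProcessedImage?, video: ProcessedVideo?) {
        self.init(
            text: text,
            images: image.map { [$0] } ?? [],
            videos: video.map { [$0] } ?? []
        )
    }
}
```

Declaring the extra initializer in an extension keeps the memberwise initializer available, so callers could pass either arrays or single optionals.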

Consider this chat:

> /image /tmp/img.jpeg


> what animal is in the image?
[["role": "system", "content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]]], ["role": "user", "content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**]]]
The animal in the image is a dog.

> /image /tmp/img2.jpeg


> describe the second image
[["content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]], "role": "system"], ["content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**], "role": "user"], ["content": [["type": "text", "text": "The animal in the image is a dog."]], "role": "assistant"], ["content": [["type": "text", "text": "describe the second image"], **["type": "image"]**], "role": "user"]]
The image shows a dog wearing a Santa hat.

Ideally the second image would be presented at the second image marker. As it is today, both images are combined and injected at the first marker.
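The desired pairing can be sketched with a simplified, hypothetical model of the chat above. `Part`, `Message`, and `imagesByMarker` are illustrative names that do not exist in the library, and file paths stand in for processed images. The sketch only shows the bookkeeping: each image marker takes the next unused image from its own message.

```swift
// Hypothetical, simplified chat model: each message keeps its own attachments.
enum Part: Equatable {
    case text(String)
    case imageMarker
}

struct Message {
    let role: String
    let parts: [Part]
    let images: [String]  // file paths standing in for processed images
}

/// Returns, for each image marker in conversation order, the image paired with it.
func imagesByMarker(_ chat: [Message]) -> [String] {
    var result: [String] = []
    for message in chat {
        var remaining = message.images[...]
        for part in message.parts where part == .imageMarker {
            // Pair the marker with the next unused image from the same message.
            if let next = remaining.popFirst() {
                result.append(next)
            }
        }
    }
    return result
}

let chat = [
    Message(role: "system",
            parts: [.text("You are a helpful assistant who answers questions in English.")],
            images: []),
    Message(role: "user",
            parts: [.text("what animal is in the image?"), .imageMarker],
            images: ["/tmp/img.jpeg"]),
    Message(role: "assistant",
            parts: [.text("The animal in the image is a dog.")],
            images: []),
    Message(role: "user",
            parts: [.text("describe the second image"), .imageMarker],
            images: ["/tmp/img2.jpeg"]),
]

let paired = imagesByMarker(chat)
// paired == ["/tmp/img.jpeg", "/tmp/img2.jpeg"]
```

With this pairing, the second marker resolves to `/tmp/img2.jpeg` instead of both images landing on the first marker.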

Apr 23 '25 20:04 davidkoski

@ibrahimcetin FYI

@blaizzy I think the Python mlx-vlm has the same issue (roughly). It doesn't have the same structures, but it does treat the media as a single bundle.

Assuming of course it is supposed to work like this!

Apr 23 '25 21:04 davidkoski
