mlx-swift-examples
LMInput restricts model input to a single collection of images and video frames
See #277 and #276
The UserInput struct can represent a series of messages with media attached to each message:
    return UserInput(
        chat: [
            .system(generate.system),
            .user(prompt, images: media.images, videos: media.videos),
        ],
        processing: media.processing
    )
This could include back and forth between the user and assistant including adding additional media.
The UserInputProcessor converts this to an LMInput:
    public struct LMInput {
        public let text: Text
        public let image: ProcessedImage?
        public let video: ProcessedVideo?
        // ...
    }
but that only allows for one set of image/video. This should probably have:
    public let images: [ProcessedImage]
    public let videos: [ProcessedVideo]
though the model would have to be updated to take advantage of that.
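To make the proposed shape concrete, here is a minimal sketch. It is not the library's code: `MultiMediaInput` and the simplified `ProcessedImage` / `ProcessedVideo` stand-ins are hypothetical and only exist so the example compiles on its own. The optional `image` / `video` accessors are one assumed way to keep single-media call sites working during a transition.

```swift
// Hypothetical stand-ins for the library's processed media types,
// reduced to a name so the sketch compiles without MLX.
struct ProcessedImage { let name: String }
struct ProcessedVideo { let name: String }

// Sketch of an input that carries one entry per attached image or video,
// in the order the media appears in the chat.
struct MultiMediaInput {
    let text: String
    let images: [ProcessedImage]
    let videos: [ProcessedVideo]

    // Assumed compatibility accessors: expose the first item so code
    // written against a single image/video keeps working.
    var image: ProcessedImage? { images.first }
    var video: ProcessedVideo? { videos.first }
}

let input = MultiMediaInput(
    text: "describe the second image",
    images: [ProcessedImage(name: "img.jpeg"), ProcessedImage(name: "img2.jpeg")],
    videos: []
)
```

With this shape a model could index into `images` per marker instead of receiving one merged bundle, which is the model-side update mentioned above.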
Consider this chat:
> /image /tmp/img.jpeg
> what animal is in the image?
[["role": "system", "content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]]], ["role": "user", "content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**]]]
The animal in the image is a dog.
> /image /tmp/img2.jpeg
> describe the second image
[["content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]], "role": "system"], ["content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**], "role": "user"], ["content": [["type": "text", "text": "The animal in the image is a dog."]], "role": "assistant"], ["content": [["type": "text", "text": "describe the second image"], **["type": "image"]**], "role": "user"]]
The image shows a dog wearing a Santa hat.
Ideally this would present the second image for the second image marker. As it is today it will combine both images and inject them for the first marker.
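The desired pairing can be sketched as follows. This is an illustration under assumptions, not how the processor works today: `Message`, `ImageSlot`, and `imageSlots(for:)` are hypothetical names, and images are represented by their paths rather than processed tensors.

```swift
// Hypothetical, simplified message: a role, its text, and the image
// paths attached to that specific message.
struct Message {
    let role: String
    let text: String
    let images: [String]
}

// One entry per image marker in the rendered prompt.
struct ImageSlot: Equatable {
    let messageIndex: Int
    let image: String
}

// Walk the messages in order and emit one slot per attached image,
// so the n-th marker in the prompt maps to the n-th image.
func imageSlots(for messages: [Message]) -> [ImageSlot] {
    var slots: [ImageSlot] = []
    for (index, message) in messages.enumerated() {
        for image in message.images {
            slots.append(ImageSlot(messageIndex: index, image: image))
        }
    }
    return slots
}

let chat = [
    Message(role: "system", text: "You are a helpful assistant who answers questions in English.", images: []),
    Message(role: "user", text: "what animal is in the image?", images: ["/tmp/img.jpeg"]),
    Message(role: "assistant", text: "The animal in the image is a dog.", images: []),
    Message(role: "user", text: "describe the second image", images: ["/tmp/img2.jpeg"]),
]
let slots = imageSlots(for: chat)
```

Here `slots` holds two entries, one for message 1 with `/tmp/img.jpeg` and one for message 3 with `/tmp/img2.jpeg`, rather than both images being attached to the first marker.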
@ibrahimcetin FYI
@blaizzy I think the Python mlx-vlm has roughly the same issue. It doesn't have the same structures, but it does treat the media as a single bundle.
Assuming of course it is supposed to work like this!