
Passing the prompt as an argument is broken for multimodal models

Open elsatch opened this issue 1 year ago • 4 comments

I have been trying to evaluate the performance of the multimodal models by passing the prompt as an argument, as described in the README.md. Whenever I pass the image as an argument to the ollama CLI, it hallucinates the whole response. Asking about the same image using the regular chat works without problems.

This is a sample image I am using:

[sample image: screenshot of code]

These are the responses I am getting when passing the image as argument:

[screenshot of the responses]

Note: original image created by Leonid Mamchenkov, using the Carbon website to style the code.

elsatch avatar Dec 16 '23 05:12 elsatch

What do you mean by regular chat? The multimodal models do a pretty good job with OCR, but they aren't going to be as good as a full OCR engine. You will probably get better results running an OCR engine first and then passing its output to the model.

technovangelist avatar Dec 19 '23 19:12 technovangelist

When using Ollama's multimodal capabilities, initiating the model with a direct command to analyze an image (e.g., "ollama run llava:13b 'What is in this image /tmp/test_image.jpg'") leads to hallucinated output (as the original report stated; I see Japanese characters in the output).

However, starting the model normally and then entering the image query as a dialogue yields correct results. This suggests an inconsistency in how the model handles direct image processing commands versus interactive dialogue prompts.

markab21 avatar Jan 04 '24 02:01 markab21

@markab21 thanks for the succinct and clear explanation.

When I talked about regular chat, I meant the interactive chat that launches when you run ollama from the command line without a prompt (as opposed to passing the prompt as an argument, as in the example Mark provided).

elsatch avatar Jan 04 '24 10:01 elsatch

I would also like to know how to pass the image when running from the command line or from python. I can only get it to work in the interactive chat.

sam1am avatar Jan 13 '24 04:01 sam1am

I second that. Right now it seems impossible to use Ollama's multimodal capabilities from the command line for scripting or similar automation.

antonme avatar Jan 19 '24 17:01 antonme

Sorry for the slow response. This did get fixed a while back but the issue never got updated. Here's an example:

% ./ollama run llava:13b "Describe this image: /Users/pdevine/Pictures/steve.png"
Added image '/Users/pdevine/Pictures/steve.png'
 The image is a black and white sketch of a person wearing a hard hat. The individual appears to be smiling or laughing,
with a relaxed posture, indicating a sense of joy or amusement. On the head of the figure, there's text that reads "PASCO,"
which could possibly be related to the brand of the hard hat or it might simply be part of the name of the person depicted.
The drawing style is somewhat reminiscent of comic strip illustrations with bold lines and a simplistic form. Below the
figure, there's another line of text that reads "Sam," which could either be the name of the artist or the name of the
character being drawn.

You can see from the line Added image '/Users/pdevine/Pictures/steve.png' that the image was picked up.

@sam1am To make this work from Python, it depends on whether you are using the Chat function or the Generate function. Both functions take base64-encoded images (you'll have to convert the image yourself), which you pass along with the request in the images field.

Here's an example:

import ollama
response = ollama.chat(model='llava:13b', messages=[
  {
    'role': 'user',
    'content': "What's in this image?",
    'images': [ <list of base64 encoded images> ],
  },
])
print(response['message']['content'])
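Since the client expects base64-encoded image data rather than a file path, here is a minimal sketch of that conversion using only the standard library (the helper name image_to_base64 is just for illustration; the 4-byte PNG magic prefix stands in for a real file):

```python
import base64

def image_to_base64(path):
    """Read an image file and return its contents as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Quick sanity check with the PNG magic bytes instead of a real file:
print(base64.b64encode(b"\x89PNG").decode("utf-8"))  # iVBORw==
```

The resulting string can be dropped straight into the images list in the example above.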

Hopefully that answers everyone's questions. I'm going to go ahead and close the issue.

pdevine avatar Mar 11 '24 23:03 pdevine