
Multimodal Embeddings?

Open devlux76 opened this issue 11 months ago • 1 comment

I'd like to generate embeddings for both images and text using a multimodal model such as llama3.2-vision or minicpm-v, for instance on a PDF document with embedded images.

As far as I can tell this isn't supported, or at least it isn't documented.

Can someone explain to me what is needed here?

Thanks!

devlux76 avatar Jan 10 '25 04:01 devlux76

Hi @devlux76, this is possible today.

Take a look at this example:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///

import os
import sys
import ollama

PROMPT = "Describe the provided image in a few sentences"


def run_inference(model: str, image_path: str):
    # Stream a chat completion; a local image path can be passed to a
    # vision model via the "images" field of a message
    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        stream=True,
    )

    # Print tokens as they arrive instead of waiting for the full response
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)


def main():
    if len(sys.argv) != 3:
        print("Usage: python run.py <model_name> <image_path>")
        sys.exit(1)

    model_name = sys.argv[1]
    image_path = sys.argv[2]

    if not os.path.exists(image_path):
        print(f"Error: Image file '{image_path}' does not exist.")
        sys.exit(1)

    run_inference(model_name, image_path)


if __name__ == "__main__":
    main()
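The script above covers the image-chat side of the question. For the embedding side, ollama-python also exposes `ollama.embed`, which takes a string or list of strings. A minimal sketch of embedding text chunks and comparing them (the model name `nomic-embed-text` is an example, not a requirement, and whether `embed` accepts images is not addressed here, which matches the original question):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def embed_texts(model: str, texts: list[str]) -> list[list[float]]:
    # Imported here so the similarity helper above works even
    # without the ollama package installed
    import ollama

    # embed() accepts a single string or a list of strings as `input`
    response = ollama.embed(model=model, input=texts)
    return response["embeddings"]
```

Usage against a running Ollama server would look like `vecs = embed_texts("nomic-embed-text", ["a cat", "a kitten"])`, after which `cosine_similarity(vecs[0], vecs[1])` scores how close the two chunks are.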

dijarvrella avatar Jan 17 '25 21:01 dijarvrella