ollama-python
Multimodal Embeddings?
I'd like to generate embeddings using a multimodal model such as llama3.2-vision or minicpm-v for images and text, for instance a PDF document with embedded images.
As far as I can tell this isn't supported, or at least it isn't documented.
Can someone explain to me what is needed here?
Thanks!
Hi @devlux76, this is possible today.
Take a look at this example:
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ollama",
# ]
# ///
import os
import sys

import ollama

PROMPT = "Describe the provided image in a few sentences"


def run_inference(model: str, image_path: str):
    # Ask the vision model to describe the image, streaming tokens as they arrive.
    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)


def main():
    # Expect exactly two arguments: the model name and a path to an image file.
    if len(sys.argv) != 3:
        print("Usage: python run.py <model_name> <image_path>")
        sys.exit(1)

    model_name = sys.argv[1]
    image_path = sys.argv[2]

    if not os.path.exists(image_path):
        print(f"Error: Image file '{image_path}' does not exist.")
        sys.exit(1)

    run_inference(model_name, image_path)


if __name__ == "__main__":
    main()
```
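Saving this as run.py and invoking it as, for example, `python run.py llama3.2-vision ./page.png` (the file name and image path are placeholders) streams a short description of the image to stdout.

For the embeddings part of the question, the client also exposes an embed call for text input. Below is a minimal sketch of that call; the model name is an assumption, and this only covers the text side, not a confirmed multimodal-embedding API:

```python
import ollama

# Minimal text-embedding sketch; "nomic-embed-text" is a placeholder for
# whatever embedding-capable model is pulled locally.
response = ollama.embed(
    model="nomic-embed-text",
    input="Text extracted from a PDF page",
)
print(len(response["embeddings"][0]))  # dimensionality of the first embedding vector
```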