Example to use Phi 3.5 Vision to compare two images and/or multiple frames using ONNX/C#
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [X] feature request
- [X] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
Currently I can't find a way to input multiple images to Phi 3.5 Vision using C#/ONNX. This is possible in Python/HF. If it is not possible in ONNX/C#, can you please let me know (or, even better, provide a fix?), and if it is possible, can you provide an example?
Any log messages given by the failure
Expected/desired behavior
To input multiple images to Phi 3.5 Vision when using C#/ONNX.
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
azd version?
run `azd version` and copy paste here.
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
Currently, the ONNX runtime does not support loading multiple images for the Phi 3.5 Vision model in C#. This limitation means that you cannot directly input multiple images as you can with Python/Hugging Face.
However, you can process multiple images sequentially by loading and processing each image one at a time. Here's a basic example of how you might do this in C#:
```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class Program
{
    static void Main(string[] args)
    {
        string modelPath = "path_to_your_model.onnx";
        string[] imagePaths = { "image1.jpg", "image2.jpg" }; // Add your image paths here

        using var session = new InferenceSession(modelPath);

        // Run one inference call per image.
        foreach (var imagePath in imagePaths)
        {
            var inputTensor = LoadImageAsTensor(imagePath);
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input", inputTensor)
            };

            using var results = session.Run(inputs);
            var output = results.First().AsTensor<float>().ToArray();
            Console.WriteLine($"Processed {imagePath}: {string.Join(", ", output)}");
        }
    }

    // Converts an image file to a 1x3xHxW float tensor in NCHW layout,
    // with channel values normalized to [0, 1].
    static DenseTensor<float> LoadImageAsTensor(string imagePath)
    {
        using var bitmap = new Bitmap(imagePath);
        var tensor = new DenseTensor<float>(new[] { 1, 3, bitmap.Height, bitmap.Width });

        for (int y = 0; y < bitmap.Height; y++)
        {
            for (int x = 0; x < bitmap.Width; x++)
            {
                var color = bitmap.GetPixel(x, y);
                tensor[0, 0, y, x] = color.R / 255.0f;
                tensor[0, 1, y, x] = color.G / 255.0f;
                tensor[0, 2, y, x] = color.B / 255.0f;
            }
        }

        return tensor;
    }
}
```
This example demonstrates how to load and process each image one by one. While this isn't as efficient as processing multiple images in a single inference call, it lets you work within the current limitations of the ONNX runtime.

If you have any further questions or need more assistance, feel free to ask! I would also recommend opening an issue on the onnxruntime repo: https://github.com/microsoft/onnxruntime
Some additional discussions and resources
- [Comparison of multiple images with the Phi-3 Vision model](https://github.com/microsoft/onnxruntime-genai/discussions/563)
- [Using Phi-3 & C# with ONNX for text and vision samples (DevBlogs)](https://devblogs.microsoft.com/dotnet/using-phi3-csharp-with-onnx-for-text-and-vision-samples-md/)
- [Using Phi-3 & C# with ONNX for text and vision samples (Tech Community)](https://techcommunity.microsoft.com/t5/educator-developer-blog/using-phi-3-amp-c-with-onnx-for-text-and-vision-samples/ba-p/4161020)
- [Phi-3.5-vision-instruct #837 - GitHub](https://github.com/microsoft/onnxruntime-genai/issues/837)
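
For reference, the blog posts above use the Microsoft.ML.OnnxRuntimeGenAI package rather than raw `InferenceSession` calls; it handles the Phi vision prompt template and image preprocessing for you. Below is a minimal single-image sketch based on those published samples. The API names (`MultiModalProcessor`, `Images.Load`, `ComputeLogits`, etc.) reflect the samples at the time of writing and may differ in your installed version, so treat this as an assumption to verify:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

class GenAISketch
{
    static void Main()
    {
        // Folder containing the Phi-3.5 Vision ONNX model files.
        using var model = new Model("path_to_your_model_folder");
        using var processor = new MultiModalProcessor(model);
        using var tokenizerStream = processor.CreateStream();

        // The processor expects the Phi vision chat template, with
        // <|image_1|> marking where the image belongs in the prompt.
        var prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n";
        var images = Images.Load("image1.jpg");
        var inputTensors = processor.ProcessImages(prompt, images);

        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetSearchOption("max_length", 3072);
        generatorParams.SetInputs(inputTensors);

        // Stream the generated answer token by token.
        using var generator = new Generator(model, generatorParams);
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
            Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
        }
    }
}
```

Whether this path can accept several images in one prompt (e.g., both `<|image_1|>` and `<|image_2|>`) is exactly what discussion #563 and issue #837 above are tracking; as noted, it was not available in the C# API at the time of writing.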
Hi Lee and thanks for responding!
Just to clarify, this is like running multiple separate prompts, right?
How would I combine text [and prompt formatting] with the image tensors when doing it like this?
Yes, you're correct. Each image is processed individually in the loop, and the results are handled separately for each image.
To combine text and image tensors, you can create a multi-modal input for your model. Here's an example of how you might modify your code to include text input along with image tensors:
```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class Program
{
    static void Main(string[] args)
    {
        string modelPath = "path_to_your_model.onnx";
        string[] imagePaths = { "image1.jpg", "image2.jpg" }; // Add your image paths here
        string textInput = "Your text input here";

        using var session = new InferenceSession(modelPath);

        foreach (var imagePath in imagePaths)
        {
            var inputTensor = LoadImageAsTensor(imagePath);
            var textTensor = LoadTextAsTensor(textInput);

            // Bind both tensors to the input names your model expects.
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("image_input", inputTensor),
                NamedOnnxValue.CreateFromTensor("text_input", textTensor)
            };

            using var results = session.Run(inputs);
            var output = results.First().AsTensor<float>().ToArray();
            Console.WriteLine($"Processed {imagePath}: {string.Join(", ", output)}");
        }
    }

    // Converts an image file to a 1x3xHxW float tensor in NCHW layout,
    // with channel values normalized to [0, 1].
    static DenseTensor<float> LoadImageAsTensor(string imagePath)
    {
        using var bitmap = new Bitmap(imagePath);
        var tensor = new DenseTensor<float>(new[] { 1, 3, bitmap.Height, bitmap.Width });

        for (int y = 0; y < bitmap.Height; y++)
        {
            for (int x = 0; x < bitmap.Width; x++)
            {
                var color = bitmap.GetPixel(x, y);
                tensor[0, 0, y, x] = color.R / 255.0f;
                tensor[0, 1, y, x] = color.G / 255.0f;
                tensor[0, 2, y, x] = color.B / 255.0f;
            }
        }

        return tensor;
    }

    // Converts a text string to a 1xN float tensor, one value per word.
    // ConvertWordToFloat below is only a placeholder; see the note after the code.
    static DenseTensor<float> LoadTextAsTensor(string text)
    {
        var words = text.Split(' ');
        var tensor = new DenseTensor<float>(new[] { 1, words.Length });

        for (int i = 0; i < words.Length; i++)
        {
            tensor[0, i] = ConvertWordToFloat(words[i]);
        }

        return tensor;
    }

    // Simple example: sum the character values of each word.
    // A real model needs proper tokenization instead.
    static float ConvertWordToFloat(string word)
    {
        float sum = 0;
        foreach (var ch in word)
        {
            sum += ch;
        }
        return sum;
    }
}
```
In this example, `LoadTextAsTensor` converts the text input into a tensor. You can replace `ConvertWordToFloat` with a more sophisticated text encoding method, such as word embeddings or tokenization, depending on your model's requirements.
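
As a slightly less toy starting point, here is a hedged sketch of a vocabulary-based tokenizer that produces integer token IDs (the usual input format for language models) instead of per-word floats. The vocabulary contents, the `<unk>` fallback, and the `token_ids` input name are all hypothetical placeholders; a real Phi model ships its own tokenizer, which you would use instead:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime.Tensors;

static class ToyTokenizer
{
    // Hypothetical vocabulary; a real model defines its own (e.g., in tokenizer.json).
    static readonly Dictionary<string, long> Vocab = new()
    {
        ["<unk>"] = 0,
        ["describe"] = 1,
        ["this"] = 2,
        ["image"] = 3,
    };

    // Maps whitespace-separated words to token IDs, falling back to <unk>
    // for anything outside the vocabulary.
    public static DenseTensor<long> Encode(string text)
    {
        var words = text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var tensor = new DenseTensor<long>(new[] { 1, words.Length });

        for (int i = 0; i < words.Length; i++)
        {
            tensor[0, i] = Vocab.TryGetValue(words[i], out var id) ? id : Vocab["<unk>"];
        }

        return tensor;
    }
}
```

You would then bind the result with `NamedOnnxValue.CreateFromTensor("token_ids", ToyTokenizer.Encode(textInput))`, assuming your exported graph takes an int64 token-ID input by that (hypothetical) name.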