Example to use Phi 3.5 Vision to compare two images and/or multiple frames using ONNX/C#
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [X] feature request
- [X] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
Currently I can't find a way to input multiple images to Phi 3.5 Vision using C#/ONNX. This is possible in Python/HF. If it is not possible in ONNX/C#, can you please let me know (or, even better, provide a fix?), and if it is possible, can you provide an example?
Any log messages given by the failure
Expected/desired behavior
To input multiple images to Phi 3.5 Vision when using C#/ONNX.
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
azd version?
run `azd version` and copy paste here.
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
Currently, the ONNX runtime does not support loading multiple images for the Phi 3.5 Vision model in C#. This limitation means that you cannot directly input multiple images as you can with Python/Hugging Face.
However, you can process multiple images sequentially by loading and processing each image one at a time. Here's a basic example of how you might do this in C#:
```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class Program
{
    static void Main(string[] args)
    {
        string modelPath = "path_to_your_model.onnx";
        string[] imagePaths = { "image1.jpg", "image2.jpg" }; // Add your image paths here

        using var session = new InferenceSession(modelPath);

        // Run one inference call per image.
        foreach (var imagePath in imagePaths)
        {
            var inputTensor = LoadImageAsTensor(imagePath);
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input", inputTensor)
            };

            using var results = session.Run(inputs);
            var output = results.First().AsTensor<float>().ToArray();
            Console.WriteLine($"Processed {imagePath}: {string.Join(", ", output)}");
        }
    }

    // Converts an image file to a 1x3xHxW float tensor in NCHW layout,
    // with channel values normalized to [0, 1].
    static DenseTensor<float> LoadImageAsTensor(string imagePath)
    {
        using var bitmap = new Bitmap(imagePath);
        var tensor = new DenseTensor<float>(new[] { 1, 3, bitmap.Height, bitmap.Width });

        for (int y = 0; y < bitmap.Height; y++)
        {
            for (int x = 0; x < bitmap.Width; x++)
            {
                var color = bitmap.GetPixel(x, y);
                tensor[0, 0, y, x] = color.R / 255.0f;
                tensor[0, 1, y, x] = color.G / 255.0f;
                tensor[0, 2, y, x] = color.B / 255.0f;
            }
        }

        return tensor;
    }
}
```
This example demonstrates how to load and process each image one by one. While this isn't as efficient as processing multiple images in a single inference call, it lets you work within the current limitations of the ONNX runtime.

If you have any further questions or need more assistance, feel free to ask! I would also recommend opening an issue on the onnxruntime repo: https://github.com/microsoft/onnxruntime
Some additional discussions and resources
- [Comparison of multiple images with the Phi-3 Vision model](https://github.com/microsoft/onnxruntime-genai/discussions/563)
- [Using Phi-3 & C# with ONNX for text and vision samples (DevBlogs)](https://devblogs.microsoft.com/dotnet/using-phi3-csharp-with-onnx-for-text-and-vision-samples-md/)
- [Using Phi-3 & C# with ONNX for text and vision samples (Tech Community)](https://techcommunity.microsoft.com/t5/educator-developer-blog/using-phi-3-amp-c-with-onnx-for-text-and-vision-samples/ba-p/4161020)
- [Phi-3.5-vision-instruct #837 - GitHub](https://github.com/microsoft/onnxruntime-genai/issues/837)
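
For reference, the blog posts above use the Microsoft.ML.OnnxRuntimeGenAI package rather than raw `InferenceSession` calls; it handles the Phi vision prompt template and image preprocessing for you. Below is a minimal single-image sketch based on those published samples. The API names (`MultiModalProcessor`, `Images.Load`, `ComputeLogits`, etc.) reflect the samples at the time of writing and may differ in your installed version, so treat this as an assumption to verify:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

class GenAISketch
{
    static void Main()
    {
        // Folder containing the Phi-3.5 Vision ONNX model files.
        using var model = new Model("path_to_your_model_folder");
        using var processor = new MultiModalProcessor(model);
        using var tokenizerStream = processor.CreateStream();

        // The processor expects the Phi vision chat template, with
        // <|image_1|> marking where the image belongs in the prompt.
        var prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n";
        var images = Images.Load("image1.jpg");
        var inputTensors = processor.ProcessImages(prompt, images);

        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetSearchOption("max_length", 3072);
        generatorParams.SetInputs(inputTensors);

        // Stream the generated answer token by token.
        using var generator = new Generator(model, generatorParams);
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
            Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
        }
    }
}
```

Whether this path can accept several images in one prompt (e.g., both `<|image_1|>` and `<|image_2|>`) is exactly what discussion #563 and issue #837 above are tracking; as noted, it was not available in the C# API at the time of writing.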
Hi Lee and thanks for responding!
Just to clarify, this is like running multiple separate prompts, right?
How would I combine text [and prompt formatting] with the image tensors when doing it like this?
Yes, you're correct. Each image is processed individually in the loop, and the results are handled separately for each image.
To combine text and image tensors, you can create a multi-modal input for your model. Here's an example of how you might modify your code to include text input along with image tensors:
```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class Program
{
    static void Main(string[] args)
    {
        string modelPath = "path_to_your_model.onnx";
        string[] imagePaths = { "image1.jpg", "image2.jpg" }; // Add your image paths here
        string textInput = "Your text input here";

        using var session = new InferenceSession(modelPath);

        foreach (var imagePath in imagePaths)
        {
            var inputTensor = LoadImageAsTensor(imagePath);
            var textTensor = LoadTextAsTensor(textInput);

            // Bind both tensors to the input names your model expects.
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("image_input", inputTensor),
                NamedOnnxValue.CreateFromTensor("text_input", textTensor)
            };

            using var results = session.Run(inputs);
            var output = results.First().AsTensor<float>().ToArray();
            Console.WriteLine($"Processed {imagePath}: {string.Join(", ", output)}");
        }
    }

    // Converts an image file to a 1x3xHxW float tensor in NCHW layout,
    // with channel values normalized to [0, 1].
    static DenseTensor<float> LoadImageAsTensor(string imagePath)
    {
        using var bitmap = new Bitmap(imagePath);
        var tensor = new DenseTensor<float>(new[] { 1, 3, bitmap.Height, bitmap.Width });

        for (int y = 0; y < bitmap.Height; y++)
        {
            for (int x = 0; x < bitmap.Width; x++)
            {
                var color = bitmap.GetPixel(x, y);
                tensor[0, 0, y, x] = color.R / 255.0f;
                tensor[0, 1, y, x] = color.G / 255.0f;
                tensor[0, 2, y, x] = color.B / 255.0f;
            }
        }

        return tensor;
    }

    // Converts a text string to a 1xN float tensor, one value per word.
    // ConvertWordToFloat below is only a placeholder; see the note after the code.
    static DenseTensor<float> LoadTextAsTensor(string text)
    {
        var words = text.Split(' ');
        var tensor = new DenseTensor<float>(new[] { 1, words.Length });

        for (int i = 0; i < words.Length; i++)
        {
            tensor[0, i] = ConvertWordToFloat(words[i]);
        }

        return tensor;
    }

    // Simple example: sum the character values of each word.
    // A real model needs proper tokenization instead.
    static float ConvertWordToFloat(string word)
    {
        float sum = 0;
        foreach (var ch in word)
        {
            sum += ch;
        }
        return sum;
    }
}
```
In this example, `LoadTextAsTensor` converts the text input into a tensor. You can replace `ConvertWordToFloat` with a more sophisticated text encoding method, such as word embeddings or tokenization, depending on your model's requirements.
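
As a slightly less toy starting point, here is a hedged sketch of a vocabulary-based tokenizer that produces integer token IDs (the usual input format for language models) instead of per-word floats. The vocabulary contents, the `<unk>` fallback, and the `token_ids` input name are all hypothetical placeholders; a real Phi model ships its own tokenizer, which you would use instead:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime.Tensors;

static class ToyTokenizer
{
    // Hypothetical vocabulary; a real model defines its own (e.g., in tokenizer.json).
    static readonly Dictionary<string, long> Vocab = new()
    {
        ["<unk>"] = 0,
        ["describe"] = 1,
        ["this"] = 2,
        ["image"] = 3,
    };

    // Maps whitespace-separated words to token IDs, falling back to <unk>
    // for anything outside the vocabulary.
    public static DenseTensor<long> Encode(string text)
    {
        var words = text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var tensor = new DenseTensor<long>(new[] { 1, words.Length });

        for (int i = 0; i < words.Length; i++)
        {
            tensor[0, i] = Vocab.TryGetValue(words[i], out var id) ? id : Vocab["<unk>"];
        }

        return tensor;
    }
}
```

You would then bind the result with `NamedOnnxValue.CreateFromTensor("token_ids", ToyTokenizer.Encode(textInput))`, assuming your exported graph takes an int64 token-ID input by that (hypothetical) name.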