
How to get text coordinates (bbox) from phi-3 vision

ladanisavan opened this issue · 4 comments

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Hello,

First, thank you for the incredible work you have shared with the Phi community. Is there a way to obtain the text coordinates (bounding boxes) from the Phi-3 vision generated output for an input image? This feature would be immensely beneficial for applications that rely on precise text positioning, such as document layout analysis and OCR post-processing.

Thank you for considering this request.

ladanisavan avatar Aug 02 '24 10:08 ladanisavan

@ChenRocks thoughts on the above feature?

leestott avatar Aug 05 '24 06:08 leestott

@ladanisavan

To achieve this, you can use the ONNX Runtime with the Phi-3 vision model.

Here’s a general approach:

  1. Setup: Ensure you have the necessary tools and libraries installed, such as ONNX Runtime and the Phi-3 vision model. You can find the models on platforms like Azure AI Catalog or Hugging Face.

  2. Run the Model: Use ONNX Runtime to run the Phi-3 vision model on your input image. The model processes the image and generates text output.

  3. Extract Bounding Boxes: If the exported model also emits text-detection outputs, bounding boxes are typically represented by the coordinates of the top-left corner (x, y) plus the width and height of each box.
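As a sketch of step 3 — assuming a detection-style output of normalized [x, y, width, height] boxes (an assumption; the actual output layout depends on the exported model) — converting them to pixel coordinates might look like:

```python
import numpy as np

def to_pixel_boxes(norm_boxes, image_width, image_height):
    """Scale normalized [x, y, w, h] boxes (values in 0..1) to pixel units."""
    scale = np.array(
        [image_width, image_height, image_width, image_height],
        dtype=np.float32,
    )
    return np.asarray(norm_boxes, dtype=np.float32) * scale

# Example: one box covering the left half of a 640x480 image
boxes = to_pixel_boxes([[0.0, 0.0, 0.5, 1.0]], 640, 480)
print(boxes.tolist())  # [[0.0, 0.0, 320.0, 480.0]]
```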

Here is a simplified example of how you might set this up in Python:

import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the exported model (path is a placeholder)
session = ort.InferenceSession("path_to_phi3_model.onnx")

# Preprocess the image: resize/normalize to whatever the exported model
# expects; float32 NCHW with a leading batch dimension is typical.
image = Image.open("path_to_image.jpg").convert("RGB")
input_data = np.array(image, dtype=np.float32) / 255.0      # scale to [0, 1]
input_data = input_data.transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> NCHW, add batch dim

# The actual input name depends on the export; query it rather than hard-coding it.
input_name = session.get_inputs()[0].name

# Run the model
outputs = session.run(None, {input_name: input_data})

# Extract bounding boxes from the output.
# NOTE: this assumes the exported model exposes a box output as its first result;
# inspect session.get_outputs() to confirm which output (if any) holds boxes.
bounding_boxes = outputs[0]

for box in bounding_boxes:
    x, y, width, height = box
    print(f"Bounding box: x={x}, y={y}, width={width}, height={height}")

Source code examples & ONNX models:

Phi-3 vision tutorial | onnxruntime
Phi-3 vision ONNX CPU model
Phi-3 vision ONNX CUDA model

leestott avatar Aug 14 '24 13:08 leestott

@leestott

Thank you for getting back to me. Have you tested this on your side? It's not working on my side.

ladanisavan avatar Aug 14 '24 14:08 ladanisavan

Thanks @ladanisavan for your inquiry. Unfortunately, bounding-box (bbox) support is currently not available in Phi-3.x-vision. We appreciate this feedback and will discuss this feature request for future versions.

In the meantime, I personally recommend Florence-2.
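As a hedged sketch of working with Florence-2's region output: its "<OCR_WITH_REGION>" task reports each text region as a quadrilateral of eight values [x1, y1, x2, y2, x3, y3, x4, y4] (four corner points). If you need axis-aligned boxes, collapsing a quad into [x_min, y_min, x_max, y_max] is a small post-processing step:

```python
import numpy as np

def quad_to_bbox(quad):
    """Collapse an 8-value quadrilateral into an axis-aligned bounding box."""
    pts = np.asarray(quad, dtype=np.float32).reshape(4, 2)  # four (x, y) corners
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return [float(x_min), float(y_min), float(x_max), float(y_max)]

# Example with a slightly rotated text region
print(quad_to_bbox([10, 20, 110, 25, 108, 60, 8, 55]))  # [8.0, 20.0, 110.0, 60.0]
```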

ChenRocks avatar Aug 20 '24 22:08 ChenRocks