optimum
Support for Florence 2 model
Feature request
When I try to export Florence-2, it fails with a bizarre error message that leads me to believe it's not supported.
```
D:\Redacted\>optimum-cli export onnx --model microsoft/Florence-2-large --trust-remote-code --framework pt flo2
D:\miniconda3\envs\onnx-export\Lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\miniconda3\envs\onnx-export\Scripts\optimum-cli.exe\__main__.py", line 7, in <module>
  File "D:\miniconda3\envs\onnx-export\Lib\site-packages\optimum\commands\optimum_cli.py", line 163, in main
    service.run()
  File "D:\miniconda3\envs\onnx-export\Lib\site-packages\optimum\commands\export\onnx.py", line 265, in run
    main_export(
  File "D:\miniconda3\envs\onnx-export\Lib\site-packages\optimum\exporters\onnx\__main__.py", line 280, in main_export
    model = TasksManager.get_model_from_task(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\miniconda3\envs\onnx-export\Lib\site-packages\optimum\exporters\tasks.py", line 1950, in get_model_from_task
    model = model_class.from_pretrained(model_name_or_path, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\miniconda3\envs\onnx-export\Lib\site-packages\transformers\models\auto\auto_factory.py", line 566, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.microsoft.Florence-2-large.ef29c9b007f906bd278c39bc12ae620398d88c88.configuration_florence2.Florence2Config'> for this kind of AutoModel: AutoModelForVision2Seq.
Model type should be one of BlipConfig, Blip2Config, GitConfig, Idefics2Config, InstructBlipConfig, Kosmos2Config, LlavaConfig, LlavaNextConfig, PaliGemmaConfig, Pix2StructConfig, VideoLlavaConfig, VipLlavaConfig, VisionEncoderDecoderConfig.
```
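For context, the checkpoint itself loads fine outside of optimum when you use the auto class the Florence-2 remote code actually registers with; the model card loads it via AutoModelForCausalLM, which is why the AutoModelForVision2Seq lookup above raises:

```python
# Sanity check (adapted from the Florence-2 model card): the remote code
# registers the model under AutoModelForCausalLM, not AutoModelForVision2Seq,
# so optimum's task lookup fails even though the checkpoint loads.
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)
print(type(model).__name__)  # Florence2ForConditionalGeneration
```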
Motivation
I want to export an ONNX model of Florence-2. I kind of thought that's the type of thing this tool is used for, yeah? Take a non-ONNX repo from Hugging Face and export an ONNX model.
Your contribution
Not really. I'm willing to test if there's something y'all want me to try?
+1
Can anyone give some tips on how to export?
I've refactored the DaViT part of Florence-2 to be compatible with Hugging Face, if this helps:
https://huggingface.co/amaye15/DaViT-Florence-2-large-ft
So did you succeed in exporting to ONNX?
Hi, thanks for your reply. Did you export the language model of Florence-2?
Not yet, might have a look at that this weekend.
@amaye15 Hi, thank you for your advice. Following your DaViT solution, I have succeeded in exporting my model, but I have no idea how to do post-processing; the generated result is an array of float32. Can you give me some advice on what to do next?
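For anyone with the same question, a minimal sketch of one way to post-process, assuming the float32 array is vocabulary logits of shape (batch, seq_len, vocab_size) and that you kept the Florence-2 processor around (its remote code ships a post_process_generation helper, shown on the model card). The names `logits`, `processor`, and `image` are placeholders for your own variables:

```python
import numpy as np

# Assumptions: `logits` is your decoder's float32 output, shaped
# (batch, seq_len, vocab_size); `processor` is the Florence-2 AutoProcessor;
# `image` is the PIL image you ran; "<OD>" is the task prompt you used.
token_ids = np.argmax(logits, axis=-1)  # greedy token pick per position
text = processor.batch_decode(token_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(
    text, task="<OD>", image_size=(image.width, image.height)
)
print(parsed)  # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}
```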
related to #1949
Any updates on this?
This issue has been marked as stale because it has been open for 30 days with no activity. This thread will be automatically closed in 5 days if no further activity occurs.
I've published my conversion code here, if anyone is interested :)
@xenova Joshua, thanks for publishing your Florence-2 ONNX conversion code. I apologize for the dumb question, but how would I take the set of ONNX files your export code produces and properly use them to do, say, pure object-detection inference with Florence-2 in PyTorch?
I'm guessing it would resemble something like the broken code below, but I'm not really sure how to proceed?

```python
import onnxruntime as ort
import numpy as np
from PIL import Image
from transformers import AutoProcessor

# Load ONNX models
vision_encoder_session = ort.InferenceSession("converted/vision_encoder.onnx")
encoder_session = ort.InferenceSession("converted/encoder_model.onnx")
decoder_session = ort.InferenceSession("converted/decoder_model_merged.onnx")

# Load the Florence-2 processor for preprocessing
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

def process_image(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    return inputs["pixel_values"].numpy()

def encode_image(pixel_values):
    # Run the DaViT vision tower to get image features
    return vision_encoder_session.run(["image_features"], {"pixel_values": pixel_values})[0]

def initialize_past_key_values(num_layers, batch_size, num_heads, seq_length, head_dim):
    # Zero-filled KV cache for the first (non-cached) decoder pass
    shape = (batch_size, num_heads, seq_length, head_dim)
    return {
        f"past_key_values.{layer}.{branch}.{kind}": np.zeros(shape, dtype=np.float32)
        for layer in range(num_layers)
        for branch in ("decoder", "encoder")
        for kind in ("key", "value")
    }

def detect_objects(encoder_outputs):
    batch_size, seq_len, hidden_dim = encoder_outputs.shape
    encoder_attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)
    num_layers = 6  # was undefined; the right count depends on the exported variant
    num_heads = 12
    head_dim = hidden_dim // num_heads  # calculate per-head dimension
    past_key_values = initialize_past_key_values(num_layers, batch_size, num_heads, seq_len, head_dim)
    inputs_embeds = np.zeros((batch_size, seq_len, hidden_dim), dtype=np.float32)
    use_cache_branch = np.array([False])  # boolean tensor in optimum's merged decoders
    decoder_inputs = {
        "encoder_attention_mask": encoder_attention_mask,
        "encoder_hidden_states": encoder_outputs,
        "inputs_embeds": inputs_embeds,
        "use_cache_branch": use_cache_branch,
        **past_key_values,
    }
    # Single forward pass; this argmax is not a real generation loop
    decoder_outputs = decoder_session.run(["logits"], decoder_inputs)[0]
    token_ids = np.argmax(decoder_outputs, axis=-1)
    return processor.tokenizer.batch_decode(token_ids, skip_special_tokens=True)

if __name__ == "__main__":
    pixel_values = process_image("test.jpg")
    image_features = encode_image(pixel_values)
    detected_objects = detect_objects(image_features)
    print(detected_objects)
```
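The snippet above runs the decoder once over zero embeddings rather than generating autoregressively. Below is a rough sketch of the greedy loop the merged decoder is built for, building on the snippet above and under several unverified assumptions: the export also produced an embed_tokens.onnx with an `input_ids` input (as the transformers.js Florence-2 conversions do), `encoder_hidden_states` comes from running encoder_model.onnx over the concatenated image features and embedded task prompt, the cache outputs follow optimum's `present.*` naming, a zero-length cache is accepted on the first pass, `num_layers`/`num_heads` match the base variant, and BOS is the decoder start token (check `config.decoder_start_token_id`):

```python
# Hypothetical greedy-decoding loop; names, shapes, and cache handling are
# assumptions, not verified against the actual export.
embed_session = ort.InferenceSession("converted/embed_tokens.onnx")  # assumed file

def generate_greedy(encoder_hidden_states, max_new_tokens=128):
    batch_size, enc_len, hidden_dim = encoder_hidden_states.shape
    encoder_attention_mask = np.ones((batch_size, enc_len), dtype=np.int64)
    num_layers, num_heads = 6, 12  # assumed Florence-2-base sizes
    head_dim = hidden_dim // num_heads
    # First pass: zero-length cache, cache branch off (optimum's convention)
    past = initialize_past_key_values(num_layers, batch_size, num_heads, 0, head_dim)
    use_cache = np.array([False])
    tokens = [processor.tokenizer.bos_token_id]  # assumed decoder start token
    output_names = [o.name for o in decoder_session.get_outputs()]  # "logits", "present.*"
    for _ in range(max_new_tokens):
        input_ids = np.array([tokens[-1:]], dtype=np.int64)
        inputs_embeds = embed_session.run(None, {"input_ids": input_ids})[0]
        outputs = decoder_session.run(None, {
            "encoder_attention_mask": encoder_attention_mask,
            "encoder_hidden_states": encoder_hidden_states,
            "inputs_embeds": inputs_embeds,
            "use_cache_branch": use_cache,
            **past,
        })
        next_token = int(np.argmax(outputs[0][0, -1]))
        tokens.append(next_token)
        if next_token == processor.tokenizer.eos_token_id:
            break
        # Feed "present.*" outputs back in as "past_key_values.*" inputs
        past = {name.replace("present", "past_key_values"): value
                for name, value in zip(output_names[1:], outputs[1:])}
        use_cache = np.array([True])
    return processor.tokenizer.decode(tokens, skip_special_tokens=False)
```

The decoded string can then go through `processor.post_process_generation` as in the earlier sketch to recover boxes and labels for detection tasks.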