Running a fine-tuned Florence-2 model locally
I would like to run a local vision model, ultimately fine-tuned with domain-specific images and data, on PCs and use the NPU (Snapdragon Hexagon) or GPU depending on the PC. Florence-2 (ONNX) seems like the perfect candidate, but it will not run on the NPU using the Microsoft.ML.OnnxRuntime.QNN library. However, I understand that Microsoft is doing exactly that in the Microsoft Vision SDK. How do I reuse this in a way that lets me fine-tune the Florence-2 model? (Or where can I find the Florence-2 model that Microsoft is using, running natively on the Hexagon NPU?) Thanks!
Hi @v-croft
Microsoft is indeed leveraging Florence-2 in the Vision SDK, and while ONNX Runtime QNN doesn’t yet support Florence-2 directly on Hexagon NPU, there are promising paths forward.
Florence-2 ONNX Availability
- Microsoft has released Florence-2 models on Hugging Face in PyTorch format, and the community (via Xenova) has converted them to ONNX.
- These ONNX models are suitable for local inference on GPU and CPU, but they are not yet optimized for the Hexagon NPU via QNN.
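If you want to grab the community ONNX export for local GPU/CPU testing, something like the following works. The repo id below is an assumption — check Hugging Face for the current onnx-community/Xenova listing:

```python
# Hedged sketch: download the community ONNX export of Florence-2.
# The repo id "onnx-community/Florence-2-base-ft" is an assumption --
# verify the exact repository name on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="onnx-community/Florence-2-base-ft",  # assumed repo id
    allow_patterns=["onnx/*", "*.json"],          # ONNX weights + configs only
)
print("ONNX files downloaded to:", local_dir)
```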
Why QNN Doesn’t Work Yet
- Florence-2 uses a causal decoder architecture with vision-language fusion, which isn’t fully supported by QNN’s current operator set.
- QNN excels with CNNs and lightweight transformers, but Florence-2’s sequence-to-sequence design and dynamic token generation pose challenges.
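To see concretely which Florence-2 ops QNN rejects, you can ask ONNX Runtime to log its graph partitioning: at verbose log level it reports which nodes were assigned to the QNN execution provider and which fell back to CPU. The model path below is a placeholder:

```python
# Hedged sketch: inspect how ONNX Runtime partitions a Florence-2 graph
# between QNNExecutionProvider and the CPU fallback.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.log_severity_level = 0  # VERBOSE: logs node-to-provider assignments

session = ort.InferenceSession(
    "florence2_encoder.onnx",                      # placeholder model path
    sess_options=opts,
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",                    # fallback for unsupported ops
    ],
)
print(session.get_providers())
```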
What Microsoft Is Doing in Vision SDK
- The Microsoft Vision SDK uses Florence-2 variants optimized for edge deployment, including:
  - Quantized models
  - Operator fusion tailored for Hexagon
  - Custom runtime layers not exposed via public ONNX/QNN APIs
These models are not publicly released in ONNX form for fine-tuning, but they are used internally for tasks like OCR, object detection, and captioning on Surface and Snapdragon devices. See EdgeAI for Beginners (Windows AI PC Developers) and its samples, as well as the AI Dev Gallery Windows 11 app.
Reuse and Fine-Tune Florence-2
1. Start with Hugging Face PyTorch Models
- Use `microsoft/Florence-2-base` or `microsoft/Florence-2-large` from Hugging Face
- Fine-tune using domain-specific data with PyTorch + Hugging Face Transformers (sketched below)
- Example fine-tuning guide: Genspark Tutorial and Microsoft Olive
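A minimal fine-tuning sketch, assuming an image/caption dataset you supply (`your_dataset` below is a placeholder, and the hyperparameters are illustrative only). Florence-2 ships custom modeling code, hence `trust_remote_code=True`:

```python
# Minimal fine-tuning sketch; your_dataset is a hypothetical iterable of
# (PIL image, target caption) pairs that you must supply.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # illustrative LR
model.train()

for image, caption in your_dataset:  # placeholder dataset
    inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
    labels = processor.tokenizer(caption, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # seq2seq LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```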
2. Convert to ONNX After Fine-Tuning (Olive does this natively)
- Use `torch.onnx.export()` to convert your fine-tuned model
- Optimize with `onnxruntime-tools` and quantize if needed
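Exporting the full seq2seq graph in one shot is nontrivial, so a pragmatic sketch is to export the vision encoder on its own. The `vision_tower` attribute name is an assumption — inspect the Florence-2 custom modeling code for how the DaViT encoder is actually exposed, or let Olive drive the whole export/quantization pipeline instead:

```python
# Hedged sketch: export only the vision encoder via torch.onnx.export.
# `model` is the fine-tuned Florence-2 model from the previous step;
# `vision_tower` is an assumed attribute name for its DaViT encoder.
import torch

model.eval()
dummy_pixels = torch.randn(1, 3, 768, 768)  # Florence-2 processor output size

torch.onnx.export(
    model.vision_tower,                      # assumed encoder attribute
    dummy_pixels,
    "florence2_vision_encoder.onnx",
    input_names=["pixel_values"],
    output_names=["image_features"],
    dynamic_axes={"pixel_values": {0: "batch"}},
    opset_version=17,
)
```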
3. Test with QNN or Fall Back to GPU
- Try loading the ONNX model with `Microsoft.ML.OnnxRuntime.QNN`
- If unsupported ops block NPU execution, fall back to GPU or CPU
- Consider segmenting Florence-2 into submodules (e.g., vision encoder + text decoder) and running only the encoder on NPU, as in the sketch below
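A sketch of that split from Python (the same provider options apply from C# via Microsoft.ML.OnnxRuntime.QNN). The file names and the encoder-to-decoder hand-off are illustrative, assuming you exported the submodules separately as above:

```python
# Hedged sketch of the suggested split: run the exported vision encoder
# on the Hexagon NPU via QNN, keep the text decoder on CPU.
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession(
    "florence2_vision_encoder.onnx",  # from the export sketch above
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",       # safety net for ops QNN rejects
    ],
)
decoder = ort.InferenceSession(
    "florence2_decoder.onnx",         # assumed separate decoder export
    providers=["CPUExecutionProvider"],
)

pixel_values = np.random.rand(1, 3, 768, 768).astype(np.float32)
image_features = encoder.run(None, {"pixel_values": pixel_values})[0]
# Feed the encoder output into the decoder's generation loop from here.
```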
Note
- Microsoft has acknowledged Florence-2 ONNX support as a roadmap item
- Stay tuned to onnxruntime-genai GitHub for updates on QNN compatibility and SDK releases
Hi @leestott Thank you for the quick and comprehensive answer. I had tried most of what you suggested, and your summary leads me to think that this is not going to work currently. The problem seems to be that the model, as it currently stands, will not work with the QNN execution layer. In practice, I can only get the decoder to run natively on the NPU; the vision model and encoder both fail. The QNN documentation suggests there is still a lot of work to be done to properly support ONNX models like Florence-2. It is a pity to ship good NPU hardware (seemingly even better in the next Snapdragon Hexagon iteration) without the software layers needed to use it easily! It would be great if Microsoft could make its QNN-tailored version of Florence-2 available for training. Or am I missing something?