Running a fine-tuned Florence-2 model locally
I would like to run a local vision model, ultimately fine-tuned with domain-specific images and data, on PCs and use the NPU (Snapdragon Hexagon) or GPU depending on the PC. Florence-2 (ONNX) seems like the perfect candidate, but it will not run on the NPU using the Microsoft.ML.OnnxRuntime.QNN library. However, I understand that Microsoft is doing exactly that in the Microsoft Vision SDK. How do I reuse this in a way that lets me fine-tune the Florence-2 model? (Or where can I find the Florence-2 model that Microsoft is using, running natively on the Hexagon NPU?) Thanks!
Hi @v-croft
Microsoft is indeed leveraging Florence-2 in the Vision SDK, and while ONNX Runtime QNN doesn’t yet support Florence-2 directly on Hexagon NPU, there are promising paths forward.
Florence-2 ONNX Availability
- Microsoft has released Florence-2 models on Hugging Face in PyTorch format, and the community (via Xenova) has converted them to ONNX.
- These ONNX models are suitable for local inference on GPU and CPU, but they are not yet optimized for the Hexagon NPU via QNN.
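If you want to grab the community ONNX export for local GPU/CPU testing, something like the following works. The repo id below is an assumption — check Hugging Face for the current onnx-community/Xenova listing:

```python
# Hedged sketch: download the community ONNX export of Florence-2.
# The repo id "onnx-community/Florence-2-base-ft" is an assumption --
# verify the exact repository name on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="onnx-community/Florence-2-base-ft",  # assumed repo id
    allow_patterns=["onnx/*", "*.json"],          # ONNX weights + configs only
)
print("ONNX files downloaded to:", local_dir)
```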
Why QNN Doesn’t Work Yet
- Florence-2 uses a causal decoder architecture with vision-language fusion, which isn’t fully supported by QNN’s current operator set.
- QNN excels with CNNs and lightweight transformers, but Florence-2’s sequence-to-sequence design and dynamic token generation pose challenges.
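To see concretely which Florence-2 ops QNN rejects, you can ask ONNX Runtime to log its graph partitioning: at verbose log level it reports which nodes were assigned to the QNN execution provider and which fell back to CPU. The model path below is a placeholder:

```python
# Hedged sketch: inspect how ONNX Runtime partitions a Florence-2 graph
# between QNNExecutionProvider and the CPU fallback.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.log_severity_level = 0  # VERBOSE: logs node-to-provider assignments

session = ort.InferenceSession(
    "florence2_encoder.onnx",                      # placeholder model path
    sess_options=opts,
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",                    # fallback for unsupported ops
    ],
)
print(session.get_providers())
```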
What Microsoft Is Doing in Vision SDK
- The Microsoft Vision SDK uses Florence-2 variants optimized for edge deployment, including:
  - Quantized models
  - Operator fusion tailored for Hexagon
  - Custom runtime layers not exposed via public ONNX/QNN APIs
These models are not publicly released in ONNX form for fine-tuning, but they are used internally for tasks like OCR, object detection, and captioning on Surface and Snapdragon devices. See EdgeAI for Beginners (Windows AI PC Developers) and its samples, as well as the AI Dev Gallery Windows 11 app.
Reuse and Fine-Tune Florence-2
1. Start with Hugging Face PyTorch Models
- Use `microsoft/Florence-2-base` or `microsoft/Florence-2-large` from Hugging Face
- Fine-tune using domain-specific data with PyTorch + Hugging Face Transformers (sketched below)
- Example fine-tuning guide: Genspark Tutorial and Microsoft Olive
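A minimal fine-tuning sketch, assuming an image/caption dataset you supply (`your_dataset` below is a placeholder, and the hyperparameters are illustrative only). Florence-2 ships custom modeling code, hence `trust_remote_code=True`:

```python
# Minimal fine-tuning sketch; your_dataset is a hypothetical iterable of
# (PIL image, target caption) pairs that you must supply.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # illustrative LR
model.train()

for image, caption in your_dataset:  # placeholder dataset
    inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
    labels = processor.tokenizer(caption, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # seq2seq LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```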
2. Convert to ONNX After Fine-Tuning (Olive does this natively)
- Use `torch.onnx.export()` to convert your fine-tuned model
- Optimize with `onnxruntime-tools` and quantize if needed
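Exporting the full seq2seq graph in one shot is nontrivial, so a pragmatic sketch is to export the vision encoder on its own. The `vision_tower` attribute name is an assumption — inspect the Florence-2 custom modeling code for how the DaViT encoder is actually exposed, or let Olive drive the whole export/quantization pipeline instead:

```python
# Hedged sketch: export only the vision encoder via torch.onnx.export.
# `model` is the fine-tuned Florence-2 model from the previous step;
# `vision_tower` is an assumed attribute name for its DaViT encoder.
import torch

model.eval()
dummy_pixels = torch.randn(1, 3, 768, 768)  # Florence-2 processor output size

torch.onnx.export(
    model.vision_tower,                      # assumed encoder attribute
    dummy_pixels,
    "florence2_vision_encoder.onnx",
    input_names=["pixel_values"],
    output_names=["image_features"],
    dynamic_axes={"pixel_values": {0: "batch"}},
    opset_version=17,
)
```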
3. Test with QNN or Fall Back to GPU
- Try loading the ONNX model with `Microsoft.ML.OnnxRuntime.QNN`
- If unsupported ops block NPU execution, fall back to GPU or CPU
- Consider segmenting Florence-2 into submodules (e.g., vision encoder + text decoder) and running only the encoder on NPU, as in the sketch below
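A sketch of that split from Python (the same provider options apply from C# via Microsoft.ML.OnnxRuntime.QNN). The file names and the encoder-to-decoder hand-off are illustrative, assuming you exported the submodules separately as above:

```python
# Hedged sketch of the suggested split: run the exported vision encoder
# on the Hexagon NPU via QNN, keep the text decoder on CPU.
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession(
    "florence2_vision_encoder.onnx",  # from the export sketch above
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",       # safety net for ops QNN rejects
    ],
)
decoder = ort.InferenceSession(
    "florence2_decoder.onnx",         # assumed separate decoder export
    providers=["CPUExecutionProvider"],
)

pixel_values = np.random.rand(1, 3, 768, 768).astype(np.float32)
image_features = encoder.run(None, {"pixel_values": pixel_values})[0]
# Feed the encoder output into the decoder's generation loop from here.
```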
Note
- Microsoft has acknowledged Florence-2 ONNX support as a roadmap item
- Stay tuned to onnxruntime-genai GitHub for updates on QNN compatibility and SDK releases
Hi @leestott Thank you for the quick and comprehensive answer. I had tried most of what you suggested, and your summary leads me to think that this is not going to work currently. The problem seems to be that the model, as it currently stands, will not work with the QNN execution layer. In practice, I can only get the decoder to run natively on the NPU; the vision model and encoder both fail. The QNN documentation suggests there is still a lot of work to be done to properly support ONNX models like Florence-2. It is a pity to ship good NPU hardware (seemingly even better in the next Snapdragon Hexagon iteration) without the software layers needed to use it easily! It would be great if Microsoft could make its QNN-tailored version of Florence-2 available for training. Or am I missing something?