Enhancing SAM3 with VLM-FO1 for Complex Text Label Tasks

Open P3ngLiu opened this issue 1 month ago • 1 comments

First of all, thanks to the SAM3 team for your excellent work on text-conditioned segmentation. SAM3's powerful capabilities have been instrumental in enabling this integration.

I'm excited to share an interesting Gradio demo that combines SAM3 with VLM-FO1 to improve detection and segmentation on complex, compositional text label tasks! The integration uses SAM3's text-conditioned segmentation to generate proposals and VLM-FO1's fine-grained perception and reasoning to make results more reliable on challenging prompts.

✨ What Makes This Special?

🎯 Better Complex Label Handling: The combination excels at understanding compositional prompts like "airplane with letter AE on its body" or "the lying cat which is not black"

🎮 Try It Now!

🌐 Live Demo: Hugging Face Space
Or you can run it yourself from the ⭐ GitHub Repo: VLM-FO1

📊 How It Works

1. SAM3 generates initial detections and masks based on text prompts.
2. VLM-FO1 processes these proposals with its fine-grained reasoning capabilities.
3. The pipeline filters and labels the results, providing both raw SAM3 outputs and refined VLM-FO1 predictions.
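
For readers who want a feel for the control flow, here is a minimal Python sketch of the propose-then-verify steps above. It is not the demo's actual code and does not call the real SAM3 or VLM-FO1 APIs. The names `Proposal`, `Refined`, `refine`, `run_pipeline`, and the `min_score` threshold are hypothetical, and the two models are replaced by stub functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Proposal:
    """One raw detection from the proposal stage (hypothetical structure)."""
    box: Box
    score: float
    mask: Any = None  # placeholder for a segmentation mask


@dataclass
class Refined:
    """A proposal that the verification stage kept, with its label."""
    proposal: Proposal
    label: str


ProposeFn = Callable[[Any, str], List[Proposal]]
VerifyFn = Callable[[Proposal, str], Optional[str]]


def refine(proposals: List[Proposal], prompt: str,
           verify: VerifyFn, min_score: float = 0.0) -> List[Refined]:
    """Keep proposals that pass the score threshold and the verifier."""
    kept = []
    for p in proposals:
        if p.score < min_score:
            continue  # drop low-confidence proposals before verification
        label = verify(p, prompt)  # verifier returns None to reject
        if label is not None:
            kept.append(Refined(p, label))
    return kept


def run_pipeline(image: Any, prompt: str, propose: ProposeFn,
                 verify: VerifyFn, min_score: float = 0.0):
    """Return both the raw proposals and the refined predictions."""
    raw = propose(image, prompt)
    return raw, refine(raw, prompt, verify, min_score)


# Stubs standing in for the two models (assumptions, not real model calls).
def fake_propose(image: Any, prompt: str) -> List[Proposal]:
    return [
        Proposal((0, 0, 10, 10), 0.9),
        Proposal((5, 5, 20, 20), 0.8),
        Proposal((1, 1, 2, 2), 0.1),
    ]


def fake_verify(p: Proposal, prompt: str) -> Optional[str]:
    # Pretend only the box starting at x1 == 0 matches the full prompt.
    return prompt if p.box[0] == 0 else None
```

Calling `run_pipeline(None, "the lying cat which is not black", fake_propose, fake_verify, min_score=0.3)` returns all three raw proposals alongside the single refined prediction, mirroring how the demo shows raw and refined outputs side by side. In a real setup the stubs would be swapped for wrappers around the actual models.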

This is an initial exploration and proof-of-concept. We will work on further improvements later.

Nov 21 '25 21:11 P3ngLiu