
Add FastVLM from CVPR 2025


Model description

Apple recently released FastVLM, a new vision-language model introduced at CVPR 2025, which significantly improves on previous models in the LLaVA family.

The smallest FastVLM variant outperforms LLaVA-OneVision-0.5B, achieving:

  • 85× faster Time-to-First-Token (TTFT; see the measurement sketch below)
  • 3.4× smaller vision encoder

Its ultra-low latency enables real-time deployment on mobile and edge devices, as demonstrated in the official demo. Given its strong performance and efficiency, I believe FastVLM would be a valuable addition to this repo.
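
For context, TTFT is the latency from submitting a prompt until the first generated token arrives. A quick, generic way to measure it with transformers' streaming API (the model id below is just a small placeholder, not FastVLM):

```python
# Minimal TTFT measurement sketch using transformers' TextIteratorStreamer.
# The model id is a placeholder; any causal LM on the hub works the same way.
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Describe a cat.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True)

start = time.perf_counter()
Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, "max_new_tokens": 32}).start()
first_chunk = next(iter(streamer))  # blocks until the first new token is decoded
print(f"TTFT: {time.perf_counter() - start:.3f}s; first chunk: {first_chunk!r}")
```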

If you also think so, I'd love to contribute this model.

Open source status

  • [x] The model implementation is available
  • [x] The model weights are available

Provide useful links for the implementation

Paper: https://www.arxiv.org/abs/2412.13303
Official repo: https://github.com/apple/ml-fastvlm

kamila-chay avatar Jun 11 '25 17:06 kamila-chay

Hey @kamila-chay !

Yes, adding the model is already planned afaik. @ariG23498 had a nice inference script using transformers under this comment. To support the model in the core library we need to convert the weights, and we're waiting on the Apple team for that, I think.

zucchini-nlp avatar Jun 12 '25 07:06 zucchini-nlp

Hi @zucchini-nlp! Thank you for the answer; I hadn't seen the thread. Glad to hear it's already planned. Feel free to mention me here if you need community support for this model in the future (to add new components, etc.; not quite sure what will be needed).

kamila-chay avatar Jun 12 '25 14:06 kamila-chay

Now that it is public from Apple, I think we can start working on it.

Collection: https://huggingface.co/collections/apple/fastvlm-68ac97b9cd5cacefdd04872e

ariG23498 avatar Aug 29 '25 07:08 ariG23498

That's great to hear! I'm still willing to contribute. @ariG23498, let me know if I can open a PR and work on it. I have some free time now, so I can start right away 💯

kamila-chay avatar Aug 29 '25 10:08 kamila-chay

@kamila-chay that is great to hear.

@zucchini-nlp is that okay with you?

ariG23498 avatar Aug 29 '25 10:08 ariG23498

Sure, thanks @kamila-chay! I haven't personally looked at the model arch; it might be compatible with existing llava or llava-next model code. Can you first check that and, if so, adapt a conversion script?

It makes long-term maintenance easier if we don't need to add a new model class.
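
Roughly, the conversion script would mirror the existing convert_llava_weights_to_hf.py scripts: load the original state dict, remap key prefixes onto LlavaForConditionalGeneration's layout, and save with save_pretrained. A sketch of the remapping step (the FastVLM-side prefixes below are guesses I haven't checked against the checkpoint):

```python
# Illustrative key remapping, modeled on transformers' convert_llava_weights_to_hf.py.
# The left-hand prefixes are assumptions about the FastVLM checkpoint, not verified.
import re

# Order matters: more specific prefixes must come before the bare "model." rule.
KEY_MAPPING = {
    r"^model\.vision_tower\.": "vision_tower.",
    r"^model\.mm_projector\.": "multi_modal_projector.",
    r"^model\.": "language_model.model.",
    r"^lm_head\.": "language_model.lm_head.",
}

def convert_state_dict(old_sd: dict) -> dict:
    new_sd = {}
    for key, value in old_sd.items():
        for pattern, repl in KEY_MAPPING.items():
            if re.match(pattern, key):
                key = re.sub(pattern, repl, key, count=1)
                break
        new_sd[key] = value
    return new_sd
```

If the arch doesn't line up (e.g. the FastViTHD encoder has no llava counterpart), that's the signal we need a new model class after all.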

zucchini-nlp avatar Aug 29 '25 10:08 zucchini-nlp

Sure, I will do that!

kamila-chay avatar Aug 29 '25 10:08 kamila-chay

Hi @zucchini-nlp, @ariG23498, I just wanted to check in to make sure we're aligned. I noticed that remote code was added to the official Apple repos on the HF hub two days ago, and I wasn't sure whether, in that case, you still plan to integrate this model into the core library. There's quite a bit of new code to bring in if we go that route, so I wanted to clarify what the plan is 😃
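
For reference, the remote-code checkpoints already run with a stock transformers install, roughly like this (adapted from the model card; IMAGE_TOKEN_INDEX, get_vision_tower(), and the images= kwarg are conventions of Apple's remote code that I haven't re-verified):

```python
# Sketch of inference via the hub's remote code, adapted from the model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MID = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200  # sentinel id the remote code replaces with image features

tok = AutoTokenizer.from_pretrained(MID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MID, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# Render the chat template, then splice the sentinel id where "<image>" sits.
messages = [{"role": "user", "content": "<image>\nDescribe this image."}]
rendered = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
pre, post = rendered.split("<image>", 1)
pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)

img = Image.open("image.png").convert("RGB")
px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)

with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=torch.ones_like(input_ids),
        images=px,
        max_new_tokens=128,
    )
print(tok.decode(out[0], skip_special_tokens=True))
```

So users aren't blocked either way; the question is just whether we still want a native implementation.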

kamila-chay avatar Aug 31 '25 18:08 kamila-chay

I am fine with bringing the code into transformers, though let's ask @ariG23498 about what we agreed on with the Apple team. Just in case anything changed over the weekend :)

zucchini-nlp avatar Aug 31 '25 20:08 zucchini-nlp

Hi @ariG23498, any info on that?

kamila-chay avatar Sep 08 '25 08:09 kamila-chay

We are also looking to enable this model on ExecuTorch, so we would love to see it in Transformers! Are there any plans to integrate the HF hub modeling code into the main repo?

jackzhxng avatar Sep 16 '25 03:09 jackzhxng

Hey @kamila-chay, sorry for the delay. It looks like we are on! You can start working on the FastVLM integration into transformers.

ariG23498 avatar Sep 16 '25 10:09 ariG23498

Looking forward to it @kamila-chay 🙏

jackzhxng avatar Sep 18 '25 19:09 jackzhxng