nanoVLM support
Question
I would like to know if there is any plan to support models built with [nanoVLM](https://github.com/huggingface/nanoVLM). Thanks!
It shouldn't be too difficult to write an ONNX conversion script, since all the components are already supported by Transformers.js. We could simplify this a lot if we were able to load the different modules with HF transformers (vision encoder & Llama decoder). Do you know if that's possible (i.e., whether they can be made compatible with LlamaForCausalLM and SiglipModel)?
cc @lusxvr
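Something like the following is what I have in mind. This is a minimal sketch, assuming the components map directly onto these classes; the checkpoint IDs are the defaults nanoVLM builds on, as far as I know:

```python
from transformers import LlamaForCausalLM, SiglipVisionModel

# Language decoder (SmolLM2 uses the Llama architecture)
decoder = LlamaForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Vision encoder (SigLIP)
vision_encoder = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
```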
I'm working on a pull request for nanoVLM to convert the various components to ONNX. After that, I plan to run some compatibility tests with HF transformers.
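For example, exporting the vision encoder could look something like this. This is just a sketch, assuming `SiglipVisionModel` can load the weights directly; the small wrapper makes the exported graph return a plain tensor instead of a `ModelOutput`:

```python
import torch
from transformers import SiglipVisionModel

class VisionEncoderWrapper(torch.nn.Module):
    """Unwraps the ModelOutput so the exported graph returns a plain tensor."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values):
        return self.model(pixel_values=pixel_values).last_hidden_state

vision = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224").eval()
wrapper = VisionEncoderWrapper(vision)

dummy_pixels = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
torch.onnx.export(
    wrapper,
    (dummy_pixels,),
    "vision_encoder.onnx",
    input_names=["pixel_values"],
    output_names=["last_hidden_state"],
    dynamic_axes={"pixel_values": {0: "batch_size"}},
    opset_version=17,
)
```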
Yes, it should be possible! We load the pretrained weights for SmolLM2 and SigLIP from the Hub. Additionally, by now the architecture is essentially the same as SmolVLM's; only the weights have different names.
There is also a PR (a bit stale, unfortunately) that converts the SmolVLM weights to nanoVLM, so it could serve as inspiration for going in the opposite direction:
https://github.com/huggingface/nanoVLM/pull/74
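So the conversion should mostly be a matter of remapping state-dict keys, along these lines. The prefixes below are purely illustrative (the real names would have to be read off the two checkpoints), and the checkpoint path is hypothetical:

```python
import torch

# Illustrative prefix mapping only; the actual nanoVLM/SmolVLM key names
# differ and would need to be read off the two state dicts.
KEY_MAP = {
    "vision_encoder.": "model.vision_model.",
    "decoder.": "model.text_model.",
    "MP.": "model.connector.",
}

def rename_key(key: str) -> str:
    for old, new in KEY_MAP.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key

state_dict = torch.load("nanovlm.pt", map_location="cpu")  # hypothetical path
remapped = {rename_key(k): v for k, v in state_dict.items()}
torch.save(remapped, "converted.pt")
```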
Apart from this, I think the base building blocks are compatible with LlamaForCausalLM and SiglipModel out of the box, and the connector is a simple MLP.
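For reference, here is a minimal sketch of what that connector amounts to. This is not the exact nanoVLM code (which, if I remember correctly, also applies a pixel-shuffle step before the projection to reduce the number of image tokens); the dimensions are just the SigLIP-base and SmolLM2-135M hidden sizes:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps vision-encoder hidden states into the LM embedding space."""
    def __init__(self, vision_dim: int = 768, lm_dim: int = 576):
        # 768 = SigLIP-base hidden size, 576 = SmolLM2-135M hidden size
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim, bias=False)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_image_tokens, vision_dim) -> (batch, num_image_tokens, lm_dim)
        return self.proj(image_features)
```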