FR: Utilise vision models via image attachments
Currently, Sidekick only allows document file uploads but does not support actual image analysis. Adding vision model capabilities would allow users to upload images for processing, enabling object recognition, scene description, text extraction (OCR), and other insights.
@juioudgdgeue894
Thanks for the suggestion! 🙏
Sidekick uses llama.cpp's llama-server for inference. Unfortunately, llama-server doesn't currently support image attachments, but the llama.cpp team is actively working on adding that capability.
It seems like a lot of folks want this feature, so I will most likely add support for image attachments (local and remote inference) to Sidekick when llama.cpp brings VLM support to llama-server.
That being said, if anyone wants to implement this feature for remote inference right away, I'm open to a PR!
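For anyone tempted to take that on: here's a minimal sketch of what the remote path could look like, assuming an OpenAI-compatible chat completions endpoint. The function name, endpoint URL, and model id are illustrative only, not Sidekick's actual code.

```swift
import Foundation

// Minimal sketch: send a prompt plus a base64-encoded image to an
// OpenAI-compatible chat completions endpoint (OpenRouter shown here).
// Function name, endpoint, and model id are illustrative placeholders.
func sendImagePrompt(prompt: String, imageURL: URL, apiKey: String) async throws -> Data {
    let imageData = try Data(contentsOf: imageURL)
    let dataURL = "data:image/png;base64,\(imageData.base64EncodedString())"

    // OpenAI-style multimodal message: a text part plus an image_url part
    let textPart: [String: Any] = ["type": "text", "text": prompt]
    let imagePart: [String: Any] = [
        "type": "image_url",
        "image_url": ["url": dataURL]
    ]
    let message: [String: Any] = ["role": "user", "content": [textPart, imagePart]]
    let body: [String: Any] = [
        "model": "qwen/qwen2.5-vl-72b-instruct", // example model id
        "messages": [message]
    ]

    var request = URLRequest(url: URL(string: "https://openrouter.ai/api/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)
    return data
}
```

Most OpenAI-compatible providers accept base64 data URLs in `image_url` parts, so the same message shape should carry over between hosted VLMs.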
Ahh - understood. In that case, I look forward to it being implemented. It'll take Sidekick up another notch!
Wish I could contribute - unfortunately I'm not a programmer! Happy to continue bug testing and feature requesting though. Love the app!
@juioudgdgeue894
Might be a bit ambitious, but when VLM support comes around, I'll see if I can extend resource search in experts to calculate embeddings for images too, so RAG can be done on images as well.
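For context, the retrieval side of that would look roughly like the sketch below: store one embedding per image resource and rank images by cosine similarity against the query embedding. The embedding step itself is left out because the image embedding model is undecided; `ImageResource` and `topImages` are illustrative names, not existing Sidekick code.

```swift
import Foundation

// Sketch of how image RAG could slot into resource search: each image in an
// expert's resources carries a precomputed embedding, and retrieval ranks
// images by cosine similarity against the query embedding.
struct ImageResource {
    let url: URL
    let embedding: [Float]
}

// Standard cosine similarity between two vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    return dot / (normA * normB + 1e-9)
}

// Return the images most similar to the query embedding, best match first.
func topImages(for queryEmbedding: [Float], in resources: [ImageResource], limit: Int = 3) -> [ImageResource] {
    resources
        .map { (resource: $0, score: cosineSimilarity($0.embedding, queryEmbedding)) }
        .sorted { $0.score > $1.score }
        .prefix(limit)
        .map { $0.resource }
}
```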
@juioudgdgeue894
As of commit 4378a61, support has been added for remote VLMs. This has been tested with OpenRouter and Alibaba Cloud.
@juioudgdgeue894
Vision support has now been added to llama-server!
I'll work on supporting local VLMs like Gemma 3.
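For the curious, here's a rough sketch of what launching a local VLM could look like: llama.cpp's multimodal support takes the main GGUF plus a vision projector via `--mmproj`. The helper name, file names, and port below are placeholders, not Sidekick's actual implementation.

```swift
import Foundation

// Sketch: spawn llama-server with a local VLM. llama.cpp's multimodal
// support pairs the main model GGUF with a projector GGUF via --mmproj.
// Paths, model files, and port are placeholders.
func launchVisionServer(serverBinary: URL, modelPath: String, projectorPath: String, port: Int = 8080) throws -> Process {
    let process = Process()
    process.executableURL = serverBinary
    process.arguments = [
        "-m", modelPath,           // e.g. a Gemma 3 instruct GGUF
        "--mmproj", projectorPath, // matching multimodal projector GGUF
        "--port", String(port)
    ]
    try process.run()
    return process
}
```

Once the server is up, the same OpenAI-compatible image request shown earlier should work against the local endpoint.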