FR: Utilise vision models via image attachments
Currently, Sidekick only allows document file uploads but does not support actual image analysis. Adding vision model capabilities would allow users to upload images for processing, enabling object recognition, scene description, text extraction (OCR), and other insights.
@juioudgdgeue894
Thanks for the suggestion! 🙏
Sidekick uses llama.cpp's llama-server for inference. Unfortunately, llama-server doesn't currently support image attachments, but the llama.cpp team is actively working on adding that capability.
It seems like a lot of folks want this feature, so I will most likely add support for image attachments (local and remote inference) to Sidekick when llama.cpp brings VLM support to llama-server.
That being said, if anyone wants to implement this feature for remote inference right away, I'm open to a PR!
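For anyone tempted to take that on: here's a minimal sketch of what the remote path could look like, assuming an OpenAI-compatible chat completions endpoint. The function name, endpoint URL, and model id are illustrative only, not Sidekick's actual code.

```swift
import Foundation

// Minimal sketch: send a prompt plus a base64-encoded image to an
// OpenAI-compatible chat completions endpoint (OpenRouter shown here).
// Function name, endpoint, and model id are illustrative placeholders.
func sendImagePrompt(prompt: String, imageURL: URL, apiKey: String) async throws -> Data {
    let imageData = try Data(contentsOf: imageURL)
    let dataURL = "data:image/png;base64,\(imageData.base64EncodedString())"

    // OpenAI-style multimodal message: a text part plus an image_url part
    let textPart: [String: Any] = ["type": "text", "text": prompt]
    let imagePart: [String: Any] = [
        "type": "image_url",
        "image_url": ["url": dataURL]
    ]
    let message: [String: Any] = ["role": "user", "content": [textPart, imagePart]]
    let body: [String: Any] = [
        "model": "qwen/qwen2.5-vl-72b-instruct", // example model id
        "messages": [message]
    ]

    var request = URLRequest(url: URL(string: "https://openrouter.ai/api/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)
    return data
}
```

Most OpenAI-compatible providers accept base64 data URLs in `image_url` parts, so the same message shape should carry over between hosted VLMs.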
Ahh - understood. In that case, I look forward to it being implemented. It'll take Sidekick up another notch!
Wish I could contribute - unfortunately I'm not a programmer! Happy to continue bug testing and feature requesting though. Love the app!
@juioudgdgeue894
Might be a bit ambitious, but when VLM support comes around, I'll see if I can extend resource search in experts to calculate embeddings for images too, so RAG can be done on images as well.
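For context, the retrieval side of that would look roughly like the sketch below: store one embedding per image resource and rank images by cosine similarity against the query embedding. The embedding step itself is left out because the image embedding model is undecided; `ImageResource` and `topImages` are illustrative names, not existing Sidekick code.

```swift
import Foundation

// Sketch of how image RAG could slot into resource search: each image in an
// expert's resources carries a precomputed embedding, and retrieval ranks
// images by cosine similarity against the query embedding.
struct ImageResource {
    let url: URL
    let embedding: [Float]
}

// Standard cosine similarity between two vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    return dot / (normA * normB + 1e-9)
}

// Return the images most similar to the query embedding, best match first.
func topImages(for queryEmbedding: [Float], in resources: [ImageResource], limit: Int = 3) -> [ImageResource] {
    resources
        .map { (resource: $0, score: cosineSimilarity($0.embedding, queryEmbedding)) }
        .sorted { $0.score > $1.score }
        .prefix(limit)
        .map { $0.resource }
}
```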
@juioudgdgeue894
As of commit 4378a61, support has been added for remote VLMs. This has been tested with OpenRouter and Alibaba Cloud.
@juioudgdgeue894
Vision support has now been added to llama-server!
I'll work on supporting local VLMs like Gemma 3.
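For the curious, here's a rough sketch of what launching a local VLM could look like: llama.cpp's multimodal support takes the main GGUF plus a vision projector via `--mmproj`. The helper name, file names, and port below are placeholders, not Sidekick's actual implementation.

```swift
import Foundation

// Sketch: spawn llama-server with a local VLM. llama.cpp's multimodal
// support pairs the main model GGUF with a projector GGUF via --mmproj.
// Paths, model files, and port are placeholders.
func launchVisionServer(serverBinary: URL, modelPath: String, projectorPath: String, port: Int = 8080) throws -> Process {
    let process = Process()
    process.executableURL = serverBinary
    process.arguments = [
        "-m", modelPath,           // e.g. a Gemma 3 instruct GGUF
        "--mmproj", projectorPath, // matching multimodal projector GGUF
        "--port", String(port)
    ]
    try process.run()
    return process
}
```

Once the server is up, the same OpenAI-compatible image request shown earlier should work against the local endpoint.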