
Support multimodal models such as LLaVA for image input

Open cebtenzzre opened this issue 8 months ago • 4 comments

Feature request

We can make use of the upstream work at https://github.com/ggerganov/llama.cpp/pull/3436 to support image input to LLMs.

@AndriyMulyar What was the name of the model that you wanted to consider as an alternative to LLaVA?

Motivation

Real-time image recognition on resource-constrained hardware would be very useful in applications such as robotics. This feature would open the door to broader use cases for GPT4All than simple text completion.

Your contribution

I may submit a pull request implementing this functionality.

cebtenzzre, Oct 24 '23 18:10

Fuyu 8b is interesting because it's decoder-only.

I think a LLaVA-style approach is a fine choice for an initial multimodal implementation, though.

AndriyMulyar, Oct 24 '23 19:10

This will require extensive changes to the GUI as well. It has been agreed that the GUI changes will come first, to provide a UI for the current upstream multimodal support.

manyoso, Oct 24 '23 22:10

+1

eiko4, Oct 31 '23 11:10

+1

PedzacyKapec, Dec 01 '23 18:12