gpt4all Support multimodal models such as LLaVA for image input

Open cebtenzzre opened this issue 1 year ago • 4 comments

Feature request

We can make use of the upstream work at https://github.com/ggerganov/llama.cpp/pull/3436 to support image input to LLMs.

@AndriyMulyar What was the name of the model that you wanted to consider as an alternative to LLaVA?

Motivation

Real-time image recognition on resource-constrained hardware would be very useful in applications such as robotics. This feature would open the door to broader use cases for GPT4All than simple text completion.

Your contribution

I may submit a pull request implementing this functionality.

Oct 24 '23 18:10 cebtenzzre

Fuyu-8B is interesting because it's decoder-only.

I think the LLaVA style is a fine choice, though, for an initial multimodal implementation.

Oct 24 '23 19:10 AndriyMulyar

This will require extensive changes to the GUI as well. It has been agreed that the GUI changes will come first, to provide a UI for the multimodal support that currently exists upstream.

Oct 24 '23 22:10 manyoso

Oct 31 '23 11:10 eiko4

Dec 01 '23 18:12 PedzacyKapec
