gpt4all
Support multimodal models such as LLaVA for image input
Feature request
We can make use of the upstream work at https://github.com/ggerganov/llama.cpp/pull/3436 to support image input to LLMs.
@AndriyMulyar What was the name of the model that you wanted to consider as an alternative to LLaVA?
Motivation
Real-time image recognition on resource-constrained hardware would be very useful in applications such as robotics. This feature would open the door to broader use cases for GPT4All than simple text completion.
Your contribution
I may submit a pull request implementing this functionality.
Fuyu-8B is interesting because it's decoder-only.
I think a LLaVA-style architecture is a fine choice for an initial multimodal implementation, though.
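The following toy sketch compares how the two styles mentioned above might assemble the model's input sequence. It is not GPT4All or llama.cpp code. All function names are hypothetical, and the short vectors stand in for real embeddings. In the LLaVA-style path, a separate vision encoder (a CLIP ViT in LLaVA) produces patch features, and a projector maps them into the LLM's embedding space. In the Fuyu-style path, there is no separate image encoder: raw patch values are linearly projected straight into the decoder. In both cases, the image embeddings are spliced into the text embeddings at a placeholder position. Real models differ in details this sketch omits. For example, LLaVA-1.5 uses a two-layer MLP projector, and Fuyu marks the end of each row of patches with a special newline token.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def llava_style_inputs(patches: List[Vector],
                       vision_encoder: Callable[[Vector], Vector],
                       projector: Matrix,
                       text_embeds: List[Vector],
                       image_pos: int) -> List[Vector]:
    # 1. A separate vision encoder turns each image patch into a feature vector.
    features = [vision_encoder(p) for p in patches]
    # 2. A projector (a single linear map here, for brevity) moves the
    #    features into the LLM's embedding space.
    image_embeds = [matvec(projector, f) for f in features]
    # 3. The image embeddings are spliced in at the image placeholder.
    return text_embeds[:image_pos] + image_embeds + text_embeds[image_pos:]


def fuyu_style_inputs(patches: List[Vector],
                      patch_proj: Matrix,
                      text_embeds: List[Vector],
                      image_pos: int) -> List[Vector]:
    # No separate vision encoder: raw patch values are linearly projected
    # directly into the decoder's embedding space.
    image_embeds = [matvec(patch_proj, p) for p in patches]
    return text_embeds[:image_pos] + image_embeds + text_embeds[image_pos:]


def toy_encoder(patch: Vector) -> Vector:
    """Hypothetical stand-in for a vision encoder: doubles every value."""
    return [2 * v for v in patch]


if __name__ == "__main__":
    patches = [[1.0, 2.0], [3.0, 4.0]]   # two toy "patches"
    text = [[0.1, 0.1], [0.2, 0.2]]      # two toy token embeddings
    identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in projection weights
    print(llava_style_inputs(patches, toy_encoder, identity, text, 1))
    # [[0.1, 0.1], [2.0, 4.0], [6.0, 8.0], [0.2, 0.2]]
    print(fuyu_style_inputs(patches, identity, text, 1))
    # [[0.1, 0.1], [1.0, 2.0], [3.0, 4.0], [0.2, 0.2]]
```

With identity projection weights, the only difference between the two outputs is the extra encoder step. This makes visible why a decoder-only design like Fuyu needs one fewer model component to load, while a LLaVA-style design can reuse an existing pretrained image encoder.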
This will require extensive changes to the GUI as well. It has been agreed that the GUI changes will come first, to provide a UI for the multimodal support currently available upstream.
+1
+1