improve LLaMA for visual understanding like GPT-4
Thanks for the great work!
We have tried to extend the LLaMA model to understand visual information and support multi-modal chat. We are inspired by the idea that a good ViT (e.g., the CLIP vision encoder) and a well-trained large language model (e.g., LLaMA), joined by a small connection network (e.g., an MLP or Transformer), can cover visual applications, as in PaLM-E (a minimal sketch of this connector idea follows the checklist below).
The results on image captioning, VQA, and other multi-modal tasks are promising at the 7B scale, and we call on more people to help test larger models.
Github: https://github.com/feizc/Visual-LLaMA
- [x] fine-tuning scripts and hyper-parameter settings
- [x] datasets for fine-grained alignment and instruction tuning
- [x] interactive Gradio demo and visual chatbot
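For readers unfamiliar with the setup, here is a minimal sketch of the connector idea described above: frozen CLIP image features are projected into the LLaMA embedding space and prepended as visual prefix tokens before the text embeddings. This is not the repo's actual code; the module name `VisualConnector` and the dimensions (1024 for CLIP ViT-L/14, 4096 for LLaMA-7B) are assumptions for illustration.

```python
# Hypothetical sketch of a CLIP-to-LLaMA connector; names and dims are assumptions,
# not the Visual-LLaMA repo's actual API.
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Two-layer MLP mapping CLIP patch features to LLaMA token embeddings."""
    def __init__(self, clip_dim: int = 1024, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_patches, clip_dim) from a frozen CLIP ViT
        return self.proj(clip_features)  # (batch, num_patches, llama_dim)

# Usage: prepend projected visual tokens to the embedded text prompt.
connector = VisualConnector()
clip_features = torch.randn(1, 257, 1024)   # e.g. CLIP ViT-L/14 patch features
text_embeds = torch.randn(1, 32, 4096)      # embedded text prompt tokens
visual_prefix = connector(clip_features)
inputs_embeds = torch.cat([visual_prefix, text_embeds], dim=1)
# inputs_embeds would then be fed to LLaMA (e.g. via an `inputs_embeds` argument),
# with the connector being the main trainable part during alignment.
```

The design choice is that only the small connector needs to be trained for alignment, so a frozen vision encoder and a frozen (or lightly fine-tuned) LLM can be reused as-is.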
Cool, does it show the ability to understand spatial relationships like GPT-4?
I observed that mPLUG-Owl, which was updated recently, has such capability. It's exciting!
Repo can be found here: https://github.com/X-PLUG/mPLUG-Owl
I am interested in this issue
@feizc, this is great. Thanks for sharing your work and contributing!