
improve LLaMA for visual understanding like GPT-4

feizc opened this issue

Thanks for the good work!

We have been working to extend the LLaMA model to understand visual information and support multi-modal chat. Our inspiration is that a good ViT, e.g., the CLIP vision encoder, and a well-trained large language model, e.g., LLaMA, joined by a connection network, e.g., an MLP or a Transformer, can cover visual applications, as in PaLM-E.

The results on image captioning, VQA, and other multi-modal tasks are promising at the 7B scale, and we invite more people to help test larger models.

Github: https://github.com/feizc/Visual-LLaMA

  • [x] fine-tuning scripts and hyper-parameter settings
  • [x] datasets for fine-grained alignment and instruction tuning
  • [x] interactive Gradio demo and visual chatbot
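
The connection-network idea described above can be sketched roughly as follows. This is an illustrative toy example, not the actual Visual-LLaMA code: the dimensions (768 for a CLIP ViT-L/14 patch feature, 4096 for a LLaMA-7B token embedding) and the two-layer MLP projector are assumptions for the sake of the sketch.

```python
import numpy as np

# Assumed dims: CLIP ViT-L/14 patch features are 768-d;
# LLaMA-7B token embeddings are 4096-d. The hidden size is arbitrary.
CLIP_DIM, HIDDEN_DIM, LLAMA_DIM = 768, 1024, 4096

rng = np.random.default_rng(0)
# Randomly initialized projector weights; in practice these are trained
# during the alignment / instruction-tuning stages.
W1 = rng.standard_normal((CLIP_DIM, HIDDEN_DIM)) * 0.02
W2 = rng.standard_normal((HIDDEN_DIM, LLAMA_DIM)) * 0.02

def project(patch_features):
    """Two-layer MLP mapping visual patch features into the
    language model's token-embedding space."""
    h = np.maximum(patch_features @ W1, 0.0)  # ReLU non-linearity
    return h @ W2

# 16 visual patches become 16 "visual tokens" that can be prepended
# to the text-token embeddings fed into the language model.
patches = rng.standard_normal((16, CLIP_DIM))
visual_tokens = project(patches)
print(visual_tokens.shape)  # (16, 4096)
```

Once projected, the visual tokens are concatenated with the embedded text prompt, so the frozen (or lightly tuned) language model attends over image and text jointly.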

feizc avatar Apr 05 '23 08:04 feizc

Cool, does it show an understanding of spatial relationships like GPT-4?

archytasos avatar Apr 11 '23 12:04 archytasos

> Cool, does it show an understanding of spatial relationships like GPT-4?

I observed that mPLUG-Owl, which was updated recently, has this capability. It's exciting!

Repo can be found here: https://github.com/X-PLUG/mPLUG-Owl

vateye avatar Apr 27 '23 13:04 vateye

I am interested in this issue.

shkr avatar Jul 10 '23 10:07 shkr

@feizc, this is great. Thanks for sharing your work and contributing!

ejsd1989 avatar Sep 06 '23 17:09 ejsd1989