improve LLaMA for visual understanding like GPT-4
Thanks for the great work!
We have tried to extend the LLaMA model to understand visual information and support multi-modal chat. We are inspired by the idea that a good ViT (e.g., the CLIP vision encoder) and a well-trained large language model (e.g., LLaMA), joined by a small connection network (e.g., an MLP or Transformer), can cover visual applications, as in PaLM-E (a minimal sketch of this connector idea follows the checklist below).
The results on image captioning, VQA, and other multi-modal tasks are promising at the 7B scale, and we call on more people to help test larger models.
Github: https://github.com/feizc/Visual-LLaMA
- [x] fine-tuning scripts and hyper-parameter settings
- [x] datasets for fine-grained alignment and instruction tuning
- [x] interactive Gradio demo and visual chatbot
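For readers unfamiliar with the setup, here is a minimal sketch of the connector idea described above: frozen CLIP image features are projected into the LLaMA embedding space and prepended as visual prefix tokens before the text embeddings. This is not the repo's actual code; the module name `VisualConnector` and the dimensions (1024 for CLIP ViT-L/14, 4096 for LLaMA-7B) are assumptions for illustration.

```python
# Hypothetical sketch of a CLIP-to-LLaMA connector; names and dims are assumptions,
# not the Visual-LLaMA repo's actual API.
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Two-layer MLP mapping CLIP patch features to LLaMA token embeddings."""
    def __init__(self, clip_dim: int = 1024, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_patches, clip_dim) from a frozen CLIP ViT
        return self.proj(clip_features)  # (batch, num_patches, llama_dim)

# Usage: prepend projected visual tokens to the embedded text prompt.
connector = VisualConnector()
clip_features = torch.randn(1, 257, 1024)   # e.g. CLIP ViT-L/14 patch features
text_embeds = torch.randn(1, 32, 4096)      # embedded text prompt tokens
visual_prefix = connector(clip_features)
inputs_embeds = torch.cat([visual_prefix, text_embeds], dim=1)
# inputs_embeds would then be fed to LLaMA (e.g. via an `inputs_embeds` argument),
# with the connector being the main trainable part during alignment.
```

The design choice is that only the small connector needs to be trained for alignment, so a frozen vision encoder and a frozen (or lightly fine-tuned) LLM can be reused as-is.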
Cool, does it show the ability to understand spatial relationships like GPT-4?
I observed that mPLUG-Owl, which was updated recently, has such capability. It's exciting!
Repo can be found here: https://github.com/X-PLUG/mPLUG-Owl
I am interested in this issue
@feizc, this is great. Thanks for sharing your work and contributing!