[Usage] How can I implement few-shot learning on LLaVA?
Describe the issue
Hi there,
I have some images with custom explanations, and I want to use few-shot learning to generate summaries of my images.
This is my current implementation:
templates = [
    {
        "url": "",
        "explain": """""",
    },
    {
        "url": "",
        "explain": """""",
    },
    {
        "url": "",
        "explain": """""",
    },
    {
        "url": "",
        "explain": """""",
    },
    {
        "url": "",
        "explain": """""",
    },
]
My code to build the prompt:
from PIL import Image
import cv2
import numpy as np
import requests
"""Make image summary"""
img_prompt = "User: <image>\n" + "\nASSISTANT:"
prompt = (
    "You are an assistant tasked with summarizing images for retrieval. "
    "These summaries will be embedded and used to retrieve the raw image. "
    "Give a concise summary of the image that is well optimized for retrieval."
)
print(prompt)
images = []
for i, temp in enumerate(templates):
    image_i = Image.open(requests.get(temp['url'], stream=True).raw)
    explain_i = temp["explain"]
    example_i = f"\nUser: <image{i}>" + "\nASSISTANT:" + explain_i + "\n"
    prompt += example_i
    images.append(image_i)
prompt += f"\nUser: <image{len(templates)}>" + "\nASSISTANT:"
print(prompt)
print('-'*100)
print("Examples:", len(images))
Inference:
target = Image.open("figures/figure-2-5.jpg")
out = model_multi_modals(
    images=images + [target],
    prompt=prompt,
    generate_kwargs={"max_new_tokens": 2048},
)
And my error:
ValueError: The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
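For context, the definition of model_multi_modals is not shown above; presumably it is a Hugging Face image-to-text pipeline wrapping a llava-hf checkpoint. A minimal sketch of such a setup (the checkpoint name is an assumption, not something stated in the issue):

from transformers import pipeline

# Assumption: an image-to-text pipeline around a LLaVA checkpoint from the llava-hf hub
model_multi_modals = pipeline(
    "image-to-text",
    model="llava-hf/llava-1.5-7b-hf",  # hypothetical checkpoint choice
)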
In-context learning or fine-tuning
That's an excellent question. Similar to OpenAI's GPT models, which can be enhanced through a few-shot approach, it would be fantastic if we could apply the same method to these pre-trained models. @haotian-liu
Has this been solved? I use SGLang for batch inference, and I also need this feature for ICL, multi-turn discussions, and few-shot prompting.
I think the error is caused by the image token. In the prompt, the image token should be given as:
<image>
and not as an image id or image index. I got a similar error in my setup for multi-prompt.
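A minimal sketch of the prompt construction rewritten that way: repeat the literal <image> token once per image, in the same order the images are passed to the pipeline (variable names follow the snippet in the question):

from PIL import Image
import requests

prompt = (
    "You are an assistant tasked with summarizing images for retrieval. "
    "These summaries will be embedded and used to retrieve the raw image. "
    "Give a concise summary of the image that is well optimized for retrieval."
)
images = []
for temp in templates:
    # one literal <image> placeholder per example image, no numeric index
    prompt += "\nUser: <image>\nASSISTANT: " + temp["explain"] + "\n"
    images.append(Image.open(requests.get(temp["url"], stream=True).raw))

# the final query image gets its own <image> placeholder
prompt += "\nUser: <image>\nASSISTANT:"

Whether the model actually makes good use of several in-context images is a separate question; see the links below.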
BTW, the model is not capable of handling multiple images and prompts simultaneously, as is evident from the following conversations with the author and others.
https://discuss.huggingface.co/t/llava-multi-image-input-support-for-inference/68458
https://github.com/haotian-liu/LLaVA/issues/197#:~:text=Due%20to%20the%20current%20way%20of%20training%2C%20we%20do%20not%20observe%20the%20model%20having%20very%20good%20capability%20referring%20to%20/%20comparing%20with%20multiple%20images.%20We%20are%20working%20on%20improving%20this%20aspect%20as%20well%2C%20stay%20tuned!
https://github.com/haotian-liu/LLaVA/issues/57#:~:text=Due%20to%20the%20current%20way%20of%20training%2C%20we%20do%20not%20observe%20the%20model%20having%20very%20good%20capability%20referring%20to%20/%20comparing%20with%20multiple%20images.
https://huggingface.co/YouLiXiya/tinyllava-v1.0-1.1b-hf/discussions/1#:~:text=The%20training%20is%20based%20on%20a%20single%20image.%20Multiple%20images%20are%20not%20supported
Hi guys, you can use our codebase for ICL: https://github.com/ys-zong/VL-ICL