
[Feature] How to do in-context learning (few-shot learning) with InternVL2

Open Claude-Liu opened this issue 1 year ago • 4 comments

Motivation

I noticed that internvl_chat/eval/evaluate_vqa.py has parameters for few-shot learning, but they have not been implemented correctly.

My question is: how can we do few-shot learning with InternVL2?

  1. Should we use multi-round conversation, as shown at https://huggingface.co/OpenGVLab/InternVL2-8B, to do few-shot learning?

```python
# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```

  2. Can InternVL2 process interleaved image-text input in a single round, like Qwen2-VL? Here is an example of interleaved image-text input, where `<image>` marks where each image goes (see the sketch after this list): `<image>\nDescribe the two images in detail. the answer is: xxx \n <image>\nDescribe the two images in detail. the answer is: yyy \n <image>\nDescribe the two images in detail.`
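Concretely, I imagine option 2 would reuse the multi-image interface from the model card (one `<image>` placeholder per image plus `num_patches_list`). Below is a rough, untested sketch of a 2-shot interleaved prompt built that way; it assumes `model`, `tokenizer`, `generation_config`, and the `load_image` helper are set up exactly as in the InternVL2-8B model card, and the shot images, questions, and answers are made-up placeholders.

```python
import torch

# Hypothetical 2-shot demonstrations plus a query; the file names, questions,
# and answers are placeholders, not real data.
shots = [
    ('shot1.jpg', 'What animal is in the image?', 'a dog'),
    ('shot2.jpg', 'What animal is in the image?', 'a cat'),
]
query_image, query_question = 'query.jpg', 'What animal is in the image?'

pixel_values_list, num_patches_list, prompt = [], [], ''
for img, q, a in shots:
    pv = load_image(img, max_num=12).to(torch.bfloat16).cuda()
    pixel_values_list.append(pv)
    num_patches_list.append(pv.size(0))
    # One <image> placeholder per demonstration image, followed by its question and gold answer.
    prompt += f'<image>\n{q} {a}\n'

# The query image and question come last, with no answer.
pv = load_image(query_image, max_num=12).to(torch.bfloat16).cuda()
pixel_values_list.append(pv)
num_patches_list.append(pv.size(0))
prompt += f'<image>\n{query_question}'

# All images are concatenated along the patch dimension; num_patches_list tells
# chat() how many patches belong to each <image> placeholder.
pixel_values = torch.cat(pixel_values_list, dim=0)
response = model.chat(tokenizer, pixel_values, prompt, generation_config,
                      num_patches_list=num_patches_list)
print(response)
```

If I understand the model card correctly, the multi-round route from option 1 would similarly need all demonstration images passed in `pixel_values` at every turn, since `history` appears to carry only text.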

Related resources

No response

Additional context

No response

Claude-Liu avatar Sep 06 '24 07:09 Claude-Liu

Hello @Claude-Liu !

How did you solve the issue? Have you found a way to perform multimodal ICL with InternVL2? Can you provide an example?

agadetsky avatar Sep 07 '24 22:09 agadetsky

Hi, I tried to add an ICL feature to evaluate_vqa.py as shown below. However, the 1-, 3-, and 5-shot settings decrease the performance of InternVL2-8B on TextVQA, VizWiz, and OK-VQA. I am reopening this issue and welcome discussions and possible solutions from the repo maintainers.


```python
# Requires json, os, random, and torch at module level, plus the repo's load_image helper.

# Build the in-context prefix; it stays empty in the zero-shot case.
few_shot_prompt = ''
if self.few_shot > 0:
    pixel_values_list = []
    num_patches_list = []
    # Randomly draw `few_shot` demonstrations from the training split
    # (self.train holds one JSON-encoded annotation per line).
    few_shot_samples = random.sample(self.train, self.few_shot)
    for sample in few_shot_samples:
        sample = json.loads(sample.strip())
        # Each demonstration: <image>\n{question} {instruction}\n{answer}\n
        few_shot_prompt += self.image_pad + "\n" + sample["question"] + ' ' + self.prompt + "\n" + sample['answer'] + "\n"
        pixel_values_ = load_image(os.path.join('/mnt/workspace/liulf/', sample['image']),
                                   input_size=self.input_size, max_num=self.max_num)
        pixel_values_list.append(pixel_values_)
        num_patches_list.append(pixel_values_.size(0))

if self.few_shot == 0:
    pixel_values = load_image(image, input_size=self.input_size, max_num=self.max_num)
else:
    # Append the query image and concatenate everything along the patch dimension;
    # num_patches_list records how many patches belong to each <image> placeholder.
    pixel_values_ = load_image(image, input_size=self.input_size, max_num=self.max_num)
    pixel_values_list.append(pixel_values_)
    num_patches_list.append(pixel_values_.size(0))
    pixel_values = torch.cat(pixel_values_list, dim=0)

if len(self.prompt) != 0:
    question = self.image_pad + "\n" + question + ' ' + self.prompt

# Prepend the demonstrations to the final question.
if len(few_shot_prompt) != 0:
    question = few_shot_prompt + ' ' + question
```
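
For reference, the final `question` string this builds for a 1-shot run looks roughly like the following toy reconstruction (the question/answer pair is made up, and I'm assuming `self.image_pad` is `<image>` and `self.prompt` is the usual VQA instruction):

```python
# Toy reconstruction of the assembled 1-shot prompt; image_pad and instruction
# are assumed values, not ones read from the eval config.
image_pad = '<image>'
instruction = 'Answer the question using a single word or phrase.'

few_shot_prompt = image_pad + "\n" + 'What animal is shown?' + ' ' + instruction + "\n" + 'a dog' + "\n"
question = image_pad + "\n" + 'What animal is shown?' + ' ' + instruction
question = few_shot_prompt + ' ' + question
print(question)
# <image>
# What animal is shown? Answer the question using a single word or phrase.
# a dog
#  <image>
# What animal is shown? Answer the question using a single word or phrase.
```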


Claude-Liu avatar Sep 10 '24 03:09 Claude-Liu

We also noticed this issue while developing the OmniCorpus dataset.

A straightforward solution is to organize the VQA data into an in-context learning format during the training phase. This can alleviate the issue to some extent, but there is still a performance drop in the 32-shot setting. We believe this might be because ICL changes the distribution of the model's outputs, causing the predicted results to fail to match the ground truth through string matching, even when the semantics are correct.
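
As a toy illustration of what we mean (not our actual evaluation code): VQA-style scorers typically apply only light normalization before exact matching, so an ICL-shifted output such as "The answer is a dog." scores zero against the ground truth "dog" even though it is semantically correct.

```python
import re

def normalize(ans: str) -> str:
    # Rough VQA-style normalization: lowercase, drop a trailing period and articles.
    ans = ans.lower().strip().rstrip('.')
    ans = re.sub(r'\b(a|an|the)\b', ' ', ans)
    return ' '.join(ans.split())

print(normalize('a dog') == normalize('dog'))                 # True: matches after normalization
print(normalize('The answer is a dog.') == normalize('dog'))  # False: extra words break the exact match
```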

We believe that a more comprehensive benchmark is needed to evaluate the multimodal in-context learning ability, as the few-shot VQA setting does not adequately reflect the model's capability.

Weiyun1025 avatar Sep 10 '24 16:09 Weiyun1025

Thank you for your quick and clear response!

"We believe this might be because ICL changes the distribution of the model's outputs, causing the predicted results to fail to match the ground truth through string matching, even when the semantics are correct."

Does that mean that the format of the outputs under ICL changes (for example, they gain prefixes like "let me answer your question"), or that they merely fail to exactly match the 10 ground-truth answers while keeping the correct semantics?

Could you please share some bad cases of this kind?

Thank you sincerely!

Claude-Liu avatar Sep 11 '24 03:09 Claude-Liu